

Tree-based ensemble methods and
their applications in analytical
chemistry

Dong-Sheng Cao, Qing-Song Xu, Liang-Xiao Zhang, Jian-Hua
Huang, Yi-Zeng Liang


Data from high-throughput analytical instruments have become increasingly large and complex, bringing a number of challenges to statistical modeling. To understand such complex data, new, statistically efficient approaches are urgently needed to:
(1) select salient features from the data;
(2) discard uninformative data;
(3) detect outlying samples in data;
(4) visualize existing patterns of the data;
(5) improve the prediction accuracy of the data; and, finally,
(6) feed back to the analyst understandable summaries of information
from the data.
We review current developments in tree-based ensemble methods to
mine effectively the knowledge hidden in chemical and biology data. We
report on applications of these algorithms to variable selection, outlier
detection, supervised pattern analysis, cluster analysis, tree-based
kernel and ensemble learning.
Through this report, we wish to inspire chemists to take greater
interest in decision trees and to obtain greater benefits from using the
tree-based ensemble techniques.

Keywords: Chemometrics; Classification and regression tree (CART); Cluster analysis;
Complex data; Ensemble algorithm; Kernel method; Outlier detection; Pattern analysis;
Tree-based ensemble; Variable selection



Dong-Sheng Cao, Jian-Hua Huang, Yi-Zeng Liang*

Research Center of Modernization of Traditional Chinese Medicines, Central South
University, Changsha 410083, P. R. China

Qing-Song Xu

School of Mathematics and Statistics, Central South University, Changsha 410083, P. R.

China

Liang-Xiao Zhang

Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical
Physics, Chinese Academy of Sciences, Dalian 116023, China



Corresponding author.
Tel.: +86 731 88830824; Fax: +86 731 88830831;
E-mail: yizeng_liang@263.net



1. Introduction

Traditionally, multivariate statistical techniques, including partial least squares
(PLS), principal-component analysis (PCA) and Fisher discriminant analysis (FDA),
play an important role in chemistry [1]. They have been widely used in analytical
chemistry for modeling and data analysis. However, these approaches have been greatly challenged by the emergence of more complex data from
modern high-throughput analytical instruments, which usually show the following
characteristics:
(1) Outliers are commonly generated by experimental errors or uncontrolled factors;
(2) Chemical and biological data from high-throughput analytical experiments have
a large number of variables, most of which are irrelevant to or would even
interfere with our analysis. Also, the sample size is comparatively small. This is
the so-called "large p, small n" problem that has proved to be very challenging
in statistical learning;
(3) Most chemical and biological data usually contain many different patterns,
which may represent certain chemical or biological mechanisms of action or
features. Representative examples include metabolic track analysis of
metabolomics data, including different metabolic patterns in different stages
(e.g., patients treated with drugs at different stages) and
protein-structure/function prediction (e.g., α-helix, β-sheet, α+β, α/β);
(4) Almost all current chemical and biological problems are linearly inseparable due
to prolongation and expansion of analytical systems, particularly for emergence
of various omics studies (i.e., genomics, proteomics and metabolomics) and
systems biology. This inevitably calls for the development of more accurate
prediction models, which should be able to efficiently cope with nonlinearity in
data.
Complex data also characteristically contain extensive missing values, mixtures of different data types, multiple classes, and badly unbalanced data sets.

Furthermore, these characteristics cross-link together to form more complex
problems, rendering our analysis more difficult and challenging. As an example,
significant interaction between outlier detection and variable selection was observed
in our previous study [2]. Likewise, we also found that patterns in data underlying
the specific space of measurements could be influenced by irrelevant measurements
existing in data.
Currently, there is a trend to introduce new mathematical techniques more extensively into chemical studies of complex data. Over the past decade, a spectrum of new techniques, known as tree-based ensemble methods, has emerged to solve problems in chemistry and biology. They have proved computationally efficient and widely applicable [3-5] compared with existing popular methods, especially traditional chemometric methods.
The theory of decision trees was extensively developed in the 1980s, when two popular tree algorithms were proposed: classification and regression trees (CART) by Breiman et al. [6] and C4.5 by Quinlan [7]. The great advantage of the decision tree is that it not only builds a readily interpretable model but also performs automatic stepwise variable selection and complexity reduction. Decision trees became popular in chemistry, biology and other fields of science after publication of the important papers by Breiman [8] and Freund and Schapire [9], in which ensemble versions (e.g., bagging and boosting) of the decision tree were proposed, so that its disadvantages (e.g., low accuracy and instability) were overcome or even successfully turned into advantages.
In 2002 and later, several excellent reference books on decision tree were
published [10,11]. In recent years, applications of tree-based ensemble methods to
various fields of chemistry and biology were introduced by different workers
[12-15].
Here, we mainly focus on CART; readers who would like to learn about C4.5 can refer to the related literature.
In this article, we give a brief introduction to the theory and algorithm of the decision tree, and we review applications in chemistry and biology to draw chemists' attention to tree-based ensemble methods and to the benefits of using these techniques.


2. Theory and algorithm

2.1. Classification and regression tree (CART)
Classification and regression tree (CART), proposed by Breiman et al., is a
non-parametric statistical technique. The goal of CART is to explain response y by
selecting some useful independent variables from a large pool of variables (see Fig.
1). The CART procedure is generally made up of three steps.
In the first step, the full tree is built using a binary split procedure. Starting from the root node including all training samples (e.g., node 1 in Fig. 1B), a yes-no question on some variable x^(j) is asked, and the samples for which the answer is yes (x^(j) ≤ θ, where θ is a parameter of the model) are assigned to the left branch and the others (x^(j) > θ) are assigned to the right branch. For example, node 1 including all samples is further split into node 2 (x_1 ≤ 0.335) and node 3 (x_1 > 0.335) by asking about the first variable x_1 (see Fig. 1B). Thus, a parent node is divided into two child nodes, each of which can then be subdivided independently. At each split, the variable to be split and the split parameter are determined by maximizing the impurity decrease [see Equation (3) in sub-section 2.2]. The above process is repeated until some stopping criterion is reached (e.g., each node contains at least a predetermined number of samples or only samples of one class). Thus, each sample is assigned to one of the terminal nodes (terminal nodes refer to nodes without child nodes) based on the answers to the questions.
The full tree is usually an overgrown model, which closely describes the training
set and generally shows overfitting (i.e. training samples are split well but prediction
for unknown samples is poor).
In the second step of CART, the full tree is pruned to overcome this overfitting. Pruning leads to several different sub-trees with smaller numbers of terminal nodes. Pruning starts from the bottom of the tree. In each
pruning step, a pair of terminal nodes from a parent node is pruned away (e.g.,
terminal nodes 6 and 7 can be simultaneously pruned away so that node 5 becomes a
new terminal node). We repeat the pruning step several times to obtain a set of
sub-trees with different terminal nodes.
The third step of CART is to select an optimal sub-tree based on the quality of
prediction for new samples. This is done according to the cost-complexity criterion C:

C = Q + αL    (1)

where Q is the misclassification cost (e.g., the misclassification rate) of the sub-tree generated by pruning, and the regularization parameter α determines the trade-off between the misclassification cost and the model complexity measured by L (i.e. the number of terminal nodes). The value of α can usually be determined by an independent validation set or by cross-validation.
As Breiman pointed out, CART carries out a thorough, though not optimal, search in two spaces (i.e. variable space and sample space): variable sub-sets relevant to classification, and sample sub-sets with specific similarity under different variable subspaces (see Fig. 1). That is, on the one hand, it can easily select informative components from hundreds to thousands of variables. On the other hand, it can also recursively divide the total sample space into several rectangular areas with specific similarity (e.g., the samples under the same terminal node; see also the gray region in Fig. 1A).
CART intrinsically produces homogeneous sub-sets of data. However, the CART procedure is usually unstable: a slight fluctuation in the data often leads to a completely different tree structure reflecting different ad hoc rules in the data. Hence, collecting and compiling plenty of trees by an ensemble strategy helps to look deeply into the intrinsic structure of the data.
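The three-step procedure can be illustrated in a few lines of code. The following is a minimal sketch assuming scikit-learn and the wine benchmark data, which are our own illustrative choices rather than part of the original work: a full tree is grown, the cost-complexity path of Equation (1) supplies the candidate values of α, and the sub-tree is selected on an independent validation set.

```python
# Minimal sketch of the three-step CART procedure (grow, prune, select) using
# scikit-learn; the toolkit and data set are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: grow the full (overgrown) tree with binary splits.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: the cost-complexity path gives the nested sequence of sub-trees,
# indexed by the regularization parameter alpha in C = Q + alpha * L (Eq. 1).
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Step 3: select the sub-tree (i.e. the alpha) that predicts best on an
# independent validation set.
scores = []
for alpha in path.ccp_alphas:
    sub_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append(sub_tree.score(X_val, y_val))
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
optimal_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("selected alpha:", best_alpha, "validation accuracy:", max(scores))
```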


2.2. Variable-importance ranking
In a tree classifier containing L terminal nodes, the importance of a candidate variable x^(j) can be evaluated as follows:

J(x^(j)) = Σ_{m=1}^{L−1} Δ_m · I(v(m) = x^(j))    (2)

where v(m) denotes the split variable selected at node m. If variable x^(j) is selected as the splitting variable at node m, I(v(m) = x^(j)) = 1; otherwise, I(v(m) = x^(j)) = 0. The corresponding impurity decrease Δ_m evaluates the importance of the variable selected to split the region at node m. If parent node m is split into two child nodes (e.g., m_l and m_r), the impurity decrease Δ_m at node m is defined as:

Δ_m = i_m − p_l·i_{m_l} − p_r·i_{m_r}    (3)

where i_m, i_{m_l} and i_{m_r} are, respectively, the impurity at parent node m, the left child node and the right child node, and p_l and p_r are the proportions of samples that fall into the left and right child nodes, respectively. The node impurity is usually measured by the Gini index:

i = Σ_{t=1}^{T} p_t (1 − p_t)    (4)

where T is the number of classes and p_t is the proportion of samples of class t. Thus, all impurity decreases involving variable x^(j) are summed to obtain the importance measure for that variable.
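The importance measure of Equations (2)-(4) can be computed directly by walking a fitted tree. The sketch below relies on the node arrays exposed by a fitted scikit-learn tree (children_left, impurity, n_node_samples, feature); the toolkit and the wine data set are illustrative assumptions, and the decrease is accumulated exactly as defined in Equation (3), i.e. with proportions taken relative to the parent node.

```python
# Sketch of Eqs. (2)-(4): sum the impurity decreases Delta_m over all internal
# nodes at which variable x^(j) is the split variable.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_

importance = np.zeros(X.shape[1])
for m in range(t.node_count):
    left, right = t.children_left[m], t.children_right[m]
    if left == -1:                       # terminal node: no split variable
        continue
    n_m = t.n_node_samples[m]
    p_l = t.n_node_samples[left] / n_m   # proportion sent to the left child
    p_r = t.n_node_samples[right] / n_m  # proportion sent to the right child
    # Delta_m = i_m - p_l * i_{m_l} - p_r * i_{m_r}   (Eq. 3, Gini impurity i)
    delta_m = t.impurity[m] - p_l * t.impurity[left] - p_r * t.impurity[right]
    importance[t.feature[m]] += delta_m  # indicator I(v(m) = x^(j)) in Eq. (2)

print(np.argsort(importance)[::-1][:5])  # indices of the five top-ranked variables
```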

2.3. Sample-proximity matrix
We can construct a sample-proximity matrix by using the ensemble strategy to
establish the relationship between samples (see Fig. 2). Given a sample matrix X of size n × p, each row denotes a sample and each column a variable. The corresponding class labels are recorded in a vector y of size n × 1 with elements from 1 to T for the T-class case. To begin with, a sample-proximity matrix PROX of size n × n with all elements equal to 0 is generated. Then, with the help of a Monte-Carlo (MC) procedure, the samples of each class are randomly divided into training samples and validation samples. The training samples are combined to obtain the final training set [e.g., (X_train, y_train)]. Accordingly, the validation samples are combined to obtain the final validation set [e.g., (X_val, y_val)]. The training samples are first used to grow a classification tree, and the validation samples are then used to prune the overgrown tree to obtain the optimal pruning level (e.g., L_best). Instead of obtaining an optimal tree, a sub-optimal tree, which lies between the optimal and the overgrown tree, is constructed using a fuzzy pruning strategy: randomly generate a pruning level L between 1 and L_best, and then prune the overgrown tree to this level. The fuzzy pruning strategy helps to exploit effectively the information of internal nodes, without totally destroying the structure of the tree. After that, all samples (e.g., X) are predicted by the sub-optimal tree, so that each sample falls into one of the terminal nodes.
It is worth noting that the samples under the same terminal node possess some specific similarity, which is not limited to class similarity (i.e. the samples in one terminal node are more similar to each other than to those in other terminal nodes). If two samples i and j turn up in the same terminal node, the sample-proximity measure PROX(i, j) is increased by 1. Herein, PROX can be considered a tree-based similarity measure that reflects the degree of similarity among training samples.
We can repeat the above process many times (e.g., ntree times) to establish a large number of tree models, updating the sample-proximity matrix with the results of each tree. The bigger PROX(i, j) is, the more similar training samples i and j are. At the end of the construction, the sample-proximity matrix is normalized by dividing by the number of trees. Note that the proximity between a sample and itself is always set to 1 [i.e. PROX(i, i) = 1]. Likewise, the predictive proximity matrix PPROX, which reflects the similarity between training samples and new samples, can be constructed in a similar way. The concept of sample proximity is of great importance for understanding the applications of tree-based ensemble methods described below (e.g., outlier detection, pattern analysis, cluster analysis and tree-based kernels).
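A minimal sketch of the proximity construction is given below. It substitutes an off-the-shelf random forest for the Monte-Carlo division and fuzzy pruning described above (so it is an approximation of the idea, not the exact published procedure), and uses scikit-learn's apply method to read off the terminal node of every sample in every tree.

```python
# Minimal sketch of a sample-proximity matrix: PROX(i, j) is the fraction of
# trees in which samples i and j end up in the same terminal node. A random
# forest stands in for the Monte-Carlo / fuzzy-pruning scheme described above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
ntree = 500
forest = RandomForestClassifier(n_estimators=ntree, random_state=0).fit(X, y)

leaves = forest.apply(X)                 # shape (n_samples, ntree): terminal-node index per tree
n = X.shape[0]
PROX = np.zeros((n, n))
for b in range(ntree):
    same_leaf = leaves[:, b][:, None] == leaves[:, b][None, :]
    PROX += same_leaf                    # increase PROX(i, j) when i and j share a terminal node
PROX /= ntree                            # normalize by the number of trees; PROX(i, i) = 1

print(PROX[:3, :3])
```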


3. Applications of tree-based ensemble methods

Over the past few years, various theoretical studies and applications related to CART have been reported in analytical chemistry, mainly in QSAR/QSPR, mass spectrometry (MS), food analysis, -omics studies, and chromatography. CART was also reviewed recently [16]. Theoretical studies put more emphasis on improving the prediction accuracy and reducing the instability of CART [17]. Applying globally optimal algorithms to grow the optimal tree model was reported recently [18-20]. Variable selection based on CART is another hot research topic [21-23]. Among all papers regarding CART, those on ensemble variants make up a substantial part of the literature.

3.1. Ensemble-variable selection
The development of robust, high-throughput analytical techniques allows the
simultaneous measurement of hundreds to thousands of features on a single chemical
or biological sample. Thus, the data generated in studies are usually described by a
large number of highly correlated variables. In large feature/small sample size
problems, both model performance and robustness of the variable-selection process
are important, so a robust feature-selection technique is extensively needed and used
in analytical chemistry. Several methods for variable selection have been proposed,

including commonly used univariate statistical measures {e.g., F-statistic, Fishers
variance ratio and correlation coefficient), genetic algorithm (GA), particle swarm
optimization algorithm (PSO), stepwise regression regarding different criteria, the
lasso, uninformative variable elimination (UVE) [24]}.
The general procedure of the tree-based ensemble feature selection can be
summarized as follows:
(1) Apply Monte Carlo techniques to the original data Z = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} to obtain a training set and a validation set. Here, x_i, i = 1, 2, ..., n, denotes a sample, n is the number of samples and y_i is the corresponding class label or response value.
(2) Construct a decision tree model. The training set is used for growing a tree and
the validation set is used for pruning this tree. Feature importance and prediction
accuracy can be finally gained based on the current split.
(3) Repeat steps (1) and (2) a number of times (e.g., ntree = 1000) to obtain a large number of tree models. Simultaneously, record the variable importance J_b(x^(j)) and the prediction accuracy Acc_b for each tree b.
(4) Determine the final importance by averaging the importances weighted by the prediction accuracy of each tree (a minimal code sketch follows this list):

J(x^(j)) = (1/ntree) Σ_{b=1}^{ntree} J_b(x^(j)) · Acc_b    (5)
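The sketch below illustrates steps (1)-(4) and Equation (5), assuming scikit-learn; a fixed depth limit stands in for the validation-set pruning of step (2), and the breast-cancer benchmark data are only an illustration, so this is a simplified version of the procedure rather than the exact published one.

```python
# Sketch of Monte-Carlo ensemble variable selection, Eq. (5): average the
# per-tree importances weighted by each tree's validation accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ntree = 1000
J = np.zeros(X.shape[1])

for b in range(ntree):
    # Step (1): Monte Carlo division into training and validation sets.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=b)
    # Step (2): grow a (depth-limited) tree on the training set.
    tree = DecisionTreeClassifier(max_depth=5, random_state=b).fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)            # prediction accuracy Acc_b
    # Step (3): record the accuracy-weighted importance for this tree.
    J += tree.feature_importances_ * acc
# Step (4): final importance, Eq. (5).
J /= ntree
print(np.argsort(J)[::-1][:10])               # ten top-ranked variables
```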
Instability of the tree structure leads to an unstable variable-importance ranking. Analyzing high-throughput data by cross-prediction in Monte-Carlo schemes can intensively probe a large population of models. The informative features selected by individual tree models do not necessarily reflect their true behavior, since they are seriously affected by the random division into training and validation sets, especially for short, fat data (i.e., p >> n).
However, variable importance constructed by the ensemble strategy provides more objective information. Tree-based ensemble-variable selection can effectively guarantee that the selected variables reflect a real contribution to classification rather than arising by chance. The variable importance is thereby more stable and reliable [25,26]. Robust variable-selection techniques would allow domain experts to
have more confidence in the selected variables, as, in most cases, these variables are
subsequently analyzed further, requiring much time and effort, especially in
biomedical applications [27].
Ensemble-variable selection, in particular tree-based ensemble-variable selection,
has been developed and widely applied in chemistry and biology [28-33]. Using a
similar idea, a Monte Carlo feature-selection method for supervised classification
was proposed to select the most discriminant genes in leukemia and lymphoma
datasets [28]. Comparisons with the commonly-used statistical techniques in
chemometrics (e.g., genetic algorithms on multiple linear regression, UVE-PLS)
were reported on prediction performance and selected variables [32]. The stability of
ensemble-variable-selection techniques was also further studied using several
stability indices on microarray datasets and MS datasets [33].

3.2. Outlier detection
Detection and removal of outliers from measured data is an important step before modeling [34]. The interpretability and the prediction performance of a model built on data free of outliers can be improved. Effective algorithms have been developed for outlier detection, including the conventional Mahalanobis distance [35], Cook's distance and others [e.g., the minimum covariance determinant (MCD), robust PCA [36], resampling by half-means (RHM), the smallest half-volume (SHV) [37] and the MC method [38]].
We propose an outlier-detection method based on the sample proximity, PROX, and the predictive proximity, PPROX. Outliers are treated as samples having small proximities to all other samples. Since the data in some classes are more spread out than in others, outlyingness is defined only with respect to the other data in the same class as the given sample. For molecule i, a measure of outlyingness is computed as out(i) = [Σ_j PROX(i, j)²]^(−1), where the sum is over all j in the same class as molecule i. This quantity will be large if the proximities PROX(i, j) from i to the other molecules j in the same class are generally small. Next, out(i), i = 1, ..., n, is normalized by subtracting the median of the out values and dividing by the mean absolute deviation from the median; values less than zero are set to zero. Thus, the bigger out(i) is for some sample i, the more outlying sample i is from the main body of the class including sample i. Likewise, for test samples, the predictive proximity PPROX can be used to check their outlyingness with respect to the training set. However, somewhat differently, for test sample k, the outlyingness is calculated as out(k) = [Σ_j PPROX(k, j)²]^(−1), where the sum is taken over all j in a given class of the training set. The minimum of out(k) over the different classes in the training set is regarded as the measure of outlyingness for test sample k. Finally, an outlyingness rank can be obtained to evaluate the outlyingness of each sample.
Generally, if the measure out(k) is greater than 10, the sample should be carefully
inspected.
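The outlyingness measure can be sketched in a few lines, assuming that a proximity matrix PROX (e.g., from sub-section 2.3) and an integer array of class labels y are already available; the per-class normalization follows the description above.

```python
# Sketch of the proximity-based outlyingness measure: out(i) is the inverse of
# the sum of squared proximities to the other samples of the same class, then
# normalized by the within-class median and mean absolute deviation.
# PROX (n x n) and y (length-n integer class labels) are assumed inputs.
import numpy as np

def outlyingness(PROX, y):
    n = len(y)
    out = np.zeros(n)
    for i in range(n):
        same = (y == y[i]) & (np.arange(n) != i)     # other samples of the same class
        out[i] = 1.0 / np.sum(PROX[i, same] ** 2)    # out(i) = [sum_j PROX(i, j)^2]^(-1)
    for c in np.unique(y):                           # normalize within each class
        idx = (y == c)
        med = np.median(out[idx])
        mad = np.mean(np.abs(out[idx] - med))
        out[idx] = np.maximum((out[idx] - med) / mad, 0.0)
    return out

# Samples with a large value (e.g., above 10) deserve careful inspection.
```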
Outlier detection by decision tree is different from other distance-based outlier
detection methods (e.g., MCD and robust PCA). It is based on a novel tree-based
similarity metric. Outliers detected by tree-based similarity are not necessarily the
samples far away from the main body of the data (i.e. the outliers detected by robust
PCA). The philosophy underlying outlier detection by decision tree is that an outlier
is a case whose proximities to all other cases are small. By inspecting the similarity
relationships between the query point and all other samples, we could effectively
determine whether or not the query sample is an outlier/novelty.

3.3. Pattern analysis
Analyzing patterns hidden in data is necessary in various omics studies. In many
cases, researchers may be interested in the sample-proximity measure determined by
the sub-set of informative components, which are the most significant for
classification. Clear patterns in analytical data could subsequently be found by

sample proximity. Based on the sample-proximity matrix, a semi-supervised method
dedicated to pattern analysis, called Monte Carlo Tree (MCTree) [39], was
developed for -omics data analysis, and can be summarized as follows:
(1) Construct the sample-proximity matrix, as described in sub-section 2.3, using the omics data at hand. This measure of proximity has two advantages: it is supervised, because the class information dictates the structure of the trees; and, since irrelevant components contribute little to the tree ensemble, they have little influence on the proximity measure.
(2) A direct visualization for the proximity matrix, called sample-proximity plot,
can easily be made to give some insights into -omics samples. Moreover,
multi-dimensional scaling (MDS), as a non-linear mapping algorithm, can also
be employed to map the proximity into a lower-dimensional space. MDS starts
with a matrix of item-item similarities (e.g., PROX), and then assigns a location
to each item in p-dimensional space, where p is specified a priori. For
sufficiently small p (e.g., 2 or 3), the resulting locations can be displayed in a graph, as sketched below.
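The following is a minimal sketch of step (2); it assumes a proximity matrix PROX and class labels y from an earlier proximity construction, uses scikit-learn's MDS with a precomputed dissimilarity, and takes 1 − PROX as that dissimilarity (one common choice among several).

```python
# Sketch of step (2): visualize the sample-proximity matrix and map it to two
# dimensions with multi-dimensional scaling. PROX and y are assumed to be
# available from an earlier proximity construction (e.g., sub-section 2.3).
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

D = 1.0 - PROX                                   # tree-based dissimilarity
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.imshow(PROX, cmap="viridis")                 # sample-proximity plot (heat map)
ax1.set_title("sample proximity")
ax2.scatter(coords[:, 0], coords[:, 1], c=y)     # MDS plot colored by class label
ax2.set_title("MDS of tree-based proximity")
plt.show()
```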
In our previous work, MCTree was applied to uncover the underlying structure of two sets of metabolomics data (childhood obesity data with three patterns and chronic hepatic schistosomiasis data with six patterns). After the tuning parameters were suitably set, clear pictures reflecting metabolic track trends were visualized by the sample-proximity plot and the MDS plot. Meanwhile, the important metabolites contributing to these patterns could be discovered and better interpreted. Moreover, some interesting phenomena suggesting the effects of drug therapy could also be found. Comparison with PCA strongly indicated the superiority of MCTree in dealing with multiple-class data, especially when the data patterns are subject to serious interference from irrelevant variables with larger variations [39].
As an illustrative example, Fig. 3 shows pattern-analysis results using chronic
hepatic schistosomiasis data with only four patterns, in which A represents
metabolite-importance ranking, and B is the heat map of sample-proximity matrix
with hierarchical cluster analysis (HCA). It can easily be seen that clear patterns are found by MCTree. More interestingly, HCA shows that the two groups receiving drug therapy (c, d) are closer to each other than to the other two groups (a, b), and that the control group (a) is the farthest away.

3.4. Cluster analysis
In cluster analysis, the measured data consist of only a set of sample vectors without
class labels. There is no figure of merit to optimize, leaving the field open to
ambiguous conclusions. The usual goal is to cluster the data to see whether they fall into different groups, each of which can be assigned some specific meaning. Several methods for clustering have been proposed, based on different similarity/distance criteria.
Representative examples in chemistry include K-means, HCA, self-organizing map
neural network (SOM-NN) [40] and non-linear mapping (NLM) [41].
Due to the intrinsic clustering characteristic of trees, a different approach, called tree-cluster analysis (TCA), has been proposed to analyze the data structure. This method is

summarized as follows:
(1) TCA first considers the original data as class 1 and creates a synthetic second
class of the same size labeled as class 2. The synthetic second class is created by
sampling at random from the univariate distributions of the original data. The
first feature is sampled from the n values x
(1)
. The second feature is sampled
independently from the n values x
(2)
, and so forth. Thus, class 2 has the
distribution of independent random variables, each having the same univariate
distribution as the corresponding variable in the original data. Now there are two
classes by combining class 1 with class 2.
(2) The artificial two-class problem can be carried out to obtain sample-proximity
matrix. The part of sample-proximity matrix corresponding to class 1 (i.e. the
original data) is then extracted as further analysis by visualization, MDS and
HCA. Thus, the cluster trend in the original data can easily be discovered.
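The two TCA steps can be sketched as follows; a random forest again stands in for the tree ensemble, and the synthetic class is built by independently permuting each column of the original data, which is one way of sampling from the univariate distributions.

```python
# Sketch of tree-cluster analysis (TCA): label the real data as class 1, build
# a synthetic class 2 by sampling each variable independently (here by
# permuting each column), solve the artificial two-class problem with a tree
# ensemble, and keep the class-1 block of the proximity matrix for clustering.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, _ = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, p = X.shape

X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])  # class 2
X_all = np.vstack([X, X_synth])
y_all = np.array([1] * n + [2] * n)

ntree = 500
forest = RandomForestClassifier(n_estimators=ntree, random_state=0).fit(X_all, y_all)
leaves = forest.apply(X_all)
PROX_all = np.zeros((2 * n, 2 * n))
for b in range(ntree):
    PROX_all += leaves[:, b][:, None] == leaves[:, b][None, :]
PROX_all /= ntree

PROX_real = PROX_all[:n, :n]   # class-1 block: proximities among the original samples
# PROX_real can now be inspected by heat map, MDS or hierarchical clustering.
```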
The philosophy underlying clustering with the aid of TCA generally differs from that of other cluster techniques, in which the similarity or distance criteria are calculated directly. Instead, TCA constructs a sample-proximity matrix for the analyzed data by transforming cluster analysis into a classification problem. As an illustration of TCA for cluster analysis, the iris dataset of 150 samples from three species (50 setosa, 50 virginica and 50 versicolor) is employed. Fig. 4 shows the cluster results of TCA, displayed in two visual ways (the heat map of the sample proximity and the MDS plot). It can easily be seen that three clusters are clearly found by the TCA approach.
Apart from TCA, some cluster algorithms using CART, studied and applied to
chemical problems, were based on the ability of CART to generate homogeneous
data. A method, called auto-associative multivariate regression trees (AAMRT), was
proposed to cluster two sets of chemical data [42]. It was found that AAMRT
produced clusters of similar quality to the K-means technique and more recent
approaches to cluster analysis. Its ensemble variant was subsequently developed by
Smyth [43]. Using exactly the same idea, an HCA approach based on a set of PLS
models, called PLS-Trees [44], was presented and successfully applied to two QSAR
datasets and hyper-spectral images of liver tissue.

3.5. Tree-based kernel methods
A critical step in applying kernel methods to problems in chemistry (e.g., NIR analysis, QSAR/QSPR) is to select or devise an effective kernel suited to the scientific task at hand (e.g., graph kernels, string kernels, spectrum kernels, and pairwise kernels) [45]. A kernel inherently measures the similarity between samples. Once a suitable kernel is determined, we can combine some algorithms
(e.g., PCA, PLS, FDA) with this kernel to produce more flexible, powerful
algorithms [46,47].
By considering the sample-proximity matrix as a kernel matrix, a novel tree
kernel Fisher discriminant analysis (TKFDA) was developed [48]. A set of simulated
data and two sets of metabolomics data regarding impaired fasting glucose and
human diabetes were applied to demonstrate its validity. Comparisons with PLS,

FDA and support vector machine (SVM) based on radial basis kernel clearly suggest
that the tree kernel so obtained can effectively decrease the influence of
uninformative variables and thereby focus on the intrinsic measure of sample
similarity. Simultaneously, potential biomarkers can be successfully discovered by
variable-importance ranking. Moreover, TKFDA can also deal with the non-linear
relationship by such a kernel. This tree kernel is also equally applicable to other
algorithms (e.g., PLS and SVM). Recently, we proposed a tree-kernel SVM, combining a linear SVM with this tree kernel, to predict the pharmacokinetic properties of drugs. Satisfactory results were obtained compared with SVMs based on other commonly-used kernels.
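As a sketch of how such a tree kernel can be plugged into a kernel learner, the snippet below feeds precomputed proximity matrices to an SVM; the SVM stands in for the TKFDA of the text, and PROX_train, PPROX_test and y_train are assumed to come from a proximity construction such as that of sub-section 2.3.

```python
# Sketch of a tree-kernel classifier: the sample-proximity matrix is used as a
# precomputed kernel. PROX_train (n_train x n_train) and PPROX_test
# (n_test x n_train) are assumed inputs from an earlier proximity construction.
from sklearn.svm import SVC

tree_kernel_svm = SVC(kernel="precomputed", C=1.0)
tree_kernel_svm.fit(PROX_train, y_train)          # kernel between training samples
y_pred = tree_kernel_svm.predict(PPROX_test)      # kernel between test and training samples
```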

3.6. Tree-based ensemble learning
Although the CART technique has proved very useful in chemical-data analysis, it has two major drawbacks: inaccuracy and instability. However, ensemble learning, including bagging, random forest (RF) [49] and boosting [50], can effectively overcome these disadvantages. These techniques aim at combining the outputs of many relatively weak tree learners to produce a powerful committee, although the implementation details differ.
Breiman's bagging (short for bootstrap aggregating) is one of the earliest and simplest ensemble algorithms, with surprisingly good performance. The bagging tree builds tree ensembles (CART without pruning) from bootstrap samples (i.e. samples of the same size as the training set, drawn from it with replacement), and then combines all the classifiers into a single one. It is worth noting that many of the training samples may be repeated in the resulting bootstrap sample set, while others may be left out (approximately 1/e ≈ 37% of the samples are not present in a given bootstrap sample set). Bagging works especially well for unstable CART classifiers, in that it can dramatically reduce the variance of these algorithms with the help of averaging.
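The 1/e figure is easy to verify numerically; the snippet below simply counts, over repeated bootstrap draws, the fraction of samples that never appear in the draw.

```python
# Quick numerical check of the bootstrap claim: about 1/e (~37%) of the
# training samples do not appear in a given bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200
left_out = [len(set(range(n)) - set(rng.integers(0, n, size=n))) / n for _ in range(reps)]
print(np.mean(left_out), 1 / np.e)   # both are approximately 0.368
```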
RF directly derives from bagging and is seen as a variant of bagging. The only
difference between RF and bagging is that RF chooses the best split among a
randomly selected sub-set of mtry (mtry < p) variables, rather than all variables, at
each node. Thus, each tree in RF is constructed using the bootstrap samples of the
training data and random variable selection in tree induction. Final predictions are
made by aggregating (majority vote) the predictions of all analogous trees. RF
improves the performance of bagging by reducing the correlation of trees [11].
Boosting is another general, effective method for producing an accurate
prediction rule by combining rough and moderately inaccurate rules of thumb. The
algorithm generates a set of weak learners, and combines them through weighted
majority voting of the classes predicted by the weak learners. Each weak tree learner
is obtained using samples drawn from an iteratively updated distribution of the entire
training samples. Thus, boosting maintains a distribution or set of weights over the
original training set and adjusts these weights after each classifier is learned by weak
learners. The adjustments increase the weight of samples that are misclassified by
weak learners and decrease the weight of samples that are correctly classified. Hence,

consecutive classifiers gradually focus on those increasingly hard-to-classify
samples. Boosting can be considered as an iterative reweighting procedure by
sequentially applying a tree learner to reweighted versions of the training data,
whose current weights are modified based on how accurately the previous learners
predict these samples.
Fig. 5 shows a simulated example, indicating the potential capability of three
ensemble approaches. It can be seen from Fig. 5B that the performance of CART can
be greatly improved by ensemble learning (CART 8.2% versus boosting 3.7% and
RF 4.3%).
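In the same spirit as Fig. 5, the sketch below compares a single pruned tree with bagging, RF and boosting on a synthetic two-class problem; the data set, hyper-parameters and resulting error rates are illustrative assumptions and do not reproduce the figure.

```python
# Sketch in the spirit of Fig. 5: compare a single tree with bagging, random
# forest and boosting on a simulated two-class problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "CART": DecisionTreeClassifier(max_leaf_nodes=16, random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                                 random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    err = 1.0 - model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:14s} test error = {err:.3f}")
```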
Applications of these techniques are almost everywhere in analytical chemistry,
including, but not limited to, QSAR/QSPR, NIR analysis, -omics studies, MS
classification, and RNA/protein-function prediction. Recently, excellent reviews and further improvements have been reported [51-55]. As illustrative examples,
Massart et al. applied boosting to five chemical datasets (three NIR datasets,
including wheat data, cream data and green tea data, one HIV dataset and one
chromatographic retention dataset), and demonstrated significant performance
improvement compared to CART [12]. Gene selection and classification of
microarray data using RF was investigated [56]. Applying RF to identify biomarker
panels in serum-gel-electrophoresis data for the detection and the staging of prostate
cancer was proposed [57]. Performance comparison of boosting and RF with other
commonly-used modeling tools was also proposed in several QSAR/QSPR datasets
[53,54].
Although these procedures have been successfully used in several cases, there are
still some problems and limitations in practical use (e.g., the transparency and interpretability of the models). Despite the improvement in accuracy, they usually produce black-box models, which are difficult for users to understand. In most
cases, chemists or biologists may be more interested in chemical or biological
interpretation for the system analyzed. Further studies are needed to open these black
boxes.


4. Conclusion

Since 1990, various applications of decision trees and their ensemble variants in different fields of chemistry and biology have been extensively investigated. Apart from the studies mentioned in this article, tree-based ensemble methods share several outstanding advantages of the decision tree, including the ability to handle mixed or badly unbalanced datasets and extensive missing values, immunity to outliers and collinearity, and flexibility with no formal assumptions about the data structure; they can therefore cope effectively with more complex data. These studies have shown that tree-based ensemble techniques are very efficient, so they can play an important role in chemical and biological modeling. We hope that this article will stimulate broader interest in applications of decision trees. Further investigations of these topics are ongoing, and we expect decision-tree techniques to gain popularity among chemists as a widely-used approach in the future.

Acknowledgements
This study was supported by the National Natural Science Foundation of China
(Grants No. 21075138 and No. 20975115) and the International Cooperation Project
on Traditional Chinese Medicines of the Ministry of Science and Technology of
China (Grant No. 2007DFA40680). The studies met with the approval of the Review
Board of Central South University.

References
[1] S. Wold, M. Sjostrom, L. Eriksson, Chemometr. Intell. Lab. Syst. 58 (2001) 109.
[2] D.S. Cao, Y.Z. Liang, Q.S. Xu, Y.F. Yun, H.D. Li, J. Comput. Aid. Mol. Des. 25
(2011) 67.
[3] D.S. Palmer, N.M. O'Boyle, R.C. Glen, J.B.O. Mitchell, J. Chem. Inf. Model. 47
(2006) 150.
[4] M.H. Zhang, Q.S. Xu, D.L. Massart, Anal. Chem. 77 (2005) 1423.
[5] Z.-C. Li, Y.-H. Lai, L.-L. Chen, X. Zhou, Z. Dai, X.-Y. Zou, Anal. Chim. Acta
718 (2012) 32.
[6] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and
Regression Trees, Wadsworth International, CA, USA, 1984.
[7] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, USA, 1993.
[8] L. Breiman, Machine Learning 24 (1996) 123.
[9] Y. Freund, R.E. Schapire, J. Comput. Syst. Sci. 55 (1997) 119.
[10] S.D. Brown, A.J. Myles, D.B. Stephen, T. Roma, W. Beata, Comprehensive
Chemometrics, Elsevier, Oxford, UK, 2009, p. 541
[11] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning:
Data Mining, Inference and Prediction, Springer-Verlag, New York, USA, 2008.
[12] M.H. Zhang, Q.S. Xu, F. Daeyaert, P.J. Lewi, D.L. Massart, Anal. Chim. Acta
544 (2005) 167.
[13] D.-S. Cao, Q.-N. Hu, Q.-S. Xu, Y.-N. Yang, J.-C. Zhao, H.-M. Lu, L.-X. Zhang,
Y.-Z. Liang, Anal. Chim. Acta 692 (2011) 50.
[14] P. He, C.-J. Xu, Y.-Z. Liang, K.-T. Fang, Chemometr. Intell. Lab. Syst. 70 (2004)
39.
[15] D.-S. Cao, Y.-N. Yang, J.-C. Zhao, J. Yan, S. Liu, Q.-N. Hu, Q.-S. Xu, Y.-Z.
Liang, J. Chemometr. 26 (2012) 7.
[16] A.J. Myles, R.N. Feudale, Y. Liu, N.A. Woody, S.D. Brown, J. Chemometr. 18
(2004) 275.
[17] A.J. Myles, S.D. Brown, J. Chemometr. 17 (2003) 531.
[18] Y.-P. Zhou, L.-J. Tang, J. Jiao, D.-D. Song, J.-H. Jiang, R.-Q. Yu, J. Chem. Inf.
Model. 49 (2009) 1144.
[19] B. Hemmateenejad, M. Shamsipur, V. Zare-Shahabadi, M. Akhond, Anal. Chim.
Acta 704 (2011) 57.
[20] S. Izrailev, D. Agrafiotis, J. Chem. Inf. Model. 41 (2000) 176.

[21] F. Questier, R. Put, D. Coomans, B. Walczak, Y.V. Heyden, Chemometr. Intell.
Lab. Syst. 76 (2005) 45.
[22] M.P. Gomez-Carracedo, J.M. Andrade, G.V.S.M. Carrera, J. Aires-de-Sousa, A.
Carlosena, D. Prada, Chemometr. Intell. Lab. Syst. 102 (2010) 20.
[23] B. Debska, B. Guzowska-Swider, Anal. Chim. Acta 705 (2011) 261.
[24] V. Centner, D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C.
Sterna, Anal. Chem. 68 (1996) 3851.
[25] Z. He, W. Yu, Comput. Biol. Chem. 34 (2010) 215.
[26] D. Dutta, R. Guha, D. Wild, T. Chen, J. Chem. Inf. Model. 47 (2007) 989.
[27] Y. Saeys, I. Inza, P. Larranaga, Bioinformatics 23 (2007) 2507.
[28] M. Draminski, A. Rada-Iglesias, S. Enroth, C. Wadelius, J. Koronacki, J.
Komorowski, Bioinformatics 24 (2008) 110.
[29] L. Auret, C. Aldrich, Chemometr. Intell. Lab. Syst. 105 (2011) 157.
[30] D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, X. Chen, H.-D. Li, Chemometr. Intell. Lab.
Syst. 103 (2010) 129.
[31] P.M. Granitto, C. Furlanello, F. Biasioli, F. Gasperi, Chemometr. Intell. Lab.
Syst. 83 (2006) 83.
[32] T. Hancock, R. Put, D. Coomans, Y. Vander Heyden, Y. Everingham,
Chemometr. Intell. Lab. Syst. 76 (2005) 185.
[33] Y. Saeys, T. Abeel, Y. Van de Peer, W. Daelemans, B. Goethals, K. Morik,
Robust Feature Selection Using Ensemble Feature Selection Techniques,
Springer-Verlag, Berlin, Germany, 2008, p. 313.
[34] Y.-Z. Liang, O.M. Kvalheim, Chemometr. Intell. Lab. Syst. 32 (1996) 1.
[35] R. De Maesschalck, D. Jouan-Rimbaud, D.L. Massart, Chemometr. Intell. Lab.
Syst. 50 (2000) 1.
[36] P.J. Rousseeuw, M. Debruyne, S. Engelen, M. Hubert, Crit. Rev. Anal. Chem. 36
(2006) 221.
[37] W.J. Egan, S.L. Morgan, Anal. Chem. 70 (1998) 2372.
[38] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, H.-D. Li, X. Chen, J. Comput. Chem. 31
(2010) 592.
[39] D.-S. Cao, B. Wang, M.-M. Zeng, Y.-Z. Liang, Q.-S. Xu, L.-X. Zhang, H.-D. Li,
Q.-N. Hu, Analyst (Cambridge, UK) 136 (2011) 947.
[40] K. Wongravee, G.R. Lloyd, C.J. Silwood, M. Grootveld, R.G. Brereton, Anal.
Chem. 82 (2009) 628.
[41] M. Daszykowski, B. Walczak, D.L. Massart, Chemometr. Intell. Lab. Syst. 65
(2003) 97.
[42] C. Smyth, D. Coomans, Y. Everingham, T. Hancock, Chemometr. Intell. Lab.
Syst. 80 (2006) 120.
[43] C. Smyth, D. Coomans, J. Chemometr. 21 (2007) 364.
[44] L. Eriksson, J. Trygg, S. Wold, J. Chemometr. 23 (2009) 569.
[45] D.S. Cao, J.C. Zhao, Y.N. Yang, C.X. Zhao, J. Yan, S. Liu, Q.N. Hu, Q.S. Xu,
Y.Z. Liang, Environ. Res. 23 (2012) 141.
[46] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, G.-H. Fu, Chemometr.
Intell. Lab. Syst. 107 (2011) 106.

[47] D.-S. Cao, J.-H. Huang, J. Yan, L.-X. Zhang, Q.-N. Hu, Q.-S. Xu, Y.-Z. Liang,
Chemometr. Intell. Lab. Syst. 114 (2012) 19.
[48] D.S. Cao, M.M. Zeng, L.Z. Yi, B. Wang, Q.S. Xu, Q.N. Hu, L.X. Zhang, H.M.
Lu, Y.Z. Liang, Anal. Chim. Acta 706 (2011) 97.
[49] L. Breiman, Machine Learning 45 (2001) 5.
[50] J.H. Friedman, Ann. Stat. 29 (2001) 1189.
[51] D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, L.-X. Zhang, H.-D. Li, Chemometr. Intell. Lab.
Syst. 100 (2010) 1.
[52] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, L.-X. Zhang, Q.-N. Hu, H.-D. Li, J.
Chemometr. 25 (2011) 201.
[53] V. Svetnik, T. Wang, C. Tong, A. Liaw, R.P. Sheridan, Q. Song, J. Chem. Inf.
Model. 45 (2005) 786.
[54] V. Svetnik, A. Liaw, C. Tong, J.C. Culberson, R.P. Sheridan, B.P. Feuston, J.
Chem. Inf. Model. 43 (2003) 1947.
[55] P. He, K.-T. Fang, Y.-Z. Liang, B.-Y. Li, Anal. Chim. Acta 543 (2005) 181.
[56] R. Diaz-Uriarte, S. Alvarez de Andres, BMC Bioinformatics 7 (2006) 3.
[57] Y. Fan, T.B. Murphy, J.C. Byrne, L. Brennan, J.M. Fitzpatrick, R.W.G. Watson,
J. Proteome Res. 10 (2011) 1361.


Captions

Figure 1. (A) A simple, partitioned two-dimensional training set. (B) Construction of
CART for training data via recursive partitioning of the set. In this process, CART
produces homogeneous sub-sets of data by using useful variables, as indicated in
the gray region. The samples in the gray region possess tree-based similarity. The
gray region in A corresponds to node 4 in B.

Figure 2. Graphical representation of the proposed tree-based ensemble approaches.
A large number of local rules underlying data structure are explored by tree
ensemble. First, a number of CART models with different structures are grown by
means of the Monte Carlo strategy. As each tree is constructed, CART-based sample
similarity can be computed to generate a sample-proximity matrix, which gives an
intrinsic measure of similarities between samples. At the same time,
variable-importance ranking can be obtained to rank the variables.

Figure 3. Pattern-analysis results of metabolomics data using MCTree. A: metabolite-importance ranking; B: sample-proximity plot. The sample-proximity matrix is directly
visualized by the heat map. Hierarchical clustering analysis is simultaneously
performed on the sample proximity. Clearly, four big clusters indicating four patterns
of this metabolomics data can be obtained using MCTree: (a) the control group; (b)
chronic schistosomiasis patients; (c) patients treated with Danggui Buxue decoction
for one month; and, (d) patients treated with Danggui Buxue decoction for two
months.


Figure 4. Cluster results of TCA using two visual ways. A: sample-proximity plot,
which is directly visualized by the heat map. Hierarchical clustering analysis is
simultaneously performed on the sample proximity. Clearly, three big clusters
indicating three species of the iris data can be obtained using TCA. B: the MDS plot,
which is used to map the sample proximity to two-dimensional space. Likewise,
three species are obviously observed.

Figure 5. (A) The simulated data for a two-class problem, with the classes marked +1 (red) and -1 (blue). (B) Test-error rate for the boosting tree (red curve) and the random forest (RF) (black curve) as a function of the number of iterations. Also shown is the test-error rate for CART with 16 terminal nodes. The test set is generated from the same distribution as the training set. The test-error rates for the boosting tree, RF and CART are 3.7%, 4.3% and 8.2%, respectively. The accuracy of CART can be greatly improved by ensembling.








Fig. 1.







Fig. 2.






Fig. 3.






Fig. 4.








Fig. 5.








Highlights

Decision trees perform automatic stepwise variable selection and complexity reduction

Decision trees can cope effectively with complex chemical data

Tree-based ensemble approaches can be applied to solve various chemometric problems
