$$C_{\alpha} = Q + \alpha L \quad (1)$$
where Q is the misclassification cost (e.g., the misclassification rate) of the sub-tree generated by pruning, and the regularization parameter α determines the trade-off between the misclassification cost and the model complexity measured by L (i.e., the number of terminal nodes). The value of α can usually be determined by an independent validation set or cross-validation.
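As an illustration of how α can be tuned in practice, the sketch below uses scikit-learn's cost-complexity pruning, in which the ccp_alpha parameter plays the role of α; the synthetic data and the 5-fold cross-validation are assumptions made purely for the example, not part of the original text.

```python
# Minimal sketch: selecting the cost-complexity parameter alpha of Eq. (1)
# by cross-validation, assuming scikit-learn and synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate alpha values come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Keep the alpha whose pruned tree performs best under 5-fold cross-validation.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print("selected alpha:", best_alpha)
```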
As Breiman pointed out, CART carries out a thorough, though not optimal, search in two spaces (i.e., the variable space and the sample space): it searches for variable sub-sets relevant to classification and for sample sub-sets with specific similarity under different variable subspaces (see Fig. 1). That is, on the one hand, it can easily select informative components from hundreds to thousands of variables. On the other hand, it can also recursively divide the total sample space into several rectangular areas with specific similarity (e.g., the samples under the same terminal node; see also the gray region in Fig. 1A).
CART intrinsically produces homogeneous sub-sets of data. However, the CART procedure is usually unstable: a slight fluctuation in the data often leads to a completely different tree structure reflecting different ad hoc rules. Hence, collecting and combining many trees through an ensemble strategy helps probe the intrinsic structure of the data more deeply.
2.2. Variable-importance ranking
In a tree classifier containing L terminal nodes, the importance of a candidate variable x^(j) can be evaluated as follows:
$$J(x^{(j)}) = \sum_{m=1}^{L-1} \Delta\iota_m \, I\big(v(m) = x^{(j)}\big) \quad (2)$$
where v(m) denotes the split variable selected at node m. If variable x^(j) is selected as the splitting variable at node m, I(v(m) = x^(j)) = 1; otherwise, I(v(m) = x^(j)) = 0.
The corresponding impurity decrease Δι_m evaluates the importance of the variable selected to split the region at node m. If parent node m is split into two child nodes (e.g., m_l and m_r), the impurity decrease Δι_m at node m is defined as:
$$\Delta\iota_m = \iota_m - p_l\,\iota_{m_l} - p_r\,\iota_{m_r} \quad (3)$$
where ι_m, ι_{m_l} and ι_{m_r} are, respectively, the impurity at parent node m, the left child node and the right child node, and p_l and p_r are the proportions of samples that fall into the left and right child nodes, respectively. Node impurity is usually measured by the Gini index:
$$\iota = \sum_{t=1}^{T} p_t (1 - p_t) \quad (4)$$
where T is the number of classes and p_t is the sample proportion of class t. Thus, all impurity decreases associated with variable x^(j) are added up to obtain the importance measure for that variable.
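To make Eqs. (2)-(4) concrete, the short sketch below computes the Gini impurity of a node and the impurity decrease produced by one split; the function names and the toy labels are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def gini(labels):
    """Gini index of Eq. (4): sum over classes of p_t * (1 - p_t)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def impurity_decrease(parent, left, right):
    """Impurity decrease of Eq. (3): iota_m - p_l*iota_ml - p_r*iota_mr."""
    p_l, p_r = len(left) / len(parent), len(right) / len(parent)
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# Toy split that separates the two classes perfectly: the decrease equals the
# parent impurity (0.5), and per Eq. (2) it would be credited to the variable
# chosen at this node.
parent = np.array([0, 0, 0, 1, 1, 1])
print(impurity_decrease(parent, parent[:3], parent[3:]))  # 0.5
```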
2.3. Sample-proximity matrix
We can construct a sample-proximity matrix by using the ensemble strategy to establish the relationship between samples (see Fig. 2). Given a sample matrix X of size n × p, each row denotes a sample and each column a variable. The corresponding class labels are recorded in a vector y of size n × 1 with elements from 1 to T for the T-class case. To begin with, a sample-proximity matrix PROX of size n × n with all elements equal to 0 is generated. Then, with the help of a Monte-Carlo (MC) procedure, the samples of each class are randomly divided into training samples and validation samples. The training samples are combined to obtain the final training set [e.g., (X_train, y_train)], and the validation samples are combined to obtain the final validation set [e.g., (X_val, y_val)]. The training samples are first used to grow a classification tree, and the validation samples are then used to prune the overgrown tree to obtain the optimal pruning level (e.g., L_best). Instead of obtaining an optimal tree, a sub-optimal tree, lying between the optimal and the overgrown tree, is constructed using a fuzzy pruning strategy: randomly generate a pruning level L between 1 and L_best, and then prune the overgrown tree to that level. The fuzzy pruning strategy helps exploit the information of the internal nodes effectively without totally destroying the structure of the tree. After that, all samples (i.e., X) are predicted by the sub-optimal tree, and thereby each sample falls into one of the terminal nodes.
It is worth noting that the samples under the same terminal node may possess some specific similarity beyond mere class similarity (i.e., the samples in one terminal node are more similar to each other than to those in other terminal nodes). If two samples i and j end up in the same terminal node, the sample-proximity measure PROX(i, j) is increased by 1. Herein, PROX can be considered a tree-based similarity measure that reflects the degree of similarity among training samples.
We can repeat the above process many times (e.g., ntree) to establish a large number of tree models, and the sample-proximity matrix is updated with the result of each tree. The larger PROX(i, j) is, the more similar training samples i and j are. At the end of the construction, the sample-proximity matrix is normalized by dividing by the number of trees. Note that the proximity between a sample and itself is always set to 1 [i.e., PROX(i, i) = 1]. Likewise, the predictive proximity matrix PPROX, which reflects the similarity between training samples and new samples, can be constructed in a similar way. The concept of sample proximity is of great importance for understanding the applications of tree-based ensemble methods (e.g., outlier detection, pattern analysis, cluster analysis and tree-based kernels).
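A minimal sketch of the proximity construction described above is given below, assuming scikit-learn trees; the fuzzy pruning level is approximated by drawing a random ccp_alpha between 0 (the overgrown tree) and the validation-optimal value, and all data sizes and the value of ntree are illustrative assumptions.

```python
# Sketch: building the sample-proximity matrix PROX with Monte-Carlo splits,
# fuzzy pruning and terminal-node co-occurrence counting (all sizes assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)
n, ntree = len(X), 200
prox = np.zeros((n, n))

for _ in range(ntree):
    # Monte-Carlo division into training and validation samples, per class.
    idx_tr, idx_val = train_test_split(np.arange(n), test_size=0.3, stratify=y,
                                       random_state=int(rng.integers(1_000_000)))
    # Grow an overgrown tree and locate the validation-optimal pruning level.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X[idx_tr], y[idx_tr])
    best_alpha = max(path.ccp_alphas,
                     key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                     .fit(X[idx_tr], y[idx_tr]).score(X[idx_val], y[idx_val]))
    # Fuzzy pruning: keep a sub-optimal tree between overgrown and optimal.
    alpha = rng.uniform(0.0, best_alpha)
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X[idx_tr], y[idx_tr])
    # Drop every sample down the tree; co-occurrence in a terminal node
    # increments the proximity count.
    leaves = tree.apply(X)
    prox += (leaves[:, None] == leaves[None, :]).astype(float)

prox /= ntree                 # normalise by the number of trees
np.fill_diagonal(prox, 1.0)   # PROX(i, i) = 1 by convention
```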
3. Applications to tree-based ensemble methods
Over the past few years, various theoretical studies and applications related to CART have been reported in analytical chemistry, mainly in QSAR/QSPR, mass spectrometry (MS), food analysis, -omics studies, and chromatography. CART was also reviewed recently [16]. Theoretical studies put more emphasis on improving the prediction accuracy and reducing the instability of CART [17]. Applying global optimization algorithms to grow the optimal tree model was reported recently [18-20]. Variable selection based on CART is another active research topic [21-23]. Among all papers regarding CART, those on ensemble variants make up a substantial part of the CART literature.
3.1. Ensemble-variable selection
The development of robust, high-throughput analytical techniques allows the
simultaneous measurement of hundreds to thousands of features on a single chemical
or biological sample. Thus, the data generated in studies are usually described by a
large number of highly correlated variables. In large-feature/small-sample-size problems, both model performance and the robustness of the variable-selection process are important, so robust feature-selection techniques are extensively needed and used in analytical chemistry. Several methods for variable selection have been proposed, including commonly-used univariate statistical measures (e.g., the F-statistic, Fisher's variance ratio and the correlation coefficient), the genetic algorithm (GA), particle-swarm optimization (PSO), stepwise regression with different criteria, the lasso, and uninformative variable elimination (UVE) [24].
The general procedure of the tree-based ensemble feature selection can be
summarized as follows:
(1) Apply Monte-Carlo techniques to the original data Z = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} to obtain a training set and a validation set. Here, x_i (i = 1, 2, ..., n) is the i-th sample, n is the number of samples, and y_i is the corresponding class label or response value.
(2) Construct a decision-tree model. The training set is used for growing a tree and the validation set is used for pruning it. The feature importance and prediction accuracy are finally obtained for the current split.
(3) Repeat steps (1) and (2) a number of times (e.g., ntree = 1000) to obtain a large number of tree models. Simultaneously, record the variable importance J^b(x^(j)) and the prediction accuracy Acc_b for each tree b.
(4) Determine the final importance by averaging the per-tree importances weighted by the corresponding prediction accuracies (see the sketch below):
$$J(x^{(j)}) = \frac{1}{ntree} \sum_{b=1}^{ntree} J^{b}(x^{(j)}) \cdot \mathrm{Acc}_b \quad (5)$$
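A minimal sketch of steps (1)-(4) follows; scikit-learn's Gini-based feature_importances_ stands in for J^b(x^(j)), pruning is replaced by a simple depth limit for brevity, and the data set and loop sizes are assumptions of the example.

```python
# Sketch of the ensemble variable-importance procedure, Eq. (5):
# per-tree importances weighted by validation accuracy, then averaged.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=80, n_features=200, n_informative=8,
                           random_state=1)            # "short fat" data, p >> n
ntree = 1000
importance = np.zeros(X.shape[1])

for b in range(ntree):
    # (1) Monte-Carlo division into training and validation sets.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                stratify=y, random_state=b)
    # (2) Grow a tree on the training set (depth limit instead of pruning here).
    tree = DecisionTreeClassifier(max_depth=5, random_state=b).fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)                     # prediction accuracy Acc_b
    # (3) Record the per-tree importance J_b(x^(j)), weighted by Acc_b.
    importance += tree.feature_importances_ * acc

# (4) Average the accuracy-weighted importances over all trees, Eq. (5).
importance /= ntree
print("top-10 variables:", np.argsort(importance)[::-1][:10])
```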
The instability of the tree structure leads to unstable variable-importance rankings. Analyzing high-throughput data by cross-prediction within Monte-Carlo schemes can intensively probe a large population of models. The informative features selected by individual tree models do not necessarily reflect their true behavior, since they are seriously affected by the random division into training and validation sets, especially for short, fat data (i.e., p >> n).
However, variable importance constructed by the ensemble strategy provides more objective information. Tree-based ensemble-variable selection can effectively guarantee that the selected variables reflect a real contribution to classification rather than chance. The variable importance is thereby more stable and
reliable [25,26]. Robust variable-selection techniques would allow domain experts to
have more confidence in the selected variables, as, in most cases, these variables are
subsequently analyzed further, requiring much time and effort, especially in
biomedical applications [27].
Ensemble-variable selection, in particular tree-based ensemble-variable selection,
has been developed and widely applied in chemistry and biology [28-33]. Using a
similar idea, a Monte Carlo feature-selection method for supervised classification
was proposed to select the most discriminant genes in leukemia and lymphoma
datasets [28]. Comparisons with the commonly-used statistical techniques in
chemometrics (e.g., genetic algorithms on multiple linear regression, UVE-PLS)
were reported on prediction performance and selected variables [32]. The stability of
ensemble-variable-selection techniques was also further studied using several
stability indices on microarray datasets and MS datasets [33].
3.2. Outlier detection
Detection and removal of outliers from measured data is an important step before modeling [34]. Both the interpretability and the prediction performance of a model can be improved when it is built on data free of outliers. Algorithms have been developed for outlier detection and proved effective, including the conventional Mahalanobis distance [35], Cook's distance, and others {e.g., the minimum covariance determinant (MCD), robust PCA [36], resampling by half-means (RHM), the smallest half-volume (SHV) [37] and the MC method [38]}.
We propose an outlier-detection method based on the sample proximity, PROX, and the predictive proximity, PPROX. Outliers are treated as samples having small proximities to all other samples. Since the data in some classes are more spread out than in others, outlyingness is defined only with respect to the other data in the same class as the given sample. For molecule i, a measure of outlyingness is computed according to:
$$\mathrm{out}(i) = \Big[\sum_{j \in \mathrm{class}(i)} \mathrm{PROX}(i, j)^{2}\Big]^{-1}$$
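Assuming the proximity matrix prox and the class vector y from the earlier sketch, this outlyingness measure could be computed roughly as follows; excluding a sample's proximity to itself and the small constant guarding against division by zero are assumptions of this sketch, not part of the original definition.

```python
import numpy as np

def outlyingness(prox, y):
    """Inverse of the summed squared within-class proximities for each sample."""
    out = np.zeros(len(y))
    for i in range(len(y)):
        # Other samples in the same class as sample i (sample i itself excluded).
        same = np.flatnonzero((y == y[i]) & (np.arange(len(y)) != i))
        out[i] = 1.0 / (np.sum(prox[i, same] ** 2) + 1e-12)
    return out

# Samples whose outlyingness is much larger than that of their class mates
# are flagged as candidate outliers.
```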