2017-05-02
Recommended Citation
Tang, Fei, "Random Forest Missing Data Approaches" (2017). Open Access Dissertations. 1852.
https://scholarlyrepository.miami.edu/oa_dissertations/1852
UNIVERSITY OF MIAMI

RANDOM FOREST MISSING DATA APPROACHES

By
Fei Tang

A DISSERTATION

May 2017

© 2017
Fei Tang
All Rights Reserved
Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a new promising imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random. Real data analysis using the RF imputation methods was conducted on the MESA data.
TABLE OF CONTENTS

Chapter                                                              Page
1 PREFACE ............................................................... 1
Bibliography ........................................................... 96
List of Figures

6.1 Simulation 1 ........................................................ 74
6.2 Simulation 2 ........................................................ 75
6.3 Simulation 3 ........................................................ 76
6.4 Simulation 4 ........................................................ 77
6.5 Simulation 5 ........................................................ 78
6.6 Importance measurement for all four variables ....................... 79
6.7 Minimal depth measurement for all four variables .................... 80
6.8 The correlation change due to missForest imputation ................. 81
6.9 Simulation 5 with nodesize of 200 ................................... 82
6.10 The effect of different imputation methods with the Diabetes data .. 83
6.11 The effect of different imputation methods in GLM .................. 84
7.1 The percent of missingness for all the variables in MESA data ....... 88
List of Tables
Chapter 1
Preface
researchers are forced to either impute data or to discard missing values when missing data are encountered. Of course, simply discarding observations with missing values (complete case analysis) is usually not a reasonable practice, as valuable information may be lost and inferential power compromised (Enders, 2010). It can even cause selection bias in some cases. In addition, deleting observations with missing values may leave very few observations when a large number of predictive variables contain missing values, especially for high dimensional data. As a result, it is advisable to impute the data before any analysis is performed.
Many statistical imputation methods have been developed for missing data. However, in high dimensional and large scale data settings, such as genomic, proteomic, neuroimaging, and other high-throughput problems, many of these methods perform poorly. For example, it is recommended that all variables be included in multiple imputation to make it proper in general and to avoid biasing the estimates of the correlations (Rubin, 1996). This can lead to overparameterization when there are a large number of variables but the sample size is moderate, a scenario often seen with modern genomic data. In addition, computational issues can arise in the implementation of standard methods. An example is the occurrence of non-convexity due to missing data when maximizing the log-likelihood, which is problematic and challenging to optimize using traditional methods such as the EM algorithm (?). Among the available imputation methods, some are for continuous data (KNN, for instance) (Troyanskaya et al., 2001) and some are for categorical data (the saturated multinomial model). MICE (Multivariate Imputation by Chained Equations) (Van Buuren, 2007) handles mixed data types (i.e., data having both continuous and categorical variables), but depends on tuning parameters and the specification of a parametric model. High dimensional data often feature mixed data types with complicated interactions among variables, making it infeasible to specify any parametric model. In addition, implementation of these methods can often break down in challenging data settings (Liao et al., 2014), as they were never designed to be regularized or they cannot be applied due to computational issues. Another serious issue is that most methods cannot deal with complex interactions and nonlinearity among variables, which are common in data from medical research. Standard multiple imputation approaches do not automatically incorporate interaction effects, and, not surprisingly, this leads to biased parameter estimates when interactions are present (Doove et al., 2014). Although some techniques, such as fully conditional specification of the covariates (Bartlett et al., 2015), can be used to try to solve this problem, these techniques may be hard and inefficient to implement in settings where interactions are expected to be complicated.
For these reasons there has been much interest in using machine learning methods for missing data imputation. A promising approach is based on Breiman's random forests (abbreviated hereafter as RF; Breiman (2001)). RF has the desirable characteristics of being able to: (1) handle mixed types of missing data; (2) address interactions and nonlinearity; (3) scale to high dimensions while being free of distributional assumptions; (4) avoid overfitting; (5) address settings where there are more variables than observations; and (6) yield measures of variable importance that can potentially be used for variable selection. Currently there are several different RF missing data algorithms. These include the original RF proximity algorithm proposed by Breiman (Breiman, 2003) and implemented in the randomForest R package (Liaw and Wiener, 2002). A different class of algorithms are the
2. For a given subject with missing Y and predictor values x1, ..., xp, take the observed values of Y in the terminal nodes of all k trees.

This process was embedded into MICE and repeated to create multiple imputations. The approach is included in van Buuren's MICE package in R. Independently of Doove et al., Shah et al. (2014) also proposed using random forests for imputation, using a somewhat different approach:

3. Missing Y values are imputed by taking a normal draw, with residual variance equal to the out-of-bag mean squared error.
\mathrm{MAR}: \quad P(R \mid Z_{\mathrm{comp}}) = P(R \mid Z_{\mathrm{obs}})
2. Chapter 3 reviews the existing random forest based approaches for handling
incomplete data.
5. Chapter 6 shows how VIMP and minimal depth are affected by different
methods of handling missing data.
Chapter 2
\hat{C}_n(x) = \operatorname*{argmax}_{1 \le j \le J} \sum_{i=1}^{n} \mathbf{1}\{Y_i = C_j\}\, \mathbf{1}\{X_i \in R_n(x)\} \qquad (2.1)
The general principle of a splitting rule is the reduction of tree impurity: the rule encourages the tree to push dissimilar cases apart. Commonly used splitting rules include the twoing criterion, the entropy criterion, and the Gini criterion. The Gini splitting rule is arguably the most popular and is defined as follows. Let h be a tree node that is being split. Let \hat{p}_j(h) denote the proportion of class j cases in h. Let s be a proposed split for a variable x that splits h into left and right daughter nodes h_L and h_R, where h_L contains the cases \{x \le s\} and h_R the cases \{x > s\}.
Let N = |h|, N_L = |h_L|, and N_R = |h_R| denote the number of cases in h, h_L, and h_R (note that N = N_L + N_R). The Gini node impurity for h is defined as

\hat{\phi}(h) = \sum_{j=1}^{J} \hat{p}_j(h)\bigl(1 - \hat{p}_j(h)\bigr),

and similarly for the left daughter node,

\hat{\phi}(h_L) = \sum_{j=1}^{J} \hat{p}_j(h_L)\bigl(1 - \hat{p}_j(h_L)\bigr),
where \hat{p}_j(h_L) is the class frequency for class j in h_L; \hat{\phi}(h_R) is defined in a similar way. The decrease in the node impurity is

\hat{\phi}(h) - \Bigl[\hat{p}(h_L)\,\hat{\phi}(h_L) + \hat{p}(h_R)\,\hat{\phi}(h_R)\Bigr].

The quantity

\hat{\theta}(s, h) = \hat{p}(h_L)\,\hat{\phi}(h_L) + \hat{p}(h_R)\,\hat{\phi}(h_R)

is referred to as the Gini index or Gini splitting rule. The best split on x is the split-point s = \hat{s} maximizing the decrease in node impurity, which is equivalent to minimizing the Gini index, \hat{\theta}(s, h), with respect to s.
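As a concrete illustration (a minimal, self-contained sketch in plain Python, not tied to any particular RF implementation; the function names are mine), the Gini node impurity and the decrease achieved by a candidate split can be computed as:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini node impurity: sum over classes j of p_j * (1 - p_j)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_split(x, y, s):
    """Decrease in impurity for the split x <= s versus x > s."""
    left = [yi for xi, yi in zip(x, y) if xi <= s]
    right = [yi for xi, yi in zip(x, y) if xi > s]
    n = len(y)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(y) - weighted

# A split that separates the two classes perfectly recovers the full
# parent impurity: parent impurity 0.5, both daughters pure.
x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
print(gini_split(x, y, 2.0))  # 0.5
```

The best split maximizes this decrease, equivalently minimizing the weighted daughter impurity over all candidate split points s.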
The twoing criterion (Breiman et al., 1984, pp. 104-106) is designed to find the grouping of all J classes into two superclasses that leads to the greatest decrease in node impurity when considered as a two-class problem. Under Gini impurity, the twoing splitting rule is

\hat{\theta}(s, h) = \frac{\hat{P}(h_L)\,\hat{P}(h_R)}{4} \left[\sum_{j=1}^{J} \bigl|\hat{p}_j(h_L) - \hat{p}_j(h_R)\bigr|\right]^2.

The entropy node impurity for h is

\hat{\phi}(h) = -\sum_{j=1}^{J} \hat{p}_j(h) \log \hat{p}_j(h),

and the corresponding entropy splitting rule is

\hat{\theta}(s, h) = -\hat{P}(h_L)\sum_{j=1}^{J} \hat{p}_j(h_L) \log \hat{p}_j(h_L) - \hat{P}(h_R)\sum_{j=1}^{J} \hat{p}_j(h_R) \log \hat{p}_j(h_R).
We shall denote the learning data by (X_i, Y_i)_{1 \le i \le n}, where X_i = (X_{i,1}, ..., X_{i,p}) is a p-dimensional feature and Y_i = (Y_{i,1}, ..., Y_{i,q}) is a q \ge 1 dimensional response. We shall denote a generic coordinate of the multivariate feature by X and refer to X as a variable (i.e., a covariate). For example, X_i refers to the value of the covariate for case i. Multivariate regression corresponds to the case where the Y_{i,j} are continuous. To define the multivariate regression splitting rule, we begin by considering univariate (q = 1) regression.
Consider splitting a regression tree T at a node t. Let s be a proposed split for a variable X that splits t into left and right daughter nodes t_L := t_L(s) and t_R := t_R(s), where t_L contains the cases \{X_i \le s\} and t_R the cases \{X_i > s\}. Regression tree impurity for t is measured by the within-node sample variance

\hat{\phi}(t) = \frac{1}{N} \sum_{X_i \in t} (Y_i - \bar{Y}_t)^2,

where \bar{Y}_t = \sum_{X_i \in t} Y_i / N is the sample mean for t and N = |t| is the sample size of t (note that N = n only when t is the root node). The within-sample variance for a daughter node, say t_L, is

\hat{\phi}(t_L) = \frac{1}{N_L} \sum_{i \in X(s,t)} (Y_i - \bar{Y}_{t_L})^2, \qquad X(s,t) = \{X_i \in t,\ X_i \le s\},

where \bar{Y}_{t_L} is the sample mean for t_L and N_L is the sample size of t_L. The decrease in impurity under the split s for X equals

\hat{D}(s,t) = \hat{\phi}(t) - \Bigl[\hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)\Bigr],

and maximizing this decrease over s is equivalent to minimizing

\hat{D}_W(s,t) = \frac{1}{N} \sum_{i \in t_L} (Y_i - \bar{Y}_{t_L})^2 + \frac{1}{N} \sum_{i \in t_R} (Y_i - \bar{Y}_{t_R})^2.

In other words, CART seeks the split-point \hat{s}_N that minimizes the weighted sample variance.
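The weighted sample variance criterion is simple to implement. The sketch below (illustrative only; variable names are mine) evaluates D_W(s, t) for each candidate split point and returns the minimizer:

```python
def weighted_split_variance(x, y, s):
    """D_W(s,t): (1/N) [ sum_left (y - mean_L)^2 + sum_right (y - mean_R)^2 ]."""
    left = [yi for xi, yi in zip(x, y) if xi <= s]
    right = [yi for xi, yi in zip(x, y) if xi > s]

    def ss(v):  # within-node sum of squares around the node mean
        if not v:
            return 0.0
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v)

    return (ss(left) + ss(right)) / len(y)

def best_split(x, y):
    """CART chooses the split-point minimizing the weighted sample variance."""
    candidates = sorted(set(x))[:-1]  # splitting above the maximum is vacuous
    return min(candidates, key=lambda s: weighted_split_variance(x, y, s))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.1, 0.0, 0.2, 5.0, 5.1, 4.9]   # a jump between x = 3 and x = 4
print(best_split(x, y))  # 3.0
```

Splitting at 3.0 leaves each daughter nearly constant, which minimizes the weighted variance exactly as the criterion prescribes.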
We extend the weighted sample variance rule to the multivariate case q > 1 by applying the splitting rule to each coordinate separately. We seek to minimize

\hat{D}_W^{*}(s,t) = \sum_{j=1}^{q} W_j \left\{ \sum_{i \in t_L} (Y_{i,j} - \bar{Y}_{t_L j})^2 + \sum_{i \in t_R} (Y_{i,j} - \bar{Y}_{t_R j})^2 \right\},

where 0 < W_j < 1 are prespecified weights for weighting the importance of coordinate j of the response Y, and \bar{Y}_{t_L j} and \bar{Y}_{t_R j} are the sample means of the j-th coordinate in the left and right daughter nodes. Notice that such a splitting rule can only be effective if the coordinates of Y are measured on the same scale; otherwise a coordinate j with, say, enormous values would dominate \hat{D}_W^{*}(s,t). We can calibrate \hat{D}_W^{*}(s,t) by using W_j, but it will be more convenient to assume that each coordinate has been standardized:

\frac{1}{N} \sum_{i \in t} Y_{i,j} = 0, \qquad \frac{1}{N} \sum_{i \in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q. \qquad (2.2)
The standardization is applied prior to splitting the node t. With some elementary manipulations, it is easily verified that minimizing \hat{D}_W^{*}(s,t) is equivalent to maximizing

\hat{D}_W^{\star}(s,t) = \sum_{j=1}^{q} W_j \left\{ \frac{1}{N_L}\left(\sum_{i \in t_L} Y_{i,j}\right)^2 + \frac{1}{N_R}\left(\sum_{i \in t_R} Y_{i,j}\right)^2 \right\}. \qquad (2.3)

Rule (2.3) is the multivariate splitting rule used for multivariate continuous outcomes.
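The equivalence of minimizing the sum-of-squares form and maximizing (2.3) follows because the total sum of squares within the node is fixed. It is easy to verify numerically; the following sketch (my own illustration, with equal weights W_j = 1 assumed) checks that both criteria select the same split:

```python
def standardize(col):
    """Center and scale so (1/N) sum y = 0 and (1/N) sum y^2 = 1, as in (2.2)."""
    n = len(col)
    m = sum(col) / n
    centered = [v - m for v in col]
    scale = (sum(v * v for v in centered) / n) ** 0.5
    return [v / scale for v in centered]

def d_star(x, ys, s, weights):
    """Sum-of-squares form: to be MINIMIZED."""
    total = 0.0
    for w, col in zip(weights, ys):
        for side in (lambda xi: xi <= s, lambda xi: xi > s):
            vals = [v for xi, v in zip(x, col) if side(xi)]
            m = sum(vals) / len(vals)
            total += w * sum((v - m) ** 2 for v in vals)
    return total

def d_max(x, ys, s, weights):
    """Squared-mean form (2.3): to be MAXIMIZED."""
    total = 0.0
    for w, col in zip(weights, ys):
        for side in (lambda xi: xi <= s, lambda xi: xi > s):
            vals = [v for xi, v in zip(x, col) if side(xi)]
            total += w * sum(vals) ** 2 / len(vals)
    return total

x = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [standardize([2.0, 1.0, 7.0, 8.0, 9.0]),
      standardize([0.3, 0.1, 0.2, 1.4, 1.6])]
w = [1.0, 1.0]
splits = [1.0, 2.0, 3.0, 4.0]
best_min = min(splits, key=lambda s: d_star(x, ys, s, w))
best_max = max(splits, key=lambda s: d_max(x, ys, s, w))
print(best_min == best_max)  # True: the two criteria pick the same split
```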
Now we consider the effect of Gini splitting when Y_{i,j} is categorical. First consider the univariate case (i.e., the multiclass problem), where the outcome Y is a class label Y \in \{1, ..., K\} with K \ge 2. Consider growing a classification tree using Gini splitting. Let \hat{p}_k(t) denote the class frequency for class k in a node t. The Gini node impurity for t is defined as

\hat{\phi}(t) = \sum_{k=1}^{K} \hat{p}_k(t)\bigl(1 - \hat{p}_k(t)\bigr),

and the decrease in impurity under the split s is

\hat{D}(s,t) = \hat{\phi}(t) - \Bigl[\hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)\Bigr].
The quantity

\hat{G}(s,t) = \hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)

is minimized by the best split. Minimizing \hat{G}(s,t) is equivalent to maximizing

\sum_{k=1}^{K} \left[\frac{N_{k,L}^2}{N_L} + \frac{N_{k,R}^2}{N_R}\right] = \sum_{k=1}^{K} \left[\frac{1}{N_L}\left(\sum_{i \in t_L} Z_{i(k)}\right)^2 + \frac{1}{N_R}\left(\sum_{i \in t_R} Z_{i(k)}\right)^2\right], \qquad (2.4)

where Z_{i(k)} = \mathbf{1}\{Y_i = k\}. Notice the similarity to (2.3). When q = 1, the regression
where Z_{i(k),j} = \mathbf{1}\{Y_{i,j} = k\}. Notice that this extended Gini splitting rule is equivalent to (2.3), but with an additional summation over the class labels k = 1, ..., K_j for each j. The normalization 1/K_j employed for a coordinate j is required to standardize the contribution of the Gini split from that coordinate.
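The identity behind (2.4) can be checked directly: the weighted daughter impurity equals one minus 1/n times the indicator-sum criterion. A small numeric sketch (my own illustration) verifies this:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_criterion_counts(left, right):
    """Classical form: weighted Gini impurity of the daughters (to minimize)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def gini_criterion_sums(left, right, classes):
    """Indicator-sum form of (2.4):
    sum_k [ (sum_L Z_k)^2 / N_L + (sum_R Z_k)^2 / N_R ]  (to maximize)."""
    total = 0.0
    for node in (left, right):
        for k in classes:
            zk = sum(1 for y in node if y == k)
            total += zk ** 2 / len(node)
    return total

left, right = [0, 0, 1], [1, 1, 2, 2]
classes = [0, 1, 2]
n = len(left) + len(right)
# Identity: weighted daughter impurity = 1 - (1/n) * indicator-sum criterion.
lhs = gini_criterion_counts(left, right)
rhs = 1 - gini_criterion_sums(left, right, classes) / n
print(abs(lhs - rhs) < 1e-12)  # True
```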
The equivalence between Gini splitting for categorical responses and weighted variance splitting for continuous responses now points to a means of splitting multivariate mixed outcomes. Let Q_R \subseteq \{1, ..., q\} denote the coordinates of the continuous outcomes in Y and let Q_C denote the coordinates having categorical outcomes.
Notice that Q_R \cup Q_C = \{1, ..., q\}. The mixed outcome splitting rule is

\Theta(s,t) = \sum_{j \in Q_C} W_j \left[\frac{1}{K_j}\sum_{k=1}^{K_j}\left\{\frac{1}{N_L}\left(\sum_{i \in t_L} Z_{i(k),j}\right)^2 + \frac{1}{N_R}\left(\sum_{i \in t_R} Z_{i(k),j}\right)^2\right\}\right] + \sum_{j \in Q_R} W_j \left\{\frac{1}{N_L}\left(\sum_{i \in t_L} Y_{i,j}\right)^2 + \frac{1}{N_R}\left(\sum_{i \in t_R} Y_{i,j}\right)^2\right\}.
Segal, Intrator, and LeBlanc and Crowley use as the prediction rule the Kaplan-Meier estimate of the survival distribution, and as the splitting rule a test for measuring differences between distributions adapted to censored data, such as the log-rank test or the Wilcoxon test. These statistics are weighted versions of the log-rank statistic, where the weights allow flexibility in emphasizing differences between two survival curves at early times (the left tail of the distribution), middle times, or late times (the right tail of the distribution). In particular, an observation at time t is weighted by S(t)^\rho (1 - S(t))^\gamma, where S is the Kaplan-Meier estimate of the survival curve for both samples combined. Thus we obtain sensitivity to early-occurring differences by taking \rho > 0 and \gamma \approx 0; we emphasize differences in the middle by taking \rho \approx \gamma \approx 1; and we emphasize late differences by taking \rho \approx 0 and \gamma > 0.
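These weights are the G^{\rho,\gamma} family of Fleming and Harrington; a minimal numeric illustration of the weighting behavior (the weight form S(t)^\rho (1 - S(t))^\gamma is assumed here, as is all naming):

```python
def fh_weight(surv, rho, gamma):
    """Weight S(t)^rho * (1 - S(t))^gamma applied to the statistic at time t."""
    return surv ** rho * (1.0 - surv) ** gamma

# Early in follow-up S(t) is near 1; late in follow-up it is near 0.
early, late = 0.95, 0.10
# rho > 0, gamma = 0 up-weights early differences:
print(fh_weight(early, 1, 0) > fh_weight(late, 1, 0))   # True
# rho = 0, gamma > 0 up-weights late differences:
print(fh_weight(early, 0, 1) < fh_weight(late, 0, 1))   # True
```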
Three different survival splitting rules can be used: (i) a log-rank splitting rule, the default splitting rule; (ii) a conservation-of-events splitting rule; (iii) a log-rank score rule.
Notation

Assume we are at node h of a tree during its growth and that we seek to split h into two daughter nodes. We introduce some notation to help discuss how the various splitting rules work to determine the best split. Assume that within h are n individuals. Denote their survival times and 0-1 censoring information by (T_1, \delta_1), ..., (T_n, \delta_n). An individual l is said to be right censored at time T_l if \delta_l = 0; otherwise, if \delta_l = 1, the individual is said to have died at T_l. In the case of death, T_l is referred to as an event time, and the death as an event. An individual l who is right censored at T_l is known to have been alive at T_l, but the exact time of death is unknown. A proposed split at node h on a given predictor x is always of the form x \le c and x > c. Such a split forms two daughter nodes (a left and right daughter) and two new sets of survival data. A good split maximizes survival differences across the two sets of data. Let t_1 < t_2 < \cdots < t_N be the distinct death times in the parent node h, and let d_{i,j} and Y_{i,j} equal the number of deaths and individuals at risk at time t_i in daughter nodes j = 1, 2. Note that Y_{i,j} is the number of individuals in daughter j who are alive at time t_i, or who have an event (death) at time t_i. More precisely,

Y_{i,1} = \#\{l : T_l \ge t_i,\ x_l \le c\}, \qquad Y_{i,2} = \#\{l : T_l \ge t_i,\ x_l > c\},

where x_l is the value of x for individual l = 1, ..., n. Finally, define Y_i = Y_{i,1} + Y_{i,2} and d_i = d_{i,1} + d_{i,2}. Let n_j be the total number of observations in daughter j. Thus, n = n_1 + n_2. Note that n_1 = \#\{l : x_l \le c\} and n_2 = \#\{l : x_l > c\}.
Log-rank splitting

The log-rank test for a split at the value c for predictor x is

L(x,c) = \frac{\displaystyle\sum_{i=1}^{N}\left(d_{i,1} - Y_{i,1}\,\frac{d_i}{Y_i}\right)}{\sqrt{\displaystyle\sum_{i=1}^{N} \frac{Y_{i,1}}{Y_i}\left(1 - \frac{Y_{i,1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}.

The value |L(x,c)| is the measure of node separation. The larger the value of |L(x,c)|, the greater the difference between the two groups, and the better the split. In particular, the best split at node h is determined by finding the predictor x^* and split value c^* such that |L(x^*, c^*)| \ge |L(x, c)| for all x and c.
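A direct transcription of L(x, c) into code is straightforward. The sketch below (my own illustration on a toy data set; no ties handling beyond the variance correction shown in the formula) computes the statistic for a candidate split:

```python
def logrank_stat(times, events, x, c):
    """Log-rank statistic L(x, c) for the split x <= c versus x > c."""
    death_times = sorted({t for t, d in zip(times, events) if d == 1})
    num, var = 0.0, 0.0
    for ti in death_times:
        # At-risk counts and death counts at time ti, overall and in daughter 1.
        at_risk = [(t, d, xl) for t, d, xl in zip(times, events, x) if t >= ti]
        Y = len(at_risk)
        Y1 = sum(1 for t, d, xl in at_risk if xl <= c)
        di = sum(1 for t, d, xl in at_risk if t == ti and d == 1)
        di1 = sum(1 for t, d, xl in at_risk if t == ti and d == 1 and xl <= c)
        num += di1 - Y1 * di / Y
        if Y > 1:
            var += (Y1 / Y) * (1 - Y1 / Y) * ((Y - di) / (Y - 1)) * di
    return abs(num) / var ** 0.5 if var > 0 else 0.0

# Group x <= 0 dies early, group x > 0 dies late: strong separation at c = 0,
# no separation at a vacuous split point.
times  = [1, 2, 3, 10, 11, 12]
events = [1, 1, 1, 1, 1, 1]
x      = [0, 0, 0, 1, 1, 1]
print(logrank_stat(times, events, x, 0) > logrank_stat(times, events, x, 10))
```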
Another useful splitting rule is the log-rank score test introduced by Hothorn and Lausen (Hothorn et al., 2003). To describe this rule, assume the predictor x has been ordered so that x_1 \le x_2 \le \cdots \le x_n. Now compute the ranks for each survival time T_l,

a_l = \delta_l - \sum_{k=1}^{l} \frac{\delta_k}{n - k + 1}.

The log-rank score statistic is

S(x,c) = \frac{\displaystyle\sum_{x_l \le c} a_l - n_1 \bar{a}}{\sqrt{n_1\left(1 - \dfrac{n_1}{n}\right) s_a^2}},

where \bar{a} and s_a^2 are the sample mean and sample variance of \{a_l : l = 1, ..., n\}. Log-rank score splitting defines the measure of node separation by |S(x,c)|. Maximizing this value over x and c yields the best split.
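The scores a_l themselves are easy to compute. A minimal sketch (my own illustration; the data are processed in survival-time order and ties are ignored for simplicity):

```python
def logrank_scores(times, events):
    """Scores a_l = delta_l - sum_{k <= l} delta_k / (n - k + 1),
    accumulated in survival-time order (ties ignored in this sketch)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    scores = [0.0] * n
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += events[i] / (n - rank + 1)
        scores[i] = events[i] - running
    return scores

times  = [2, 1, 3]
events = [1, 1, 0]  # the longest-surviving case is censored
s = logrank_scores(times, events)
# The censored, longest-surviving case gets the most negative score,
# and the scores sum to zero.
print(min(s) == s[2], abs(sum(s)) < 1e-12)
```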
2.4 VIMP

Random forest is appreciated not only for its prediction accuracy, but also for its intrinsic ability to deal with complicated interactions. Multiple methods have been developed to perform variable selection using random forest. These methods are based on two measures: the variable importance measure (VIMP) and the minimal depth measure.

Variable importance, or VIMP, equals the change in prediction error, which can be either an increase or a decrease, when a particular variable is noised up, either by permutation or by random assignment of observations to the child nodes when the parent node is split on the given variable.
The permutation importance is assessed by comparing the prediction accuracy, in terms of correct classification rate or mean squared error (MSE), of a tree before and after random permutation of a predictor variable. Permutation destroys the original association with the response, so the accuracy is expected to drop for a relevant predictor. Therefore, when the difference in accuracy before and after permutation is small, the importance of X_j in predicting the response variable is small. In contrast, if the prediction accuracy drops substantially after the permutation, this indicates a strong association between X_j and the response variable. One algorithm for computing the importance score by permutation is as follows.
ables using VIMP and a stepwise ascending variable introduction strategy. The test-based variable selection methods apply a permutation test framework to estimate the significance of a variable's importance. For instance, in a method suggested by Hapfelmeier and Ulm, VIMP for a variable is recomputed after the variable is permuted. This procedure is repeated many times to assess the empirical distribution of importances under the null hypothesis that the variable is not predictive of the outcome. A p-value, reflecting the likelihood of the original VIMP within this empirical distribution, can then be calculated. Variables with p-values less than a predefined threshold are selected.
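The core permutation-importance computation can be sketched as follows (a generic illustration, not the exact algorithm of any method above; for determinism, one fixed cyclic shift stands in for a random permutation):

```python
def mse(model, X, y):
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j):
    """Error increase after scrambling column j (larger = more important).
    A fixed cyclic shift stands in for a random permutation in this sketch."""
    base = mse(model, X, y)
    col = [row[j] for row in X]
    col = col[1:] + col[:1]
    Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return mse(model, Xp, y) - base

# Toy model: the response depends only on x0; x1 is pure noise.
model = lambda row: 2.0 * row[0]
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 9.0], [4.0, 2.0], [5.0, 7.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(permutation_importance(model, X, y, 0))  # 16.0: large error increase
print(permutation_importance(model, X, y, 1))  # 0.0: no change for the noise variable
```

In practice the permutation is random and is typically repeated (or averaged over trees, using out-of-bag data) to stabilize the estimate.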
Minimal depth, defined as the distance from the root node to the root of the closest maximal v-subtree for variable v, is another measure that can be used for variable selection (Ishwaran et al., 2010). It measures how far a case travels down a tree before encountering the first split on variable v. A small minimal depth for a variable implies high predictive ability of the variable. The smallest possible minimal depth is 0, which means the variable splits the root node.
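For a single tree, minimal depth is just the depth of the first split on v along any path. A minimal sketch (my own illustration, with the tree represented as nested dictionaries and hypothetical variable names):

```python
def minimal_depth(tree, var, depth=0):
    """Distance from the root to the first split on `var` (None if never split on)."""
    if not tree or "split_var" not in tree:
        return None            # terminal node: no split on `var` below
    if tree["split_var"] == var:
        return depth
    depths = [minimal_depth(tree.get(side), var, depth + 1)
              for side in ("left", "right")]
    depths = [d for d in depths if d is not None]
    return min(depths) if depths else None

# A toy tree: the root splits on "age"; its daughters split on "bmi" and "bp".
tree = {
    "split_var": "age",
    "left":  {"split_var": "bmi", "left": {}, "right": {}},
    "right": {"split_var": "bp",  "left": {}, "right": {}},
}
print(minimal_depth(tree, "age"))  # 0: the variable splits the root node
print(minimal_depth(tree, "bp"))   # 1
```

In a forest, the per-tree minimal depths of v are averaged (or summarized by their distribution) to rank variables.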
Chapter 3
A commonly used algorithm for treating missing data in CART is based on the idea of a surrogate split [Chapter 5.3, Breiman et al. (1984)]. If s is the best split for a variable x, the surrogate split for s is the split s* using some other variable x* such that s* and s are closest to one another in terms of predictive association [Breiman et al. (1984)]. To assign a case that has a missing value for the variable used to split a node, the CART algorithm uses the best surrogate split among those variables not missing for the case.
The surrogate splitting method is not well suited for random forests, since RF randomly selects variables when splitting a node. A reasonable surrogate split may not exist within a node, as the randomly selected candidate variables may be uncorrelated. Speed is another issue: finding a surrogate split is computationally intensive and may become infeasible when growing a large number of trees. A further concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as VIMP. For these reasons, a different strategy is required for RF.
Three strategies can be, and have been, used for random forest missing value imputation:
1. Preimpute the data, then grow the forest, and update the original missing
values using certain criteria (such as proximity), based on the grown forest;
iterate for improved results.
2. Impute as the trees are grown; update by summary imputation; iterate for
improved results.
3. Preimpute, grow forest for each variable that has missing values, predict the
missing values using the grown forest, update the missing values with the
predicted values; iterate for improved results.
The proximity approach (Breiman, 2003; Liaw and Wiener, 2002) uses strategy one, while the adaptive tree method (Ishwaran et al., 2008) uses strategy two. A newer imputation method, missForest (Stekhoven and Buhlmann, 2012), first proposed in 2011, predicts the missing values of each variable using a forest grown with that variable as the response; it falls into the third strategy.
The proximity approach works as follows. First, the data are roughly imputed by replacing missing values for continuous variables with the median of the non-missing values, and by replacing missing values for categorical variables with the most frequently occurring non-missing value. A RF is then fit to the roughly imputed data, and a 'proximity matrix' is calculated from the fitted RF. The proximity matrix, an n × n symmetric matrix whose (i, j) entry records the frequency with which cases i and j occur within the same terminal node, is then used for imputing the data. For continuous variables, the missing values are imputed with the proximity-weighted average of the non-missing data; for integer variables, the missing values are imputed with the integer value having the largest average proximity over non-missing data. The updated data are then used as input to RF, and the procedure is iterated. The iterations end when a stable solution is reached.
The disadvantage of the proximity approach is that OOB estimates of prediction error are biased, generally on the order of 10-20% (Breiman, 2003). Further, because prediction error is biased, so are other measures based on it, such as VIMP. In addition, the proximity approach cannot predict test data with missing values. The adaptive tree imputation method addresses these issues by adaptively imputing missing data as a tree is grown, drawing randomly from the set of non-missing in-bag data within the working node. The imputing procedure is summarized as follows:
1. For each node h, impute missing data by drawing a random value from the in-bag non-missing values prior to splitting.
2. After the splitting, reset the imputed data in the daughter nodes to missing.
Proceed as in Step 1 until the tree can no longer be split.
3. The final summary imputed value is the average of the case’s imputed in-bag
values for a continuous variable, or the most frequent in-bag imputed value
for a categorical variable.
3.4 RF imputation
Three general strategies have been used for RF missing data imputation:
(A) Preimpute the data; grow the forest; update the original missing values using
proximity of the data. Iterate for improved results.
(B) Simultaneously impute data while growing the forest; iterate for improved
results.
(C) Preimpute the data; grow a forest using in turn each variable that has missing
values; predict the missing values using the grown forest. Iterate for improved
results.
Here we describe proximity imputation (strategy A). In this procedure the data are first roughly imputed using strawman imputation. A RF is fit using this imputed data. Using the resulting forest, the n × n symmetric proximity matrix (n equals the sample size) is determined, where the (i, j) entry records the in-bag frequency with which cases i and j share the same terminal node. The proximity matrix is used to impute the original missing values. For continuous variables, the proximity-weighted average of non-missing data is used; for categorical variables, the category with the largest average proximity over non-missing data is used. The updated data are used to grow a new RF, and the procedure is iterated.
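One pass of this procedure, for a single continuous variable, can be sketched as follows (an illustration only; the proximity matrix below is a hypothetical stand-in for one computed from a fitted forest):

```python
from collections import Counter

def strawman_impute(col, is_categorical):
    """Rough preimputation: median (upper median here) for continuous
    variables, most frequent value for categorical variables."""
    observed = [v for v in col if v is not None]
    if is_categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = sorted(observed)[len(observed) // 2]
    return [fill if v is None else v for v in col]

def proximity_impute(col, missing_idx, proximity):
    """Replace each originally missing entry with the proximity-weighted
    average of the originally observed values."""
    out = list(col)
    observed = [i for i in range(len(col)) if i not in missing_idx]
    for i in missing_idx:
        wsum = sum(proximity[i][j] for j in observed)
        out[i] = sum(proximity[i][j] * col[j] for j in observed) / wsum
    return out

col = [1.0, None, 3.0]
pre = strawman_impute(col, is_categorical=False)
# Hypothetical 3x3 proximity matrix from a fitted forest: case 1 co-occurs
# mostly with case 2, so its imputed value is pulled toward 3.0.
prox = [[1.0, 0.2, 0.1],
        [0.2, 1.0, 0.8],
        [0.1, 0.8, 1.0]]
upd = proximity_impute(pre, missing_idx={1}, proximity=prox)
print(upd[1])  # (0.2 * 1.0 + 0.8 * 3.0) / (0.2 + 0.8) = 2.6
```

Iterating replaces `pre` with `upd`, refits the forest, and recomputes the proximities until the imputed values stabilize.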
We use RFprx to refer to proximity imputation as described above. However, when implementing RFprx we use a slightly modified version that makes use of random splitting in order to increase computational speed. In random splitting, nsplit, a non-zero positive integer, is specified by the user. A maximum of nsplit split points are chosen randomly for each of the randomly selected mtry splitting variables. This is in contrast to the non-random (deterministic) splitting typically used by RF, where all possible split points for each of the potential mtry splitting variables are considered. The splitting rule is applied to the nsplit randomly selected split points, and the tree node is split on the variable whose random split point yields the best value, as measured by the splitting criterion. Random splitting evaluates the splitting rule over a much smaller number of split points and is therefore considerably faster than deterministic splitting.
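The nsplit idea is simple to sketch for a single variable (an illustration only, using the weighted-variance splitting rule; function and parameter names are mine):

```python
import random

def best_random_split(x, y, nsplit, rng):
    """Evaluate the splitting rule on at most `nsplit` randomly chosen
    split points, instead of on every possible split point."""
    def weighted_var(s):
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        if not left or not right:
            return float("inf")
        ss = lambda v: sum((vi - sum(v) / len(v)) ** 2 for vi in v)
        return (ss(left) + ss(right)) / len(y)

    candidates = sorted(set(x))[:-1]  # all valid deterministic split points
    chosen = rng.sample(candidates, min(nsplit, len(candidates)))
    return min(chosen, key=weighted_var)

x = [float(i) for i in range(1, 21)]
y = [0.0] * 10 + [10.0] * 10          # jump between x = 10 and x = 11
s = best_random_split(x, y, nsplit=5, rng=random.Random(42))
print(1.0 <= s <= 19.0)  # a valid split; with enough candidates, near 10
```

Only 5 of the 19 candidate split points are evaluated, which is the source of the speedup; the chosen split is the best among the random candidates rather than the global optimum.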
The limiting case of random splitting is pure random splitting. The tree node is split by selecting a variable and the split-point completely at random; no splitting rule is applied, i.e., splitting is completely non-adaptive to the data. Pure random splitting is generally the fastest type of random splitting. We also apply RFprx using pure random splitting; this algorithm is denoted by RFprxR.

As an extension to the above methods, we implement iterated versions of RFprx and RFprxR. To distinguish between the different algorithms, we write RFprx.k and RFprxR.k when they are iterated k ≥ 1 times. Thus, RFprx.5 and RFprxR.5 indicate that the algorithms were iterated 5 times, while RFprx.1 and RFprxR.1 indicate that the algorithms were not iterated. As this latter notation is somewhat cumbersome, for notational simplicity we will simply use RFprx to denote RFprx.1 and RFprxR to denote RFprxR.1.
Specific details of OTFI can be found in Ishwaran et al. (2008, 2016), but for convenience we summarize the key aspects of OTFI below:

1. Only non-missing data are used to calculate the split-statistic for splitting a tree node.

2. When assigning left and right daughter node membership, if the variable used to split the node has missing data, the missing data for that variable are "imputed" by drawing a random value from the in-bag non-missing data.

3. Following a node split, imputed data are reset to missing, and the process is repeated until terminal nodes are reached. Note that after terminal node assignment, imputed data are reset back to missing, just as was done for all nodes.

4. Missing data in terminal nodes are then imputed using OOB non-missing terminal node data from all the trees. For integer-valued variables, a maximal class rule is used; a mean rule is used for continuous variables.
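The daughter-assignment step (Step 2) can be sketched as follows (a minimal illustration of the idea, not the package implementation; names are mine):

```python
import random

def assign_daughter(value, split_point, inbag_nonmissing, rng):
    """Assign a case at a node split. A missing value is temporarily 'imputed'
    by a random draw from the in-bag non-missing values of the split variable;
    the drawn value is used only for routing and is discarded afterward."""
    if value is None:
        value = rng.choice(inbag_nonmissing)
    return "left" if value <= split_point else "right"

rng = random.Random(1)
inbag = [0.2, 0.4, 3.1, 3.5]
# A case missing the split variable is still routed to a daughter node:
side = assign_daughter(None, 1.0, inbag, rng)
print(side in ("left", "right"))  # True; the temporary value is then discarded
```

Because the drawn value is discarded after routing, the split-statistics themselves are never computed from imputed data, which is the defining property of OTFI.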
It should be emphasized that the purpose of the "imputed data" in Step 2 is only to make it possible to assign cases to daughter nodes: imputed data are not used to calculate the split-statistic, and imputed data are only temporary, being reset to missing after node assignment. Thus, at the completion of growing the forest, the resulting forest contains missing values in its terminal nodes and no imputation has been done up to this point. Step 4 is added as a means for imputing the data, but this step can be skipped if imputation itself is not the goal. In particular, Step 4 is not required if the goal is to use the forest for prediction. This applies even when the test data used for prediction have missing values. In such a scenario, test case assignment works in the same way as in Step 2: missing test values are imputed as in Step 2 using the original grow distribution from the training forest, and the test case is assigned to its daughter node. Following this, the missing test data are reset back to missing, as in Step 3, and the process is repeated.
This method of assigning cases with missing data, which is well suited for forests, is in contrast to the surrogate splitting utilized by CART (Breiman et al., 1984). To assign a case having a missing value for the variable used to split a node, CART uses the best surrogate split among those variables not missing for the case. This ensures every case can be classified optimally, whether or not the case has missing values. However, while surrogate splitting works well for CART, the method is not well suited for forests. Computational burden is one issue: finding a surrogate split is computationally expensive even for one tree, let alone for a large number of trees. Another concern is that surrogate splitting works tangentially to the random feature selection used by forests. In RF, the variables used to split a node are selected randomly, and as such they may be uncorrelated, so a reasonable surrogate split may not exist. A final concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as variable importance (Ishwaran et al., 2008).
To denote the OTFI missing data algorithm, we will use the abbreviation RFotf. As in proximity imputation, to increase computational speed, RFotf is implemented using nsplit random splitting. We also consider OTFI under pure random splitting and denote this algorithm by RFotfR. Both algorithms are iterated in our studies. RFotf and RFotfR will be used to denote a single iteration, while RFotf.5 and RFotfR.5 denote five iterations. Note that when the OTFI algorithms are iterated, the terminal node imputation executed in Step 4 uses in-bag data rather than OOB data after the first cycle. This is because after the first cycle of the algorithm, no coherent OOB sample exists.
Thus, like OTF splitting, one can see that MIA results in a forest ensemble constructed without having to impute data.

For univariate regression, the squared-error splitting rule seeks to minimize

D(s,t) = \frac{1}{n}\sum_{i \in t_L} (Y_i - \bar{Y}_{t_L})^2 + \frac{1}{n}\sum_{i \in t_R} (Y_i - \bar{Y}_{t_R})^2,
where \bar{Y}_{t_L} and \bar{Y}_{t_R} are the sample means for t_L and t_R, respectively. The best split for X is the split-point s minimizing D(s,t). To extend the squared-error splitting rule to the multivariate case q > 1, we apply univariate splitting to each response coordinate separately. Let Y_i = (Y_{i,1}, ..., Y_{i,q})^T denote the q \ge 1 dimensional outcome. For multivariate regression analysis, an averaged standardized variance splitting rule is used. The goal is to minimize

D_q(s,t) = \sum_{j=1}^{q} \left\{ \sum_{i \in t_L} (Y_{i,j} - \bar{Y}_{t_L j})^2 + \sum_{i \in t_R} (Y_{i,j} - \bar{Y}_{t_R j})^2 \right\},
where \bar{Y}_{t_L j} and \bar{Y}_{t_R j} are the sample means of the j-th response coordinate in the
left and right daughter nodes. Notice that such a splitting rule can only be effective
if each coordinate of the outcome is measured on the same scale; otherwise a
coordinate j with very large values would
dominate Dq (s, t). We therefore calibrate Dq (s, t) by assuming that each coordinate
has been standardized according to
\[
\frac{1}{n}\sum_{i \in t} Y_{i,j} = 0, \qquad \frac{1}{n}\sum_{i \in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q.
\]
For univariate classification, Gini splitting is used, where Z_{i(k)} = 1_{\{Y_i = k\}} indicates
membership in class k. Now consider the multivariate classification scenario r > 1,
where each outcome coordinate Y_{i,j} for 1 ≤ j ≤ r is a class label from {1, . . . , K_j}
for K_j ≥ 2. We apply Gini splitting to each coordinate, yielding the extended Gini
splitting rule
\[
G^{*}_{r}(s, t) = \sum_{j=1}^{r} \frac{1}{K_j} \left[ \sum_{k=1}^{K_j} \left\{ \frac{1}{n_L}\left(\sum_{i \in t_L} Z_{i(k),j}\right)^{2} + \frac{1}{n_R}\left(\sum_{i \in t_R} Z_{i(k),j}\right)^{2} \right\} \right] \tag{4.2}
\]
where Z_{i(k),j} = 1_{\{Y_{i,j} = k\}}. Note that the normalization 1/K_j employed for a
coordinate j is required to standardize the contribution of the Gini split from that
coordinate.
Observe that (4.1) and (4.2) are equivalent optimization problems, with optimization
over Y^{*}_{i,j} for regression and Z_{i(k),j} for classification. As shown in Ishwaran
(2015), this leads to similar theoretical splitting properties in regression and
classification settings. Given this equivalence, we can combine the two splitting
rules to form a composite splitting rule. The mixed outcome splitting rule Θ(s, t)
is a composite standardized split rule of mean-squared error (4.1) and Gini index
splitting (4.2); i.e.,
\[
\Theta(s, t) = D^{*}_{q}(s, t) + G^{*}_{r}(s, t),
\]
where p = q + r. The best split for X is the value of s maximizing Θ(s, t).
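As a concrete illustration of the composite rule, the sketch below (our own illustrative code, not the RF-SRC implementation; negating the squared-error term is our device for turning both components into quantities to be jointly maximized, which assumes the continuous coordinates have been standardized within the node) scores a candidate split x ≤ s for a mixed outcome with one continuous coordinate and one class-valued coordinate.

```python
def squared_error_term(y, left_mask):
    """Within-node sum of squares over the left/right daughters (D_q term)."""
    out = 0.0
    for side in (True, False):
        grp = [yi for yi, m in zip(y, left_mask) if m == side]
        mean = sum(grp) / len(grp)
        out += sum((yi - mean) ** 2 for yi in grp)
    return out

def gini_term(z, num_classes, left_mask):
    """Extended Gini contribution for one class-valued coordinate,
    normalized by 1/K; larger values indicate a better split."""
    out = 0.0
    for side in (True, False):
        grp = [zi for zi, m in zip(z, left_mask) if m == side]
        n = len(grp)
        out += sum(grp.count(k) ** 2 / n for k in range(num_classes))
    return out / num_classes

def composite_split_value(x, y_cont, y_class, num_classes, s):
    """Theta(s, t) for the candidate split x <= s. The squared-error term
    is negated so that both terms are jointly maximized; y_cont is assumed
    already standardized within the node (mean 0, variance 1)."""
    left = [xi <= s for xi in x]
    return (-squared_error_term(y_cont, left)
            + gini_term(y_class, num_classes, left))
```

In a real split search this value would be evaluated over the nsplit randomly chosen candidate values of s and the maximizer kept.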
One alternative would be a Mahalanobis-type multivariate splitting rule,
\[
M_q(s, t) = \sum_{i \in t_L} \bigl(Y_i - \bar{Y}_L\bigr)^{T} V_L^{-1} \bigl(Y_i - \bar{Y}_L\bigr) + \sum_{i \in t_R} \bigl(Y_i - \bar{Y}_R\bigr)^{T} V_R^{-1} \bigl(Y_i - \bar{Y}_R\bigr)
\]
where VL and VR are the estimated covariance matrices for the left and right
daughter nodes. While this is a reasonable approach in low dimensional problems,
recall that we are applying Dq (s, t) to ytry of the feature variables which could be
large if the feature space dimension p is large. Also, because of missing data in the
features, it may be difficult to derive estimators for VL and VR , which is further
complicated if their dimensions are high. This problem becomes worse as the tree
is grown deeper, since node sample sizes become progressively smaller.
The missForest algorithm recasts the missing data problem as a prediction problem.
Data is imputed by regressing each variable in turn against all other variables and
then predicting missing data for the dependent variable using the fitted forest. With
p variables, this means that p forests must be fit for each iteration, which can be
slow in certain problems. Therefore, we introduce a computationally faster version
of missForest, which we call mForest. The new algorithm is described as follows.
Do a quick strawman imputation of the data. The p variables in the data set are
randomly assigned into mutually exclusive groups of approximate size ap, where
0 < a < 1. Each group in turn acts as the multivariate response to be regressed
on the remaining variables (of approximate size (1 − a)p). Over the multivariate
responses, set imputed values back to missing. Grow a forest using composite mul-
tivariate splitting. As in RFunsv , missing values in the response are excluded when
using multivariate splitting: the split-rule is averaged over non-missing responses
only. Upon completion of the forest, the missing response values are imputed using
prediction. Cycle over all of the ≈ 1/a multivariate regressions in turn, thus
completing one iteration. Check whether the difference between the current imputed
data and the previously imputed data has fallen below an ε-tolerance value
(see Section 3.3 for measuring imputation accuracy). Stop if it has; otherwise repeat
until convergence.
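The steps above can be outlined as follows. This is a sketch only, not the RF-SRC implementation: the multivariate forest fit and prediction are abstracted behind a caller-supplied `fit_predict` function (a hypothetical stand-in for growing a forest with composite multivariate splitting), and convergence is checked via the maximum absolute change between successive imputations.

```python
import random
import statistics

def strawman_impute(data):
    # Column-wise median imputation; None marks a missing entry.
    cols = list(zip(*data))
    meds = [statistics.median([v for v in c if v is not None]) for c in cols]
    return [[meds[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def mforest_impute(data, fit_predict, alpha=0.25, max_iter=5, tol=1e-3, seed=0):
    # Sketch of one mForest run. `fit_predict(imputed, grp, rows_miss)` stands
    # in for growing a multivariate forest on the remaining variables and
    # predicting the group's values for the rows listed in `rows_miss`.
    n, p = len(data), len(data[0])
    missing = {(i, j) for i, row in enumerate(data)
               for j, v in enumerate(row) if v is None}
    imputed = strawman_impute(data)          # step 1: strawman imputation
    rng = random.Random(seed)
    for _ in range(max_iter):
        idx = list(range(p))
        rng.shuffle(idx)                     # random mutually exclusive groups
        k = max(1, round(alpha * p))
        groups = [idx[g:g + k] for g in range(0, p, k)]   # ~1/alpha groups
        prev = [row[:] for row in imputed]
        for grp in groups:                   # each group = multivariate response
            rows_miss = [i for i in range(n)
                         if any((i, j) in missing for j in grp)]
            if not rows_miss:
                continue
            preds = fit_predict(imputed, grp, rows_miss)
            for i, pred in zip(rows_miss, preds):
                for j, v in zip(grp, pred):
                    if (i, j) in missing:    # only originally missing cells
                        imputed[i][j] = v
        delta = max(abs(imputed[i][j] - prev[i][j]) for i, j in missing)
        if delta < tol:                      # imputations have stabilized
            break
    return imputed
```

Any regressor can be plugged in for `fit_predict`; in the algorithm described above it is a forest grown with composite multivariate splitting over the non-missing responses.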
\[
\mathcal{E}(X^{*}, X^{**}) = \frac{1}{\#N} \sum_{j \in N} \sqrt{\frac{\sum_{i=1}^{n} I_{i,j}\,\bigl(X^{**}_{i,j} - X^{*}_{i,j}\bigr)^{2} / n_j}{\sum_{i=1}^{n} I_{i,j}\,\bigl(X^{*}_{i,j} - \bar{X}^{*}_{j}\bigr)^{2} / n_j}}
+ \frac{1}{\#C} \sum_{j \in C} \frac{\sum_{i=1}^{n} I_{i,j}\, 1_{\{X^{**}_{i,j} \ne X^{*}_{i,j}\}}}{n_j},
\]
where \bar{X}^{*}_{j} = \sum_{i=1}^{n} I_{i,j} X^{*}_{i,j} / n_j. Note that standardized root-mean-squared error
(RMSE) was used to assess imputation difference for continuous variables, and
misclassification error for factors.
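This error measure can be transcribed as follows (a sketch; variable names are ours, and n_j is taken as the number of artificially missing entries in column j):

```python
import math

def imputation_error(x_true, x_imp, miss_mask, categorical):
    # x_true: fully observed values X*; x_imp: imputed values X**;
    # miss_mask[i][j] is True where the entry was made missing (I_{i,j});
    # categorical: set of column indices treated as factors.
    n, p = len(x_true), len(x_true[0])
    num_terms, cat_terms = [], []
    for j in range(p):
        rows = [i for i in range(n) if miss_mask[i][j]]
        if not rows:
            continue
        nj = len(rows)
        if j in categorical:
            # misclassification rate over the induced-missing entries
            cat_terms.append(sum(x_imp[i][j] != x_true[i][j] for i in rows) / nj)
        else:
            mean = sum(x_true[i][j] for i in rows) / nj
            mse = sum((x_imp[i][j] - x_true[i][j]) ** 2 for i in rows) / nj
            var = sum((x_true[i][j] - mean) ** 2 for i in rows) / nj
            num_terms.append(math.sqrt(mse / var))  # standardized RMSE
    total = 0.0
    if num_terms:
        total += sum(num_terms) / len(num_terms)
    if cat_terms:
        total += sum(cat_terms) / len(cat_terms)
    return total
```

A perfect imputation returns 0; a value of 1 for a continuous column means its imputation error matches the column's own variability over the missing entries.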
Imputation Performance
5.1 Methods
Table 5.1 lists the nine experiments that were carried out. In each experiment, a
pre-specified target percentage of missing values was induced using one of three
different missing mechanisms (Rubin, 1976):
Table 5.1: Experimental design used for large scale study of RF missing data algo-
rithms.
Experiment  Missing Mechanism  Percent Missing
EXPT-A MCAR 25
EXPT-B MCAR 50
EXPT-C MCAR 75
EXPT-D MAR 25
EXPT-E MAR 50
EXPT-F MAR 75
EXPT-G NMAR 25
EXPT-H NMAR 50
EXPT-I NMAR 75
Sixty data sets were used, including both real and synthetic data. Figure 5.1
illustrates the diversity of the data. Displayed are data sets in terms of correlation (ρ),
sample size (n), number of variables (p), and the amount of information contained
in the data (I = log10(n/p)). The correlation, ρ, was defined as the L2-norm of the
correlation matrix. If R = (ρ_{i,j}) denotes the p × p correlation matrix for a data set,
ρ was defined to equal
\[
\rho = \left[ \binom{p}{2}^{-1} \sum_{j=1}^{p} \sum_{k < j} |\rho_{j,k}|^{2} \right]^{1/2}. \tag{5.1}
\]
This is similar to the usual definition of the L2-norm for a matrix, but we have
modified the definition to remove the diagonal elements of R, which equal 1,
as well as the contribution from the symmetric lower-diagonal values.
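Definition (5.1) can be computed directly from a correlation matrix. A minimal sketch:

```python
import math

def correlation_norm(corr_matrix):
    # rho as in (5.1): L2-norm of the strictly lower-triangular entries
    # of the p x p correlation matrix, normalized by C(p, 2).
    p = len(corr_matrix)
    pairs = p * (p - 1) // 2
    total = sum(abs(corr_matrix[j][k]) ** 2
                for j in range(p) for k in range(j))
    return math.sqrt(total / pairs)
```

For example, a matrix whose off-diagonal correlations all equal 0.5 yields ρ = 0.5, while the identity matrix yields ρ = 0.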
Note that in the plot for p there are 10 data sets with p in the thousands;
these are a collection of well known gene expression data sets. The right-most plot
displays the log-information of a data set, I = log10(n/p). The range of values on
the log-scale varies from −2 to 2; thus the information contained in a data set can
differ by as much as ≈ 10^4.
Figure 5.1: Summary values for the 60 data sets used in the large scale RF missing
data experiment. The last panel displays the log-information, I = log10 (n/p), for
each data set.
The following procedures were used to induce missingness in the data. Let the target
missingness fraction be 0 < g < 1. For MCAR, data was set to missing randomly
without imposing column or row constraints on the data matrix. Specifically, the
data matrix was made into a long vector and ng of the entries selected at random
and set to missing.
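A minimal sketch of this MCAR mechanism (illustrative; `None` marks a missing entry):

```python
import random

def induce_mcar(data, g, seed=0):
    # Treat the n x p matrix as one long vector, pick a fraction g of the
    # entries at random, and set them to None (missing), with no row or
    # column constraints.
    n, p = len(data), len(data[0])
    k = int(n * p * g)
    rng = random.Random(seed)
    cells = rng.sample([(i, j) for i in range(n) for j in range(p)], k)
    out = [row[:] for row in data]
    for i, j in cells:
        out[i][j] = None
    return out
```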
For MAR, missing values were assigned by column. Let X_j = (X_{1,j}, . . . , X_{n,j})
be the n-dimensional vector containing the original values of the jth variable,
1 ≤ j ≤ p. Each coordinate of X_j was made missing according to the tail behavior
of a randomly selected covariate X_k, where k ≠ j. The probability of selecting
coordinate X_{i,j} was
\[
P\{\text{selecting } X_{i,j} \mid B_j\} \propto
\begin{cases}
F(X_{i,k}) & \text{if } B_j = 1\\
1 - F(X_{i,k}) & \text{if } B_j = 0,
\end{cases}
\]
where F denotes the empirical distribution function of X_k and B_j is a randomly
chosen zero-one indicator.
For NMAR, the same tail-based procedure was applied with k = j, so that the
selection probability depends on X_{i,j} itself. Notice that missingness in X_{i,j} then
depends on both observed and missing values. In particular, missing values occur
with higher probability in the right and left tails of the empirical distribution.
Therefore, this induces NMAR.
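Both tail-based mechanisms can be sketched as below. The weighted sampling-without-replacement scheme (Efraimidis-Spirakis exponentiated keys) is our own choice, since the text does not specify how proportional selection was implemented; choosing k ≠ j gives MAR, while k = j ties selection to the variable's own value and hence induces NMAR.

```python
import random

def induce_tail_missing(data, j, k, g, seed=0):
    # Set ~n*g entries of column j to None, chosen with probability
    # proportional to F(X_{i,k}) or 1 - F(X_{i,k}) depending on a coin
    # flip B_j, where F is the empirical CDF of column k.
    rng = random.Random(seed)
    n = len(data)
    col = [row[k] for row in data]
    order = sorted(range(n), key=lambda i: col[i])
    F = [0.0] * n
    for rank, i in enumerate(order):
        F[i] = (rank + 1) / n            # empirical CDF value
    b = rng.random() < 0.5               # B_j: which tail gets the mass
    w = [F[i] if b else 1.0 - F[i] + 1.0 / n for i in range(n)]
    # weighted sampling without replacement via exponentiated random keys
    m = int(n * g)
    keys = sorted(range(n), key=lambda i: rng.random() ** (1.0 / w[i]),
                  reverse=True)
    out = [row[:] for row in data]
    for i in keys[:m]:
        out[i][j] = None
    return out
```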
\[
\mathcal{E}(I) = \frac{1}{\#N} \sum_{j \in N} \sqrt{\frac{\sum_{i=1}^{n} 1_{i,j}\,\bigl(X_{i,j} - X^{*}_{i,j}\bigr)^{2} / n_j}{\sum_{i=1}^{n} 1_{i,j}\,\bigl(X_{i,j} - \bar{X}_{j}\bigr)^{2} / n_j}}
+ \frac{1}{\#C} \sum_{j \in C} \frac{\sum_{i=1}^{n} 1_{i,j}\, 1_{\{X_{i,j} \ne X^{*}_{i,j}\}}}{n_j},
\]
where \bar{X}_j = \sum_{i=1}^{n} (1_{i,j} X_{i,j}) / n_j. To be clear regarding the standardized RMSE,
observe that the denominator in the first term is the variance of X_j over the
artificially induced missing values, while the numerator is the MSE difference of X_j
and X^{*}_j over the induced missing values.
As a benchmark for assessing imputation accuracy we used strawman imputa-
tion described earlier, which we denote by s. Imputation error for a procedure I
was compared to s using relative imputation error defined as
\[
\mathcal{E}_R(I) = 100 \times \frac{\mathcal{E}(I)}{\mathcal{E}(s)}.
\]
A value of less than 100 indicates a procedure I performing better than the straw-
man.
Randomized splitting was invoked with an nsplit value of 10. For random feature
selection, mtry was set to √p. For random outcome selection for RFunsv, we
set ytry to equal √p. Algorithms RFotf, RFunsv and RFprx were iterated 5 times in
addition to being run for a single iteration. For mForest, the percentage of variables
used as responses was a = .05, .25. This implies that mRF0.05 used up to 20 regres-
sions per cycle, while mRF0.25 used 4. Forests for all procedures were grown using
a nodesize value of 1. Number of trees was set at ntree = 500. Each experi-
mental setting (Table 5.1) was run 100 times independently and results averaged.
For comparison, k-nearest neighbors imputation (hereafter denoted as KNN)
was applied using the impute.knn function from the R-package impute (Hastie
et al., 2015). For each data point with missing values, the algorithm determines the
k-nearest neighbors using a Euclidean metric, confined to the columns for which
that data point is not missing. The missing elements for the data point are then
imputed by averaging the non-missing elements of its neighbors. The number of
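impute.knn is an R function; the Python sketch below only mirrors the logic just described (Euclidean distance restricted to the columns observed in the target row, then averaging the neighbors' non-missing values) and is not the package's implementation.

```python
import math

def knn_impute(data, k=2):
    # For each row with missing entries (None): find the k nearest rows by
    # Euclidean distance over the columns observed in the target row, then
    # fill each missing entry with the average of the neighbors'
    # non-missing values in that column.
    n, p = len(data), len(data[0])
    out = [row[:] for row in data]
    for i, row in enumerate(data):
        miss = [j for j in range(p) if row[j] is None]
        if not miss:
            continue
        obs = [j for j in range(p) if row[j] is not None]
        dists = []
        for r, other in enumerate(data):
            if r == i:
                continue
            shared = [j for j in obs if other[j] is not None]
            if not shared:
                continue
            d = math.sqrt(sum((row[j] - other[j]) ** 2 for j in shared)
                          / len(shared))
            dists.append((d, r))
        neighbors = [r for _, r in sorted(dists)[:k]]
        for j in miss:
            vals = [data[r][j] for r in neighbors if data[r][j] is not None]
            if vals:
                out[i][j] = sum(vals) / len(vals)
    return out
```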
5.2 Results
In reporting the values for imputation accuracy, we have stratified data sets into low,
medium and high-correlation groups, where correlation, ρ, was defined as in (5.1).
The low, medium and high-correlation groups were defined as those whose ρ value
fell into the [0, 50], (50, 75] and (75, 100] percentile ranges, respectively. Results were
stratified by ρ because we found it played a very heavy role in imputation
performance and was much more informative than other quantities measuring
information about a data set. Consider for example the log-information for a data set,
I = log10 (n/p), which reports the information of a data set by adjusting its sample
size by the number of features. While this is a reasonable measure, Figure 5.2 shows
that I is not nearly as effective as ρ in predicting imputation accuracy. The figure
displays the ANOVA effect sizes for ρ and I from a linear regression in which log
relative imputation error was used as the response. In addition to ρ and I, covariates
in the regression also included the type of RF procedure. The effect
size was defined as the estimated coefficients for the standardized values of ρ and
I. The two covariates ρ and I were standardized to have a mean of zero
and variance of one, which makes it possible to directly compare their estimated
putation accuracy and that both exhibit the same pattern. Within a specific type of
missing data mechanism, say MCAR, importance of each variable decreases with
missingness of data (MCAR 0.25, MCAR 0.5, MCAR 0.75). However, while the
pattern of the two measures is similar, the effect size of ρ is generally much larger
than that of I. The only exceptions are the MAR 0.75 and NMAR 0.75 experiments, but
these two experiments are the least interesting. As will be discussed below, nearly
all methods performed poorly here.
Correlation
Imputation accuracy is summarized in Figure 5.3 and Table 5.2. Figure 5.4
presents the same results as Figure 5.3 in a more compact format.
Figure 5.3 and Table 5.2, which have been stratified by correlation group, show
the importance of correlation for RF imputation procedures. In general, imputation
accuracy improves with correlation. Over the high correlation data,
Figure 5.2: ANOVA effect size for the log-information, I = log10(n/p), and correlation,
ρ (defined as in (5.1)), from a linear regression using log relative imputation
error, log10(ER(I)), as the response. In addition to I and ρ, covariates
in the regression included type of RF procedure used. ANOVA effect sizes are the
estimated coefficients of the standardized variables (standardized to have mean zero
and variance 1).
Figure 5.3: Relative imputation error, ER(I), stratified and averaged by level
of correlation of a data set. Procedures are: RFotf, RFotf.5 (on the fly imputation
with 1 and 5 iterations); RFotfR, RFotfR.5 (similar to RFotf and RFotf.5 but using
pure random splitting); RFunsv, RFunsv.5 (multivariate unsupervised splitting with
1 and 5 iterations); RFprx, RFprx.5 (proximity imputation with 1 and 5 iterations);
RFprxR, RFprxR.5 (same as RFprx and RFprx.5 but using pure random splitting); mRF0.25,
mRF0.05, mRF (mForest imputation, with 25%, 5%, and 1 variable(s) used as the response).
Figure 5.4: Relative imputation error for a procedure over 60 data sets averaged
over 100 runs, stratified and averaged by level of correlation of a data set. Top row
displays values by procedure; bottom row displays values in terms of the 9 different
experiments (Table 1). Procedures were: RFo.1, RFo.5 (on the fly imputation with
1 and 5 iterations); RFor.1, RFor.5 (similar to RFo.1 and RFo.5, but using pure
random splitting); RFu.1, RFu.5 (multivariate unsupervised splitting with 1 and 5
iterations); RFp.1, RFp.5 (proximity imputation with 1 and 5 iterations); RFpr.1,
RFpr.5 (same as RFp.1 and RFp.5, but using pure random splitting); mRF0.25,
mRF0.05, mRF (mForest imputation, with 25%, 5% and 1 variable(s) used as the
response); KNN (k-nearest neighbor imputation). See Section 3.4 for more details
regarding parameter settings of procedures.
Table 5.2: Relative imputation error, ER(I), by missing data mechanism and percent missing, stratified by correlation group.

Medium Correlation
MCAR MAR NMAR
.25 .50 .75 .25 .50 .75 .25 .50 .75
RFotf 82.3 89.9 95.6 78.8 88.6 97.0 92.7 92.6 102.2
RFotf.5 76.2 82.1 90.0 83.4 79.1 93.4 99.6 89.1 100.8
RFotfR 83.1 91.4 96.0 80.3 90.3 97.4 92.2 96.1 105.3
RFotfR.5 82.4 84.1 93.1 88.2 84.2 95.1 112.0 97.1 104.5
RFunsv 80.4 88.4 95.9 76.1 87.7 97.5 87.3 92.7 104.7
RFunsv.5 73.2 78.9 89.3 78.8 79.0 92.4 98.8 92.8 104.2
RFprx 82.6 86.3 93.1 80.7 80.5 97.7 88.6 93.8 99.5
RFprx.5 77.1 84.1 93.3 86.5 77.0 92.1 98.1 93.7 101.0
RFprxR 81.2 85.4 93.1 80.4 82.4 96.3 89.2 97.2 101.3
RFprxR.5 76.1 80.8 92.0 82.1 77.7 95.1 102.1 96.6 105.1
mRF0.25 73.8 80.2 91.6 75.3 75.6 90.2 97.6 87.5 102.1
mRF0.05 70.9 80.1 95.2 70.1 76.6 93.0 87.4 87.9 103.4
mRF 69.6 80.1 95.0 71.3 74.6 92.4 86.9 87.8 103.1
KNN 79.8 93.5 105.3 80.2 96.0 98.7 93.9 98.3 102.1
High Correlation
MCAR MAR NMAR
.25 .50 .75 .25 .50 .75 .25 .50 .75
RFotf 72.3 83.7 94.6 65.5 83.3 98.4 66.5 84.8 100.4
RFotf.5 70.9 72.1 80.9 69.5 70.9 91.0 70.1 70.8 97.3
RFotfR 68.6 81.0 93.6 59.5 87.1 98.9 61.2 88.2 100.3
RFotfR.5 58.4 58.9 64.6 56.7 55.1 88.4 58.4 60.9 97.3
RFunsv 62.1 75.1 91.3 56.8 70.8 97.8 58.1 73.3 100.6
RFunsv.5 54.2 57.5 65.4 54.0 49.4 80.0 55.4 51.7 90.7
RFprx 75.5 82.0 88.5 70.7 72.8 94.3 70.9 74.3 102.0
RFprx.5 70.4 72.0 78.6 69.7 71.2 90.3 70.0 72.2 98.2
RFprxR 61.9 68.1 76.6 58.7 64.1 79.5 60.4 74.6 97.5
RFprxR.5 57.3 58.1 61.9 55.9 54.1 71.9 57.8 60.2 93.7
mRF0.25 57.0 57.9 63.3 55.5 50.4 70.5 56.7 50.7 87.3
mRF0.05 50.7 54.0 61.7 48.3 48.4 74.9 49.9 48.6 85.9
mRF 48.2 49.8 61.3 47.0 47.5 70.2 46.6 47.6 82.9
KNN 52.7 63.2 83.2 52.0 71.1 96.4 53.2 74.9 99.2
mForest algorithms were by far the best. In some cases, they achieved a relative im-
putation error of 50, which means their imputation error was half of the strawman’s
value. Generally there are no noticeable differences between mRF (missForest) and
mRF0.05 . Performance of mRF0.25 , which uses only 4 regressions per cycle (as op-
posed to p for mRF), is also very good. Other algorithms that performed well in
high correlation settings were RFprxR.5 (proximity imputation with random splitting,
iterated 5 times) and RFunsv.5 (unsupervised multivariate splitting, iterated 5 times).
Of these, RFunsv.5 tended to perform slightly better in the medium and low correla-
tion settings. We also note that while mForest also performed well over medium
correlation settings, performance was not superior to other RF procedures in low
correlation settings, and sometimes was worse than procedures like RFunsv.5 . Re-
garding the comparison procedure KNN, while its performance also improved with
increasing correlation, performance in the medium and low correlation settings was
generally much worse than RF methods.
The missing data mechanism also plays an important role in accuracy of RF pro-
cedures. Accuracy decreased systematically when going from MCAR to MAR and
NMAR. Except for heavy missingness (75%), all RF procedures under MCAR and
MAR were more accurate than strawman imputation. Performance in NMAR was
generally poor unless correlation was high.
Heavy missingness
Accuracy degraded with increasing missingness. This was especially true when
missingness was high (75%). For NMAR data with heavy missingness, procedures
were not much better than strawman (and sometimes worse), regardless of correlation.
Iterating RF algorithms
Figure 5.5 displays the log of total elapsed time of a procedure averaged over all
experimental conditions and runs, with results ordered by the log-computational
complexity of a data set, c = log10 (np). The same results are also displayed in
Figure 5.7 in a more compact manner. The fastest algorithm is KNN, which is generally
3 times faster on the log-scale, i.e. 1000 times faster, than the slowest algorithm,
mRF (missForest). To improve clarity of these differences, Figure 5.6 displays the
computing time of each procedure relative to KNN (obtained by subtracting
the KNN log-time from each procedure's log-time). This new figure shows that
while mRF is 1000 times slower than KNN, the multivariate mForest algorithms,
mRF0.05 and mRF0.25 , improve speeds by about a factor of 10. After this, the next
slowest procedures are the iterated algorithms. Following this are the non-iterated
algorithms. Some of these latter algorithms, such as RFotf , are 100 times faster than
missForest; or only 10 times slower than KNN. These kinds of differences can have
a real effect when dealing with big data. We have experienced settings where OTF
algorithms can take hours to run. This means that the same data would take
5.3 Simulations
where " was simulated independently from a N(0, 0.5) distribution. Variables X1
and X2 were correlated with a correlation coefficient of 0.96, and X5 and X6 were
correlated with value 0.96. The remaining variables were not correlated. Vari-
ables X1 , X2 , X5 , X6 were N(3, 3), variables X3 , X10 were N(1, 1), variable X8
was N(3, 4), and variables X4 , X7 , X9 were exponentially distributed with mean
0.5.
The sample size (n) was chosen to be 100, 200, 500, 1000, and 2000. Data was
made missing using the MCAR, MAR, and NMAR missing data procedures de-
scribed earlier. Percentage of missing data was set at 25%. All imputation param-
eters were set to the same values as used in our previous experiments as described
in Section 3.4. Each experiment was repeated 500 times and the relative imputa-
tion error, ER (I ), recorded in each instance. Figure 5.8 displays the mean relative
imputation error for a RF procedure and its standard deviation for each sample size
Figure 5.5: Log of computing time for a procedure versus log-computational com-
plexity of a data set, c = log10 (np).
[Figure 5.6 appears here: relative computing time of each procedure (procedure log-time minus KNN log-time) plotted against c = log10(np).]
Figure 5.7: Log of computing time for a procedure versus log of dimension of a
data set, with compute times averaged over runs and experimental conditions.
setting. As can be seen, values improve with increasing n. It is also noticeable that
performance depends upon the RF imputation method. In these simulations, the
missForest algorithm mRF0.1 appears to be best (note that p = 10, so mRF0.1
corresponds to the limiting case of missForest). It should also be noted that performance
of RF procedures decreases systematically as the missing data mechanism becomes
more complex. This mirrors our previous findings.
5.4 Conclusions
Being able to deal with missing data effectively is of great importance to scientists
working with real world data today. A machine learning method such as RF,
known for its excellent prediction performance and ability to handle all forms of
data, represents a potentially attractive solution to this challenging problem. However,
because no systematic study of RF procedures had been attempted in missing
Figure 5.8: Mean relative imputation error ± standard deviation from simulations
under MCAR, MAR, and NMAR, for different sample size values n = 100, 200, 500, 1000, 2000.
RF-SRC. Incorporating mForest into the native library, combined with the openMP
parallel processing of RF-SRC, could make it much more attractive. However, even
with all of this, we still recommend some of the more basic OTFI algorithms, like the
unsupervised RF imputation procedures, for big data. These algorithms perform
solidly in terms of imputation and are hundreds of times faster than missForest.
5.5 Discussion
MissForest outperforms OTF imputation for the following reason. Suppose there are
two highly correlated variables, x1 and x2, where x1 has missing values, and neither
is important for predicting Y. Then x2 will not be split on early, and OTF imputation
will not be accurate in imputing x1. missForest, by contrast, will be very accurate in
imputing x1, and unsupervised imputation will fall in between. This is where
missForest gains its advantage.
Although the three random forest imputation algorithms (OTF, unsupervised, and
missForest) represent the three different imputation strategies mentioned above,
they all essentially partition the cases in a way that exploits the correlation between
variables. For a variable with missing values, the earlier its correlated covariate is
split on, the better the imputation accuracy will be. Since missForest always splits
the most correlated covariate first, its imputation accuracy is always as good as or
superior to that of OTF imputation. With OTF imputation, the response variable
determines when the most correlated covariate is split on. The unsupervised
algorithm falls in between, as the response variable is selected at random.
Although it was shown that correlation can serve as a general guide to imputation
accuracy, the real determining factor for the imputation accuracy of a specific
variable is how well the remaining covariates can collectively predict its missing
values. For instance, if a continuous variable needs to be imputed, the 'error rate',
or 'percent of variation explained', from the regression tree with this variable as the
outcome is the true indicator of imputation accuracy. In some cases, imputation
accuracy may be high for a particular variable even when its correlation with the
rest of the covariates is low.
A user should always examine the pattern of the missing values before applying any
imputation algorithm, as some pitfalls may exist. For example, if a variable
that is highly predictive of the response has high missingness, and it is
independent of the covariates (meaning it cannot be predicted from the covariates),
its imputed values will be far from accurate. As a result, inference made on this
variable with the imputed data set will be biased.
In the next chapter, we will look at some of these pitfalls in detail and offer some
possible solutions.
Chapter 6
6.1 Introduction
Random Forest, as a machine learning method, has gained popularity in many
research fields. It is appreciated for its improved prediction accuracy over a
single CART tree. In addition, it is a good method for high-dimensional data,
especially when complex relations exist between the predictor variables. Because of its
intrinsic ability to deal with complicated interactions, multiple methods have been
developed to perform variable selection using Random Forest. These methods are
based on two measures: the variable importance measure (VIMP) and the minimal
depth measure.
Missing values in predictor variables are often encountered in data analysis.
Although some empirical studies have been carried out to compare different imputation
methods in terms of imputation accuracy (Stekhoven and Buhlmann, 2012)
or variable selection (Genuer et al., 2010), a direct investigation of how the variable
importance measures and the minimal depth measure are affected by different ways of
handling missing values may provide guidance on which method for handling the
missing data problem should be chosen, and shed light on how these different methods
affect the result of variable selection. In this study, we introduced missing values in
a particular variable of the simulated or real-life datasets and looked at how VIMP
and minimal depth were affected by different RF imputation methods.
6.2 Approach
When an investigator intends to analyze a data set containing missing values for
the purpose of variable selection using random forest, one of three approaches
to handling the missing values can be chosen: 1. the complete case method,
in which any observation that contains a missing value is deleted; 2.
the built-in missing data options in the software, if available; 3. imputing
the data ahead of the analysis. The complete case method may not be a good choice,
especially when the number of predictors that contain missing values is high, since a
high percentage of observations would be deleted as a result. The built-in methods for
handling missing values are OTF imputation and proximity imputation. We compared
the performance of these methods, namely, the complete case method,
imputation by proximity, OTF imputation, multivariate missForest (mForest)
imputation, missForest imputation, and KNN imputation, for
correctly estimating variable importance and minimal depth. The ideal imputation
method is one for which the importance scores or minimal depth calculated from the
imputed data are identical to those that would have been calculated from the data
without missing values. Therefore, we defined a relative importance score (RelImp)
and a relative minimal depth score (RelMdp) in this study, which are calculated as
Five simulations with synthesized data were carried out to study the effect of different
imputation methods on the importance measure and minimal depth. We
started with the simplest scenario.
Where x1, x2, x3, e are normal distributions with mean of 2,2,1,0, standard devia-
tion of 2,2,1,0.5, respectively. x4 follows exponential distribution with rate of 0.5.
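A pure-Python sketch of this data generation (the simulation-1 outcome model is not reproduced in the text, so only the predictors and the noise term are drawn; the function name is ours):

```python
import random

def simulate_predictors(n, seed=1):
    """Draw n observations with the simulation-1 marginals:
    x1 ~ N(2, 2), x2 ~ N(2, 2), x3 ~ N(1, 1), e ~ N(0, 0.5),
    and x4 ~ Exponential(rate = 0.5), i.e. mean 2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rows.append({
            "x1": rng.gauss(2, 2),
            "x2": rng.gauss(2, 2),
            "x3": rng.gauss(1, 1),
            "x4": rng.expovariate(0.5),  # rate 0.5 -> mean 1/0.5 = 2
            "e":  rng.gauss(0, 0.5),
        })
    return rows

data = simulate_predictors(5000)  # five thousand observations per dataset
```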
The second simulation has two important variables, and these two important
variables are uncorrelated and independent of each other.
In the 3rd, 4th, and 5th simulations, x1 and x2 have the same mean and variance
as in simulations 1 and 2, but instead of being independent of each other, they
are correlated with a correlation coefficient of 0.75.
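One standard way to draw such a correlated pair, sketched in pure Python (the two-variable Cholesky construction; the helper name is ours):

```python
import math
import random

def correlated_pair(rng, mean=2.0, sd=2.0, rho=0.75):
    """Draw (x1, x2), each N(mean, sd), with correlation rho,
    using the two-variable Cholesky construction."""
    z1 = rng.gauss(0, 1)
    z2 = rng.gauss(0, 1)
    x1 = mean + sd * z1
    x2 = mean + sd * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
    return x1, x2
```

Averaging the sample correlation over many draws recovers rho, here 0.75.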
Simulation 5: Y = x1 + x2 + x3 + 0.5 × x4
Five thousand observations were created for each simulated dataset. We first
created q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent
missing values in x1 under the MCAR missing data mechanism, and then used either
complete case deletion or one of the six imputation methods to handle the
missingness. The dataset was then analyzed using regular RF. VIMP and minimal
depth for x1 were recorded; the relative importance score (RelImp) and relative
minimal depth score (RelMdp) were calculated and plotted.
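The MCAR step of this pipeline can be sketched in pure Python (`None` stands in for a missing value; the helper name is ours):

```python
import random

def make_mcar(column, q_percent, seed=0):
    """Return a copy of `column` with q_percent of its entries set to
    None, the positions chosen completely at random (MCAR)."""
    rng = random.Random(seed)
    out = list(column)
    n_missing = round(len(out) * q_percent / 100)
    for i in rng.sample(range(len(out)), n_missing):
        out[i] = None
    return out

x1 = [float(i) for i in range(5000)]
x1_mcar = make_mcar(x1, q_percent=25)   # 1250 entries become missing
```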
For OTF imputation, one iteration was used. For missForest or mForest imputation,
a maximum of five iterations was used if the algorithm did not converge before
the fifth iteration. For all the imputation methods, node size, nsplit, and ntree
(the number of trees grown for each RF) were chosen to be 5, 20, and 250,
respectively. The α value was chosen to be 0.1 in the mRFα imputation method to
be an equivalent of missForest imputation. As there are only four predictors in
the simulation, α was set to 0.5 in mRFα for mForest imputation.
As shown in figures 6.1 to 6.5, the complete case method resulted in the same
VIMP and minimal depth as the original data with no missingness. Although some
efficiency is lost, the complete case method gives the same random forest when
the missing mechanism is MCAR. As the percentage of missingness increases, the
standard deviation of the calculated VIMP increases.
In the case of simulations 1 and 2, when the predictor variables are not
correlated, the VIMP for x1 decreases regardless of the imputation method used.
As shown in figures 6.1 and 6.2, the relative importance score (RelImp) for x1
decreases as the percentage of missingness in x1 increases. The magnitude of the
decrease was identical for all random forest imputation methods.
As shown in figures 6.3, 6.4, and 6.5, the magnitude of the decrease was smaller
for the OTF and mForest imputation methods, followed by unsupervised and
missForest imputation. KNN imputation had the worst performance in terms of
preservation of VIMP and minimal depth. Generally speaking, better preservation
of VIMP and minimal depth corresponds to better imputation accuracy, as shown in
figures 6.3, 6.4, and 6.5.
We then looked at the change in VIMP and minimal depth for all four variables in
model 5 when the missingness was in x1. That is, x1 and x2 are correlated with
r = 0.75, and Y = x1 + x2 + x3 + 0.5 × x4.
As shown in figure 6.6, when x1 is missing, analysis after imputation results in
the VIMP of x1 decreasing, the VIMP of x2 increasing, and the VIMP of x3 and x4
remaining unchanged. The adaptive method does not behave like this; its only loss
is the sample size for x1.
Noticing the change in correlations between the variables, we suspected that
choosing a larger nodesize might be beneficial. We repeated simulation 5 with
nodesize set to 200 instead of 5. As shown in figure 6.9, the bias in the
estimated VIMP was much reduced with the adaptive and mForest algorithms. The
OTF algorithm performed especially well, resulting in very little bias even when
the percentage of missingness was high.
We further studied how coefficient estimation in regression is affected when
missing values are imputed with RF or KNN imputation. We used simulation 5
described above, that is, Y = x1 + x2 + x3 + 0.5 × x4. Missing values of
q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent were
created by the MCAR mechanism. The missing values were then imputed with the RF
or KNN imputation methods, and a regular GLM was then fit to each imputed dataset
to estimate β1, β2, β3, and β4. The ratio of each estimated β to the true β was
recorded and plotted in figure 6.11.
6.5 Conclusion
2. In the simulations in this study, OTF imputation was as good as, or even
slightly better than, the more computationally expensive imputation methods such
as missForest or mForest at preserving the original VIMP or minimal depth.
6.6 Discussion
Although RF imputation algorithms are able to impute missing values with superior
accuracy, caution needs to be taken if the purpose of the imputation is further
inference. We can foresee understandable pitfalls in two scenarios. Scenario 1:
the variable containing missing values is not correlated with any other
predictors. The imputation is then equivalent to a random draw from the observed
values. There will be bias in the inference, and the magnitude of the bias
depends on the percentage of missingness. Scenario 2: the variable containing
missing values is correlated with only one other predictor. The imputation will
then change the correlation (and the covariance matrix) between variables. Not
surprisingly, there will therefore be bias in the inference. This bias can be
reduced with a larger node size.
When the variable with missing values correlates with more than one predictor,
we did not observe much correlation change between the variables in our
simulations, because the imputed values do not lie on any one particular
regression line. In this case, imputation is beneficial to the analysis.
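The scenario-1 bias is easy to reproduce numerically. Below is a simplified pure-Python sketch (not the RF algorithm itself): a predictor is made 40% MCAR and "imputed" by random draws from its observed values, which attenuates the regression slope roughly in proportion to the fraction imputed.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(42)
x = [rng.gauss(2, 2) for _ in range(5000)]
y = [xi + rng.gauss(0, 0.5) for xi in x]      # true slope = 1

# make 40% of x MCAR, then "impute" by random draws from observed values
missing = set(rng.sample(range(5000), 2000))
observed = [xi for i, xi in enumerate(x) if i not in missing]
x_imp = [rng.choice(observed) if i in missing else xi
         for i, xi in enumerate(x)]

slope_full = ols_slope(x, y)      # close to 1.0
slope_imp = ols_slope(x_imp, y)   # roughly 0.6: biased toward zero
```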
Since the inference from the analysis can be affected differently depending on
the data structure and the percentage of missingness, we suggest that users of
the random forest imputation algorithms study the structure of the data before
imputation. We also suggest that a relatively large node size be used in the
imputation. Since a large node size also improves computational speed, this is a
desirable characteristic of the random forest imputation algorithms.
Figure 6.1: Simulation 1: The effect of different imputation methods on the
importance measurement and the minimal depth. (Panels: standardized importance,
standardized minimal depth, and standardized imputation error versus percent
missing, for the complete case, adaptive, unsupervised, proximity, mf,
missForest, and knn methods.)
Figure 6.2: Simulation 2: The effect of different imputation methods on the
importance measurement and the minimal depth. (Panels: standardized importance,
standardized minimal depth, and standardized imputation error versus percent
missing.)
Figure 6.3: Simulation 3: The effect of different imputation methods on the
importance measurement and the minimal depth. (Panels: standardized importance,
standardized minimal depth, and standardized imputation error versus percent
missing.)
Figure 6.4: Simulation 4: The effect of different imputation methods on the
importance measurement and the minimal depth. (Panels: standardized importance,
standardized minimal depth, and standardized imputation error versus percent
missing.)
Figure 6.5: Simulation 5: The effect of different imputation methods on the
importance measurement and the minimal depth. (Panels: standardized importance,
standardized minimal depth, and standardized imputation error versus percent
missing.)
Figure 6.6: The effect of different imputation methods on the importance
measurement for all four variables. (Panels: standardized importance of x1, x2,
x3, and x4 versus percent missing, with missingness in x1.)
Figure 6.7: The effect of different imputation methods on the minimal depth
measurement for all four variables. (Panels: standardized minimal depth of x1,
x2, x3, and x4 versus percent missing, with missingness in x1.)
(Figure: the correlation of x1 with x2, and the correlation of each of x1-x4
with Y, after imputation using missForest, versus percent missing.)
Figure 6.10: The effect of different imputation methods on the importance
measurement and the minimal depth with the Diabetes data.
Figure 6.11: The effect of different imputation methods on the parameter
estimation in GLM. (Panels: ratio of estimated to true β1, β2, β3, and β4 versus
percent missing.)
Chapter 7
In the MESA data, there are 711 total variables with at least one observation
(8 variables had no observations and were therefore deleted), including variables
from clinic procedures, questionnaires, created analytic variables (e.g., body
mass index, hypertension stage, ankle-brachial index), and key Reading Center
variables (e.g., total calcium score, aortic distensibility by MRI, average
carotid wall thickness by ultrasound). These variables cover multiple aspects of
the patients, such as medical history, anthropometry, health and life, medication
information, neighborhood information, personal history, CT, ECG, lipids (blood
group or NMR LipoProfile-II spectral analysis), MRI, pulse wave, ultrasound
distensibility, ultrasound IMT, and physical activity.
Out of the 711 variables, 134 have more than 50 percent missing values, 114 have
more than 75 percent missing values, and 58 have more than 90 percent missing
values. The percent of missingness for each variable is plotted in figure 7.1.
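A missingness summary of this kind is a one-liner per variable (sketch with toy data; `None` denotes a missing entry):

```python
def percent_missing(column):
    """Percent of entries in a variable that are missing (None)."""
    return 100.0 * sum(v is None for v in column) / len(column)

# e.g. count variables exceeding a missingness threshold, as in Figure 7.1
columns = {
    "v1": [1.0, None, 3.0, None],        # 50% missing
    "v2": [None, None, None, 2.0],       # 75% missing
}
over_50 = [name for name, col in columns.items()
           if percent_missing(col) > 50]
```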
To select variables that are important in predicting 10-year CHD, we first used
unsupervised imputation, mForest imputation, and OTF imputation to impute the
dataset. Random survival forest analysis was then conducted on the imputed
datasets. The top important variables in predicting 10-year CHD and their
relative VIMPs are reported in table 7.1. We also compared these results with
the OTF no-imputation method.
As shown in table 7.1, total calcium volume score and different versions of the
Framingham risk score for CHD were the top selected predictors in all the imputed
datasets. Interestingly, with OTF imputation, three variables ranked ahead of
the Framingham scores in importance for predicting the outcome, and with OTF no
imputation, six variables ranked ahead of the Framingham scores.
Figure 7.1: The percent of missingness for all the variables in MESA data.
Table 7.1: Top importance variables and the rank of importance of tnfri1 with dif-
ferent imputation methods
Use Observed MIA for tnfri1 mForest Imputation
Variable Importance Variable Importance Variable Importance
lncac 1.00 lncac 1.00 lncac 1.00
Fram. Scores 0.20-0.11 Fram. Scores 0.17-0.07 Fram. Scores 0.16-0.08
maxstn1c 0.089 zmximt1c 0.051 maxint1c 0.057
youngm1 0.072 stm1 0.002 zmximt1c 0.051
totvip1 0.070 age1c 0.027 rdpedis1 0.039
pkyrs1c 0.051 pkyrs1c 0.026 lptib1 0.038
chwyrs1c 0.035 maxint1c 0.021 pkyrs1c 0.029
spp1c 0.034 lptib1 0.020 pprescp1 0.025
...(27 variables) ... abi1c 0.019 age1c 0.023
tnfri1 0.01 (10 variables) ... ...(6 variables) ...
... ... tnfri1 0.010 tnfri1 0.015
Unsupervised New OTF with Imputation OTF No Imputation
Variable Importance Variable Importance Variable Importance
lncac 1.00 lncac 1.00 lncac 1.00
Fram. Scores 0.16-0.09 Chwday1 0.225 Chwday1 0.197
lolucdp1 0.087 tfpi1 0.165 vwf1 0.180
zmximt1c 0.063 vwf1 0.156 aspeage1 0.131
rdpedis1 0.043 Fram. Scores 0.110-0.094 hrmage1c 0.103
maxint1c 0.038 pkyrs1c 0.086 tfpi1 0.094
lptib1 0.035 aspeage 0.078 pkyrs1c 0.079
pkyrx1c 0.028 stm1 0.064 Fram. Scores 0.078-0.074
...(2 variables) ... ... ... aspsage1 0.062
tnfri1 0.023 ... ... lptib1 0.062
... ... ... ... mmp31 0.057
Personalized care based on a patient's risk of developing atherosclerotic
cardiovascular disease (ASCVD) is the safest and most effective way of treating
a patient. ASCVD risk engines provide a promising way for clinicians to tailor
treatment to risk by identifying patients who are likely to benefit from, or be
harmed by, a particular treatment. Current guidelines on ASCVD prevention match
the intensity of risk-reducing therapy to the patient's absolute risk of new or
recurrent ASCVD events using ASCVD risk engines (Stone et al., 2014). For
instance, the 2013 American College of Cardiology/American Heart Association
(ACC/AHA) guidelines recommend that patients with a calculated pooled risk score
of 7.5% or greater be eligible for statin therapy for primary prevention.
A number of multivariable risk models have been developed for estimating the
risk of initial cardiovascular events in healthy individuals. The original
Framingham risk score, published in 1998, was derived from a largely Caucasian
population of European descent (Wilson et al., 1998) using the endpoints of CHD
death, nonfatal MI, unstable angina, and stable angina. Prediction variables
used in the Framingham CHD risk score included age, gender, total or LDL
cholesterol, HDL cholesterol, systolic blood pressure, diabetes mellitus, and
current smoking status. The Framingham General CVD risk score (2008) extended
the original Framingham risk score to include all of the potential
manifestations and adverse consequences of atherosclerosis, such as stroke,
transient ischemic attack, claudication, and heart failure (HF) (D'Agostino et
al., 2008). The ACC/AHA pooled cohort hard CVD risk calculator (2013), developed
from several cohorts of patients, was the first risk model to include data from
large populations of both Caucasian and African-American patients (Goff et al.,
2014). The model includes the same parameters as the 2008 Framingham General CVD
model but, in contrast to the 2008 Framingham model, includes only hard endpoints
(fatal and nonfatal MI and stroke). However, while the calculator appears to be
well calibrated in some populations similar to those for which it was developed
(REGARDS), it has not been as accurate in other populations (Rotterdam) (Kavousi
et al., 2014). The prediction variables used in the ACC/AHA pooled cohort hard
CVD risk calculator were age, gender, total cholesterol, HDL cholesterol,
systolic blood pressure, blood pressure treatment, diabetes mellitus, and current
smoking; the endpoints assessed were CHD death, nonfatal MI, fatal stroke, and
nonfatal stroke. Another well-known risk score is the JBS risk score (2014),
which is based on the QRISK lifetime cardiovascular risk calculator and extends
the assessment of risk beyond the 10-year window, allowing for the estimation of
heart age and assessment of risk over longer intervals.
The MESA risk score (2015) improved the accuracy of 10-year CHD risk estimation
by incorporating CAC (coronary artery calcium score) in the algorithm, together
with the traditional risk factors. Inclusion of CAC in the MESA risk score
resulted in significant improvements in risk prediction (C-statistic 0.80 vs.
0.75; p < 0.0001) compared to using only the traditional risk factors. In
addition, external validation in both the HNR (German Heinz Nixdorf Recall) and
DHS (Dallas Heart Study) studies provided evidence of very good discrimination
and calibration. The prediction variables used in the MESA risk score were age,
gender, ethnicity (non-Hispanic white, Chinese American, African American,
Hispanic), total cholesterol, HDL cholesterol, lipid-lowering treatment, systolic
blood pressure, blood pressure treatment (yes or no), diabetes mellitus (yes or
no), current smoking (yes or no), family history of myocardial infarction at any
age (yes or no), and coronary artery calcium score. The endpoints assessed were
CHD death, nonfatal MI, resuscitated cardiac arrest, and coronary
revascularization in patients with angina. Although the MESA risk score with CAC
incorporated appears to be superior in risk prediction, one problem is that the
CAC score may not be available for many individuals. In this study, we look at
different strategies for building a risk engine when the CAC score is not
available in the testing data.
7.2.2 Methods
When the CAC score is available in the training data set but not in the testing
data set, four strategies may be applied:
1. Ignore CAC; use only the traditional Framingham predictors.
2. Ignore CAC; use all available predictors.
3. Framingham model with true CAC in training; predict CAC in the testing
dataset using all the testing data.
4. New method: replace the true CAC in training with predicted CAC and proceed
as in strategy 3.
The new method (strategy 4) proceeds as follows:
1. Build a predictive model for CAC using the training data, with the outcome
variables removed. (Recall that the CAC score is available in the training
dataset.)
2. Obtain the fitted CAC score for each training observation.
3. Replace the true CAC score with the fitted CAC score in the training data.
4. Build a predictive model for the outcome (10-year CHD) using the updated
training data from step 3.
5. Predict the CAC score in the testing dataset using the model from step 1,
then predict the outcome with the model from step 4.
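The steps above can be sketched as a single pipeline (hypothetical `fit(X, y)` learner interface returning an object with a `.predict(X)` method; the dissertation itself uses random survival forests for both models):

```python
def strategy4(train_X, train_cac, train_y, test_X, fit):
    """Strategy 4: model CAC from the training predictors, replace the
    true CAC with its fitted values in training, fit the outcome model,
    then predict on test data where CAC is unobserved."""
    # Step 1: predictive model for CAC (outcome variables removed)
    cac_model = fit(train_X, train_cac)
    # Steps 2-3: replace true CAC with fitted CAC in the training data
    fitted_cac = cac_model.predict(train_X)
    train_aug = [row + [c] for row, c in zip(train_X, fitted_cac)]
    # Step 4: outcome model on the updated training data
    outcome_model = fit(train_aug, train_y)
    # Step 5: predict CAC in the test set, then predict the outcome
    test_cac = cac_model.predict(test_X)
    test_aug = [row + [c] for row, c in zip(test_X, test_cac)]
    return outcome_model.predict(test_aug)
```

Any learner satisfying the `fit`/`predict` contract can be plugged in; the point is that the test set never needs an observed CAC value.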
7.2.3 Results
Traditional Framingham with CAC score predicts 10-year CHD better than RSF with
all 696 available predictors
Using the 12 traditional Framingham predictors with the CAC score, the error
rate for 10-year CHD was found to be 22.6%, using a node size of 100. We first
investigated whether the error rate could be improved by simply using all 690
available predictors in the MESA dataset. The resulting RSF had an error rate of
0.239, showing no improvement in prediction. This is not surprising, as we showed
that the only strong predictors of 10-year CHD were the CAC score and the
Framingham predictors, the rest being very weak predictors.
When CAC is not available, the error rate increased to 29.0% (0.290) using only
the Framingham predictors.
One natural scenario is that the CAC score is available in the training dataset
but not in the testing dataset. Accordingly, one strategy may be to include CAC
in the training dataset and predict CAC in the testing dataset when it is not
available, a common case in the clinic, as not all patients undergo the CT scan
needed for a CAC score. Using this strategy, the prediction error rate was found
to be 0.324, which is worse than simply ignoring CAC in both the training and
testing datasets. This is not surprising, as the percent of variance explained
in predicting CAC was only 32 percent. We used a novel strategy to utilize the
known CAC score in the training dataset, which resulted in improved prediction
compared to simply ignoring the CAC score.
To compare the four strategies for predicting 10-year CHD with the MESA data,
the following simulation was carried out. The observations in the MESA data were
randomly assigned to a training or testing set, and the CAC scores in the testing
set were then removed. Each of the four strategies for predicting 10-year CHD
was carried out. The experiment was repeated 100 times for each strategy, and
the average of the 100 error rates was recorded. The RSF parameters used were
mtry = 100, ntree = 600, nsplit = 20 or 100, and nodesize = 100. 696 variables,
instead of 711, were used because we deleted all the variables created from the
CAC score except one. The natural log of the CAC score was calculated and used,
to be consistent with the literature. The results are shown in Table 7.3.
7.2.4 Conclusions
The MESA dataset we used is unique in that it has essentially two very important
variables for predicting the outcome (CHD): the CAC score and the Framingham
score. In reality, the Framingham score is almost always available for a patient,
while the CAC score is often not available, since obtaining it requires a CT
scan, which is expensive and involves radiation. This results in a scenario in
which a very important variable is in the training dataset but not in the testing
dataset. We showed that simply discarding this variable is not a good strategy.
We proposed a new strategy that utilizes the information in this variable in the
training dataset even when it is not available in the testing dataset, and we
showed that the prediction error rate was improved with this new strategy.
Table 7.3: Prediction error rates using four strategies when CAC score is
available in the training, but not test, set

Strategy             p(training)  p(testing)  Error rate   Error rate
                                              (nsplit=20)  (nsplit=100)
Framingham and CAC   13           13          0.223        0.228
Full model and CAC   696          696         0.234        0.247
Strategy 1           12           12          0.283        0.290
Strategy 2           695          695         0.276        0.293
Strategy 3           13           13          0.298        0.324
Strategy 4           13           13          0.263        0.263
Bibliography
Abdella, M. and Marwala, T. (2015). The use of genetic algorithms and neural
networks to approximate missing data in database. International Conference on
Computational Cybernetics, School of Electrical and Information Engineering,
University of the Witwatersrand, Johannesburg, South Africa, 13-16 April,
pp. 207-212.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984) Classification
and Regression Trees. Wadsworth, Belmont, California.
Breiman, L. (2003) Manual: setting up, using, and understanding random forests
V4.0.
Bartlett, J.W., Seaman, S.R., White, I.R., and Carpenter, J.R. (2015). Multiple
imputation of covariates by fully conditional specification: accommodating the
substantive model. Statistical Methods in Medical Research, 24(4):462-487.
D’Agostino, R.B. Sr, Vasan, R.S., Pencina, M.J., Wolf, P.A., Cobain, M., Massaro,
J.M., Kannel, W.B. (2008) General cardiovascular risk profile for use in primary
care: the Framingham Heart Study. Circulation. 117(6):743.
Devroye L., Gyorfi L., and Lugosi G. (1996) Probabilistic Theory of Pattern Recog-
nition, Springer-Verlag.
Doove, L.L., Van Buuren, S., and Dusseldorp, E. (2014) Recursive partitioning
for missing data imputation in the presence of interaction effects.
Computational Statistics & Data Analysis, 72:92-104.
Goff, D.C. Jr, Lloyd-Jones, D.M., Bennett, G., Coady, S., D'Agostino, R.B.,
Gibbons, R., Greenland, P., Lackland, D.T., Levy, D., O'Donnell, C.J., Robinson,
J.G., Schwartz, J.S., Shero, S.T., Smith, S.C. Jr, Sorlie, P., Stone, N.J.,
Wilson, P.W., Jordan, H.S., Nevo, L., Wnek, J., Anderson, J.L., Halperin, J.L.,
Albert, N.M., Bozkurt, B., Brindis, R.G., Curtis, L.H., DeMets, D., Hochman,
J.S., Kovacs, R.J., Ohman, E.M., Pressler, S.J., Sellke, F.W., Shen, W.K., Smith,
S.C. Jr, and Tomaselli, G.F. (2014) 2013 ACC/AHA guideline on the assessment of
cardiovascular risk: a report of the American College of Cardiology/American
Heart Association Task Force on Practice Guidelines. Circulation, 129(25 Suppl
2):S49.
Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002) A Distribution-Free
Theory of Nonparametric Regression, Springer.
Hastie, T., Tibshirani, R., Narasimhan, B., and Chu, G. (2015) impute:
Imputation for microarray data. R package version 1.34.0,
http://bioconductor.org.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008) Random
survival forests. Ann. Appl. Stat., 2:841–860.
Ishwaran, H., Kogalur, U.B., Gorodeski, E.Z., Minn, A.J. and Lauer, M.S. (2010).
High-dimensional variable selection for survival data. J. Amer. Stat. Assoc, 105,
205-217.
Kavousi, M., Leening, M.J., Nanchen, D., Greenland, P., Graham, I.M., Steyerberg,
E.W., Ikram, M.A., Stricker, B.H., Hofman, A., Franco, O.H. (2014) Compari-
son of application of the ACC/AHA guidelines, Adult Treatment Panel III guide-
lines, and European Society of Cardiology guidelines for cardiovascular disease
prevention in a European cohort. JAMA. 311(14):1416.
Little, R.J.A. (1992). Regression with missing X's: a review. Journal of the
American Statistical Association, 87(420):1227-1237.
Loh, P.L. and Wainwright, M.J.. (2011) High-dimensional regression with noisy
and missing data: provable guarantees with non-convexity. Advances in Neural
Information Processing Systems, pp. 2726–2734.
Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.
Rubin, D.B. (1996) Multiple Imputation after 18+ Years (with discussion). Journal
of the American Statistical Association, 91:473-489.
Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., and Hemingway, H.
(2014) Comparison of random forest and parametric imputation models for imputing
missing data using MICE: a CALIBER study. American Journal of Epidemiology,
179(6):764-774.
Stone, N.J., Robinson, J.G., Lichtenstein, A.H., et al. (2014) ACC/AHA guideline
on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular
risk in adults: a report of the American College of Cardiology/American Heart
Association Task Force on Practice Guidelines. J Am Coll Cardiol, 63:2889-2934.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani,
R., Botstein, D., and Altman, R. (2001) Missing value estimation methods for
DNA microarrays. Bioinformatics, 17(6):520-525.
Twala, B., Jones, M.C., and Hand, D.J. (2008) Good methods for coping with
missing data in decision trees. Pattern Recognition Letters, 29(7):950-956.
Twala, B. and Cartwright, M.C. (2010) Ensemble missing data techniques for
software effort prediction. Intelligent Data Analysis, 14(3):299-331.
Waljee, A.K. et al. (2013). Comparison of imputation methods for missing labora-
tory data in medicine. BMJ Open, 3(8):e002847.
Wilson, P.W., D'Agostino, R.B., Levy, D., Belanger, A.M., Silbershatz, H., and
Kannel, W.B. (1998) Prediction of coronary heart disease using risk factor
categories. Circulation, 97(18):1837.