
University of Miami

Scholarly Repository
Open Access Dissertations Electronic Theses and Dissertations

2017-05-02

Random Forest Missing Data Approaches


Fei Tang
University of Miami, ftang@med.miami.edu


Recommended Citation
Tang, Fei, "Random Forest Missing Data Approaches" (2017). Open Access Dissertations. 1852.
https://scholarlyrepository.miami.edu/oa_dissertations/1852

UNIVERSITY OF MIAMI

RANDOM FOREST MISSING DATA APPROACHES

By

Fei Tang

A DISSERTATION

Submitted to the Faculty


of the University of Miami
in partial fulfillment of the requirements for
the degree of Doctor of Philosophy

Coral Gables, Florida

May 2017
© 2017
Fei Tang
All Rights Reserved
UNIVERSITY OF MIAMI

A dissertation submitted in partial fulfillment of


the requirements for the degree of
Doctor of Philosophy

RANDOM FOREST MISSING DATA APPROACHES

Fei Tang

Approved:

Hemant Ishwaran, Ph.D.
Professor of Biostatistics

J. Sunil Rao, Ph.D.
Professor of Biostatistics

Lily Wang, Ph.D.
Professor of Biostatistics

Guillermo Prado, Ph.D.
Dean of the Graduate School

Panagiota V. Caralis, M.D.
Professor of Internal Medicine
TANG, FEI (Ph.D., Biostatistics)
Random Forest Missing Data Approaches (May 2017)

Abstract of a dissertation at the University of Miami

Dissertation supervised by Professor Hemant Ishwaran
No. of pages in text. (99)

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of the promising new imputation algorithm missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random. A real data analysis using the RF imputation methods was conducted on the MESA data.
TABLE OF CONTENTS

Page

LIST OF FIGURES ..................................................................................................... iv

LIST OF TABLES ....................................................................................................... vi

Chapter

1 PREFACE ...................................................................................................... 1

2 CART TREES AND RANDOM FOREST ..................................................... 9

3 EXISTING RF MISSING DATA APPROACHES ........................................ 24

4 NOVEL ENHANCEMENTS OF RF MISSING DATA APPROACHES .... 30

5 IMPUTATION PERFORMANCE .................................................................. 42

6 RF IMPUTATION ON VIMP AND MINIMAL DEPTH ............................. 65

7 MESA DATA ANALYSIS ............................................................................ 85

Bibliography .................................................................................................. 96

List of Figures

5.1 Summary values for the 60 data sets. . . . . . . . . . . . . . . . . . 44


5.2 ANOVA effect size for the log-information. . . . . . . . . . . . . . 50
5.3 Relative imputation error, ER (I ). . . . . . . . . . . . . . . . . . . 51
5.4 Relative imputation error stratified by correlation. . . . . . . . . . . 52
5.5 Log of computing time versus log-complexity. . . . . . . . . . . . . 57
5.6 Relative log-computing time versus log-complexity. . . . . . . . . . 58
5.7 Log of computing time versus log-dimension. . . . . . . . . . . . . 63
5.8 Mean relative imputation error under different sample sizes. . . . . 64

6.1 Simulation 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Simulation 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3 Simulation 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Simulation 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Simulation 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.6 Importance measurement for all four variables. . . . . . . . . . . . 79
6.7 Minimal Depth measurement for all four variables. . . . . . . . . . 80
6.8 The correlation change due to missForest imputation. . . . . . . . . 81
6.9 Simulation 5 with nodesize of 200. . . . . . . . . . . . . . . . . . 82
6.10 The effect of different imputation methods with the Diabetes data. 83

6.11 The effect of different imputation methods in GLM. . . . . . . . . 84

7.1 The percent of missingness for all the variables in MESA data. . . . 88

List of Tables

5.1 Experimental design used for large scale study . . . . . . . . . . . . 43


5.2 Relative imputation error ER (I ). . . . . . . . . . . . . . . . . . . 53

6.1 Summary characteristics for the Simulated models. . . . . . . . . . 68

7.1 Top importance variables with different imputation methods . . . . 89


7.2 Top importance variables explanation . . . . . . . . . . . . . . . . 89
7.3 Prediction error rates using four strategies . . . . . . . . . . . . . . 95

Chapter 1

Preface

Classification and regression trees (CART) is a machine learning method that is commonly used in data mining. It constructs trees by conducting binary splits on predictor variables, aiming to produce subsets of the data that are homogeneous with respect to the outcome. Although CART is a very intuitive way to understand data, it is known for its instability in prediction, meaning that predictions based on CART can change substantially with minor perturbations of the data. One method for reducing the variance of a predictor is bagging, or "bootstrap aggregation". Random forest (RF; Breiman, 2001) is bagged trees with randomness injected to minimize the correlation between the trees. It is appreciated for its improvement in prediction accuracy over a single CART tree and for its prediction stability. In addition, it is a good method for high dimensional data, especially when complex relations exist between the predictor variables. As a result, it has gained popularity in many research fields.
Missing data is a real world problem frequently encountered in medical settings. Because statistical analyses almost always require complete data, medical researchers are forced either to impute data or to discard observations with missing values. Simply discarding observations with missing values (complete case analysis) is usually not a reasonable practice, as valuable information may be lost and inferential power compromised (Enders, 2010); it can even cause selection bias in some cases. In addition, deleting observations with missing values may leave very few observations when a large number of predictor variables contain missing values, especially for high dimensional data. As a result, it is advisable to impute the data before any analysis is performed.
Many statistical imputation methods have been developed for missing data. However, in high dimensional and large scale data settings, such as genomic, proteomic, neuroimaging, and other high-throughput problems, many of these methods perform poorly, as they were never designed to be regularized or they break down for computational reasons. For example, it is recommended that all variables be included in multiple imputation to make it proper in general and to avoid bias in the estimates of the correlations (Rubin, 1996). This can lead to overparameterization when there are a large number of variables but the sample size is moderate, a scenario often seen with modern genomic data. Computational issues can also arise in the implementation of standard methods: the log-likelihood can become non-convex due to missing data, which is challenging to optimize using traditional methods such as the EM algorithm. Among the available imputation methods, some are for continuous data only (KNN, for instance; Troyanskaya et al., 2001) and some are for categorical data only (e.g., the saturated multinomial model). MICE (multivariate imputation by chained equations; Van Buuren, 2007) handles mixed data types (i.e., data having both continuous and categorical variables) but depends on tuning parameters or the specification of a parametric model. High dimensional data often feature mixed data types with complicated interactions among variables, making it infeasible to specify any parametric model, and implementation of these methods can break down in challenging data settings (Liao et al., 2014). Another serious issue is that most methods cannot deal with complex interactions and nonlinearity of variables, which are common in data from medical research. Standard multiple imputation approaches do not automatically incorporate interaction effects, and not surprisingly, this leads to biased parameter estimates when interactions are present (Doove et al., 2014). Although some techniques, such as fully conditional specification of the covariates (Bartlett et al., 2015), can be used to try to address this problem, these techniques may be hard and inefficient to implement in settings where interactions are expected to be complicated.
For these reasons there has been much interest in using machine learning methods for missing data imputation. A promising approach is based on Breiman's random forests (abbreviated hereafter as RF; Breiman, 2001). RF has the desired characteristics of being able to: (1) handle mixed types of missing data; (2) address interactions and nonlinearity; (3) scale to high dimensions while being free of data assumptions; (4) avoid overfitting; (5) address settings where there are more variables than observations; and (6) yield measures of variable importance that can potentially be used for variable selection. Currently there are several different RF missing data algorithms. These include the original RF proximity algorithm proposed by Breiman (Breiman, 2003), implemented in the randomForest R-package (Liaw and Wiener, 2002). A different class of algorithms are the on-the-fly-imputation (OTFI) algorithms implemented in the randomSurvivalForest R-package (abbreviated as RSF; Ishwaran et al., 2008), which allow data to be imputed while simultaneously growing a survival tree. These algorithms have been adopted within the randomForestSRC R-package (abbreviated as RF-SRC) to include not only survival, but classification and regression as well as other settings (Ishwaran et al., 2016). Recently, Stekhoven and Buhlmann (2012) introduced missForest, which takes a different approach by recasting the missing data problem as a prediction problem. Data are imputed by regressing each variable in turn against all other variables and then predicting missing data for the dependent variable using the fitted forest. In applications to both atomic and mixed data settings, missForest was found to outperform well known methods such as k-nearest neighbors (Troyanskaya et al., 2001) and parametric MICE (multivariate imputation by chained equations; Van Buuren, 2007). These findings have been confirmed in independent studies; see for example Waljee et al. (2013).
Because of the desired characteristics of RF mentioned above, some missing data algorithms have recently been developed that incorporate RF within traditional imputation methods. For instance, Doove et al. (2014) proposed using random forest for multiple imputation within the MICE framework. The algorithm is briefly summarized as follows (a minimal usage sketch is given after the steps).

To impute $Y$ using fully observed $(x_1, \ldots, x_p)$:

1. Apply random forest to $(y_{\mathrm{obs}}, x_{\mathrm{obs}})$, using $k$ bootstraps.

2. For a given subject with missing $Y$ and predictor values $x_1, \ldots, x_p$, take the observed values of $Y$ in the terminal nodes of all $k$ trees.

3. Randomly sample one observed value of $Y$ from these as the imputation of the missing $Y$.
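This RF-within-MICE procedure is available as method "rf" in van Buuren's mice package in R. The following minimal sketch is illustrative only: the data frame dat, the downstream model y ~ x1 + x2, and the parameter settings are placeholders rather than part of the original proposal.

library(mice)

set.seed(101)
imp <- mice(dat,
            method = "rf",   # random forest imputation within the chained equations
            m      = 5,      # number of multiply imputed data sets
            ntree  = 10)     # trees per forest

## Pool a downstream analysis over the imputed data sets, e.g. a linear model.
fit <- with(imp, lm(y ~ x1 + x2))
pool(fit)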
This process is embedded within MICE and repeated to create multiple imputations. Independently of Doove et al., Shah et al. (2014) also proposed using random forest for imputation, via a somewhat different approach (a rough sketch follows the steps below):

1. Take a bootstrap sample $(y_{\mathrm{obs}}^{bs}, x_{\mathrm{obs}}^{bs})$ from $(y_{\mathrm{obs}}, x_{\mathrm{obs}})$.

2. Apply a standard random forest to $(y_{\mathrm{obs}}^{bs}, x_{\mathrm{obs}}^{bs})$.

3. Impute missing $Y$ values by taking a normal draw centered at the forest prediction, with residual variance equal to the out-of-bag mean square error.
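A rough base-R sketch of these three steps for a continuous outcome is given below, using the randomForest package. The function and variable names are illustrative, the predictors are assumed fully observed, and this is not the authors' implementation.

library(randomForest)

impute_shah <- function(y, X, ntree = 100) {
  obs <- !is.na(y)                                       # rows with observed outcome
  bs  <- sample(which(obs), sum(obs), replace = TRUE)    # Step 1: bootstrap the observed data
  rf  <- randomForest(x = X[bs, , drop = FALSE], y = y[bs], ntree = ntree)  # Step 2
  mis <- which(!obs)
  pred <- predict(rf, newdata = X[mis, , drop = FALSE])
  ## Step 3: normal draw with residual variance set to the OOB mean squared error.
  y[mis] <- pred + rnorm(length(mis), sd = sqrt(rf$mse[ntree]))
  y
}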

A recent comparison of a random forest based MICE algorithm with standard MICE showed that both methods produced unbiased estimates of (log) hazard ratios using real-life data, but random forest MICE was more efficient and produced narrower confidence intervals. With simulated data in which nonlinearity exists, parameter estimates were less biased using random forest MICE, and confidence interval coverage was better (Shah et al., 2014). This suggests that random forest imputation may be useful for imputing complex data sets. A drawback of the approach, however, is the assumption of conditional normality and constant variance: the out-of-bag error is not the residual variance; it is the residual variance plus bias (Mendez and Lohr, 2011).
Besides random forest, other machine learning paradigms, involving elements such as Neuro-Fuzzy (NF) networks, Multilayer Perceptron (MLP) Neural Networks (NN), and Genetic Algorithms (GA) (Abdella et al., 2005), have been considered for missing data imputation as well. An empirical study showed that random forests were superior for imputing missing data, in terms of both accuracy and computation time, compared to auto-associative neuro-fuzzy configurations (Pantanowitz and Marwala, 2008).

Depending on whether missingness is related to the data values, three types of missingness are distinguished, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1976). Formally, let $Z$ denote the $n \times (p+1)$ data matrix, which includes the observed values $Z_{\mathrm{obs}}$ and missing values $Z_{\mathrm{mis}}$, and let $R$ be the missing data indicator matrix, with $(i,j)$th element $R_{i,j} = 1$ if $Z_{ij}$ is observed and $R_{i,j} = 0$ if $Z_{ij}$ is missing. The notion of a missing data mechanism is then formalized in terms of the conditional distribution of $R$ given $Z$. Missing completely at random (MCAR) implies the distribution of $R$ does not depend on $Z_{\mathrm{obs}}$ or $Z_{\mathrm{mis}}$; missing at random (MAR) allows $R$ to depend on $Z_{\mathrm{obs}}$ but not on $Z_{\mathrm{mis}}$; while missing not at random (MNAR) allows $R$ to depend on both $Z_{\mathrm{obs}}$ and $Z_{\mathrm{mis}}$. In other words, these three types of missingness can be distinguished as follows:
\[
\begin{aligned}
\text{MCAR}: &\quad P(R \mid Z_{\mathrm{comp}}) = P(R) \\
\text{MAR}:  &\quad P(R \mid Z_{\mathrm{comp}}) = P(R \mid Z_{\mathrm{obs}}) \\
\text{MNAR}: &\quad P(R \mid Z_{\mathrm{comp}}) = P(R \mid Z_{\mathrm{obs}}, Z_{\mathrm{mis}})
\end{aligned}
\]

The missingness mechanism in regression was illustrated by Little (Little, 1992) in a univariate missing data example. In a dataset consisting of $X = (X_1, \ldots, X_p)$ and $Y$, only $X_1$ has missing values. The probability that $X_1$ is missing for a case may (a) be independent of the data values (MCAR), (b) depend on the value of $X_1$ for that case (MNAR), or (c) depend on the values of $X_2, \ldots, X_p$ and $Y$ for that case (MAR). In this study, all three types of missingness were investigated.
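The following base-R sketch illustrates how the three mechanisms can be generated for $X_1$ in Little's univariate example; all variable names, coefficients, and missingness rates are arbitrary choices for illustration.

set.seed(1)
n  <- 500
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- X1 + X2 + rnorm(n)

## Missingness indicators for X1 (m = 1 means missing; this is 1 - R in the notation above).

## MCAR: missingness is independent of the data.
m_mcar <- rbinom(n, 1, 0.3)

## MAR: missingness depends only on the fully observed X2 and Y, not on X1 itself.
m_mar <- rbinom(n, 1, plogis(-1 + X2 + 0.5 * Y))

## MNAR: missingness depends on the (possibly unobserved) value of X1.
m_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * X1))

## Example: create the MAR version of X1 by blanking out the selected entries.
X1_mar <- ifelse(m_mar == 1, NA, X1)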
In this study, we first propose and implement some novel enhancements to RF imputation algorithms, including a multivariate missForest (mForest), which greatly improves computation speed compared to missForest. We then compare several RF imputation methods in detail. The goal of the investigation is twofold: (1) to assess the performance, in terms of imputation accuracy and speed, of random forest based imputation methods compared with some popular non-random-forest based methods; and (2) to determine, when the data analyst intends to handle the missing values using a random forest based method and then carry out variable selection using random forest, which imputation method gives the best result. In other words, we would like to identify the imputation method that minimizes the effect of missing values on the result of variable selection. To answer the first question, we performed a large scale empirical study using several RF missing data algorithms. Performance was assessed by imputation accuracy and computational speed. Different missing data mechanisms (missing at random and missing not at random) were used to assess robustness. In addition to the RF algorithms described above, we also considered several new algorithms, including a multivariate version of missForest, referred to as mForest. Despite the superior performance of missForest, the algorithm is computationally expensive to implement in high dimensions, as a separate RF must be run for each variable. The mForest algorithm helps to alleviate this problem by grouping variables and running a multivariate forest using each group in turn as the set of dependent variables. This replaces $p$ regressions, where $p$ is the number of variables, with approximately $1/\alpha$ regressions, where $0 < \alpha < 1$ is the user-specified group fraction size. Computational savings are found to be substantial for mForest, without overly compromising accuracy even for relatively large $\alpha$. Other RF algorithms studied included a new multivariate unsupervised algorithm and algorithms utilizing random splitting.

The organization of the thesis is as follows.

1. In Chapter 2, we review CART trees and random forest, including the splitting rules, VIMP, and minimal depth, the latter two being measures used for variable selection.

2. Chapter 3 reviews the existing random forest based approaches for handling incomplete data.

3. Chapter 4 introduces the novel enhancements of RF missing data methods.

4. Chapter 5 presents the imputation performance comparison of these methods, in terms of imputation accuracy and speed.

5. Chapter 6 shows how VIMP and minimal depth are affected by different methods of handling missing data.
Chapter 2

CART Trees and Random Forest

Classification and regression trees (CART) is a machine learning method commonly used in data mining. It constructs trees by conducting binary splits on predictor variables, aiming to produce homogeneous subsets of the data. Despite the advantages of CART trees, they are unstable. Random forest is bagged trees with randomness injected by (1) bootstrapping and (2) selecting from a random subset of variables to split on at each node. Often used univariate splitting rules include (1) the twoing criterion, (2) the entropy criterion, and (3) the Gini criterion. The multivariate regression splitting rule is based on the weighted sample variance. The multivariate classification splitting rule is an extension of the Gini splitting rule. The multivariate mixed splitting rule is a combination of the weighted sample variance and the weighted multivariate Gini splitting rule. VIMP (variable importance measure) and minimal depth measure how important a variable is in terms of predicting the outcome; they can be used for variable selection.

2.1 Classification and regression trees

Classification trees use recursive partitioning to classify a $p$-dimensional feature $x \in \mathcal{X}$ into one of $J$ class labels for a categorical outcome $Y \in \{C_1, \ldots, C_J\}$. The tree is constructed from a learning sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Many classification rules work by partitioning $\mathcal{X}$ into disjoint regions $R_{n,1}, R_{n,2}, \ldots$. For a given $x \in \mathcal{X}$, such a classifier is defined as
\[
\hat{C}_n(x) = \operatorname*{argmax}_{1 \le j \le J} \sum_{i=1}^{n} 1_{\{Y_i = C_j\}}\, 1_{\{X_i \in R_n(x)\}}
\tag{2.1}
\]
where $R_n(x) \in \{R_{n,1}, R_{n,2}, \ldots\}$ is the partition region (cell) containing $x$.


Regression trees are constructed using recursive partitioning similar to classification trees, with the outcome variable being continuous instead of categorical. The splitting rule seeks the split-point that minimizes the weighted sample variance.

2.2 Splitting rules

The general principle of a splitting rule is the reduction of tree impurity, because it encourages the tree to push dissimilar cases apart.

2.2.1 Univariate splitting

Often used splitting rules include the twoing criterion, the entropy criterion, and the Gini criterion. The Gini splitting rule is arguably the most popular and is defined as follows. Let $h$ be a tree node that is being split. Let $\hat{p}_j(h)$ denote the proportion of class $j$ cases in $h$. Let $s$ be a proposed split for a variable $x$ that splits $h$ into left and right daughter nodes $h_L$ and $h_R$, where
\[
h_L := \{x_i : x_i \le s\}, \qquad h_R := \{x_i : x_i > s\}.
\]
Let $N = |h|$, $N_L = |h_L|$, and $N_R = |h_R|$ denote the number of cases in $h$, $h_L$, and $h_R$ (note that $N = N_L + N_R$). The Gini node impurity for $h$ is defined as
\[
\hat{\phi}(h) = \sum_{j=1}^{J} \hat{p}_j(h)\bigl(1 - \hat{p}_j(h)\bigr).
\]
The Gini node impurity for $h_L$ is
\[
\hat{\phi}(h_L) = \sum_{j=1}^{J} \hat{p}_j(h_L)\bigl(1 - \hat{p}_j(h_L)\bigr),
\]
where $\hat{p}_j(h_L)$ is the class frequency for class $j$ in $h_L$; $\hat{\phi}(h_R)$ is defined in a similar way. The decrease in node impurity is
\[
\hat{\phi}(h) - \bigl[\hat{p}(h_L)\,\hat{\phi}(h_L) + \hat{p}(h_R)\,\hat{\phi}(h_R)\bigr],
\]
where $\hat{p}(h_L) = N_L/N$ and $\hat{p}(h_R) = N_R/N$ are the proportions of observations in $h_L$ and $h_R$, respectively. The quantity
\[
\hat{\theta}(s, h) = \hat{p}(h_L)\,\hat{\phi}(h_L) + \hat{p}(h_R)\,\hat{\phi}(h_R)
\]
is referred to as the Gini index or Gini splitting rule. The best split on $x$ is the split-point $s = \hat{s}$ maximizing the decrease in node impurity, which is equivalent to minimizing the Gini index $\hat{\theta}(s, h)$ with respect to $s$.
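As a concrete illustration, the following base-R sketch evaluates candidate split points for a single variable under the Gini splitting rule; the function names and the exhaustive search over unique values are illustrative simplifications of what tree software actually does.

## Gini impurity of a node containing class labels y.
gini_impurity <- function(y) {
  p <- prop.table(table(y))
  sum(p * (1 - p))
}

## Gini index (weighted daughter impurity) for the split x <= s.
gini_index <- function(x, y, s) {
  left <- x <= s
  mean(left) * gini_impurity(y[left]) + mean(!left) * gini_impurity(y[!left])
}

## Best split on x: the split point minimizing the Gini index.
best_gini_split <- function(x, y) {
  cand <- sort(unique(x))
  cand <- cand[-length(cand)]                 # largest value would give an empty right node
  vals <- sapply(cand, function(s) gini_index(x, y, s))
  list(split = cand[which.min(vals)], gini = min(vals))
}

best_gini_split(iris$Petal.Length, iris$Species)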
The twoing criterion (Breiman et al., 1984, pp. 104-106) is designed to find the grouping of all $J$ classes into two superclasses that leads to the greatest decrease in node impurity when considered as a two-class problem. Under Gini impurity, the best twoing split is the value of $s$ maximizing
\[
\hat{\theta}(s, h) = \frac{\hat{P}(h_L)\,\hat{P}(h_R)}{4} \left[ \sum_{j=1}^{J} \bigl| \hat{p}_j(h_L) - \hat{p}_j(h_R) \bigr| \right]^2.
\]
The entropy impurity is
\[
\hat{\phi}(h) = -\sum_{j=1}^{J} \hat{p}_j(h) \log \hat{p}_j(h).
\]
Thus, the best split on $x$ under entropy is the value of $s$ maximizing
\[
\hat{\theta}(s, h) = \hat{P}(h_L)\sum_{j=1}^{J} \hat{p}_j(h_L) \log \hat{p}_j(h_L) + \hat{P}(h_R)\sum_{j=1}^{J} \hat{p}_j(h_R) \log \hat{p}_j(h_R).
\]

2.2.2 Multivariate splitting

Regression multivariate splitting

We shall denote the learning data by $(X_i, Y_i)_{1 \le i \le n}$, where $X_i = (X_{i,1}, \ldots, X_{i,p})$ is a $p$-dimensional feature and $Y_i = (Y_{i,1}, \ldots, Y_{i,q})$ is a $q \ge 1$ dimensional response. We shall denote a generic coordinate of the multivariate feature by $X$ and refer to $X$ as a variable (i.e., covariate); for example, $X_i$ refers to the value of the covariate $X$ for case $i$. Multivariate regression corresponds to the case where the $Y_{i,j}$ are continuous. To define the multivariate regression splitting rule we begin by considering univariate ($q = 1$) regression.
Consider splitting a regression tree $T$ at a node $t$. Let $s$ be a proposed split for a variable $X$ that splits $t$ into left and right daughter nodes $t_L := t_L(s)$ and $t_R := t_R(s)$, where $t_L$ are the cases $\{X_i \le s\}$ and $t_R$ are the cases $\{X_i > s\}$. Regression node impurity is determined by the within-node sample variance. The impurity of $t$ is
\[
\hat{\phi}(t) = \frac{1}{N}\sum_{X_i \in t} (Y_i - \overline{Y}_t)^2,
\]
where $\overline{Y}_t = \sum_{X_i \in t} Y_i / N$ is the sample mean for $t$ and $N = |t|$ is the sample size of $t$ (note that $N = n$ only when $t$ is the root node). The within sample variance for a daughter node, say $t_L$, is
\[
\hat{\phi}(t_L) = \frac{1}{N_L} \sum_{i \in X(s,t)} (Y_i - \overline{Y}_{t_L})^2, \qquad X(s,t) = \{X_i \in t,\; X_i \le s\},
\]
where $\overline{Y}_{t_L}$ is the sample mean for $t_L$ and $N_L$ is the sample size of $t_L$. The decrease in impurity under the split $s$ for $X$ equals
\[
\hat{\Delta}(s, t) = \hat{\phi}(t) - \bigl[\hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)\bigr],
\]
where $\hat{p}(t_L) = N_L/N$ and $\hat{p}(t_R) = N_R/N$. The optimal split-point $\hat{s}_N$ maximizes the decrease in impurity $\hat{\Delta}(s, t)$ (Chapter 8.4; Breiman et al., 1984), which is equivalent to minimizing
\[
\hat{D}_W(s, t) = \hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R),
\]
which can be written as
\[
\hat{D}_W(s, t) = \frac{1}{N}\sum_{i \in t_L} (Y_i - \overline{Y}_{t_L})^2 + \frac{1}{N}\sum_{i \in t_R} (Y_i - \overline{Y}_{t_R})^2.
\]
In other words, CART seeks the split-point $\hat{s}_N$ that minimizes the weighted sample variance.
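For illustration, a small base-R sketch of this univariate split search is given below; x and y are arbitrary numeric vectors and the exhaustive scan over unique split points is a simplification.

## Within-node sum of squares.
wss <- function(y) if (length(y) > 1) sum((y - mean(y))^2) else 0

## Best split on x under the weighted sample variance (squared error) rule.
best_variance_split <- function(x, y) {
  cand <- sort(unique(x))
  cand <- cand[-length(cand)]                   # avoid an empty right daughter
  dW <- sapply(cand, function(s) {
    left <- x <= s
    (wss(y[left]) + wss(y[!left])) / length(y)  # D_W(s, t)
  })
  list(split = cand[which.min(dW)], criterion = min(dW))
}

best_variance_split(mtcars$wt, mtcars$mpg)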

We extend the weighted sample variance rule to the multivariate case $q > 1$ by applying the splitting rule to each coordinate separately. We seek to minimize
\[
\hat{D}_W(s, t) = \sum_{j=1}^{q} W_j \left\{ \sum_{i \in t_L} (Y_{i,j} - \overline{Y}_{t_L j})^2 + \sum_{i \in t_R} (Y_{i,j} - \overline{Y}_{t_R j})^2 \right\},
\]
where $0 < W_j < 1$ are prespecified weights for weighting the importance of coordinate $j$ of the response $Y$, and $\overline{Y}_{t_L j}$ and $\overline{Y}_{t_R j}$ are the sample means of the $j$-th coordinate in the left and right daughter nodes. Notice that such a splitting rule can only be effective if each of the coordinates of $Y$ is measured on the same scale; otherwise a coordinate $j$ with, say, enormous values would dominate $\hat{D}_W(s, t)$. We can calibrate $\hat{D}_W(s, t)$ by using $W_j$, but it will be more convenient to assume that each coordinate has been standardized:
\[
\frac{1}{N}\sum_{i \in t} Y_{i,j} = 0, \qquad \frac{1}{N}\sum_{i \in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q. \tag{2.2}
\]
The standardization is applied prior to splitting the node $t$. With some elementary manipulations, it is easily verified that minimizing $\hat{D}_W(s, t)$ is equivalent to maximizing
\[
\hat{D}_W^{*}(s, t) = \sum_{j=1}^{q} W_j \left\{ \frac{1}{N_L}\left( \sum_{i \in t_L} Y_{i,j} \right)^{2} + \frac{1}{N_R}\left( \sum_{i \in t_R} Y_{i,j} \right)^{2} \right\}. \tag{2.3}
\]
Rule (2.3) is the multivariate splitting rule used for multivariate continuous outcomes.

Categorical (classification) multivariate splitting

Now we consider the effect of Gini splitting when $Y_{i,j}$ is categorical. First consider the univariate case (i.e., the multiclass problem) where the outcome $Y$ is a class label from $\{1, \ldots, K\}$ with $K \ge 2$. Consider growing a classification tree using Gini splitting. Let $\hat{p}_k(t)$ denote the class frequency for class $k$ in a node $t$. The Gini node impurity for $t$ is defined as
\[
\hat{\phi}(t) = \sum_{k=1}^{K} \hat{p}_k(t)\bigl(1 - \hat{p}_k(t)\bigr).
\]
The decrease in node impurity from splitting $X$ at $s$ is
\[
\hat{\Delta}(s, t) = \hat{\phi}(t) - \bigl[\hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)\bigr].
\]
The quantity
\[
\hat{G}(s, t) = \hat{p}(t_L)\,\hat{\phi}(t_L) + \hat{p}(t_R)\,\hat{\phi}(t_R)
\]
is the Gini index, which is minimized with respect to $s$.

Let $N_{k,L} = \sum_{i \in t_L} 1_{\{Y_i = k\}}$ and $N_{k,R} = \sum_{i \in t_R} 1_{\{Y_i = k\}}$. With some algebra one can show that minimizing $\hat{G}(s, t)$ is equivalent to maximizing
\[
\sum_{k=1}^{K} \frac{N_{k,L}^2}{N_L} + \sum_{k=1}^{K} \frac{N_{k,R}^2}{N_R}
= \sum_{k=1}^{K} \left[ \frac{1}{N_L}\left(\sum_{i \in t_L} Z_{i(k)}\right)^{2} + \frac{1}{N_R}\left(\sum_{i \in t_R} Z_{i(k)}\right)^{2} \right], \tag{2.4}
\]
where $Z_{i(k)} = 1_{\{Y_i = k\}}$. Notice the similarity to (2.3). When $q = 1$, the regression splitting rule (2.3) can be written as
\[
\hat{D}_W^{*}(s, t) = \frac{1}{N_L}\left(\sum_{i \in t_L} Y_i\right)^{2} + \frac{1}{N_R}\left(\sum_{i \in t_R} Y_i\right)^{2}.
\]
This is equivalent to each of the summands in (2.4) if we set $Y_i = Z_{i(k)}$. Indeed, (2.4) is equivalent to (2.3) if in (2.3) we replace $j$ with $k$, $q$ with $K$, set $Y_{i,j} = Z_{i(k)}$, and assume uniform weighting $W_1 = \cdots = W_K = 1/K$.

When $q > 1$ we apply Gini splitting to each coordinate of $Y$, yielding the extended Gini splitting rule
\[
\hat{G}^{*}(s, t) = \sum_{j=1}^{q} \frac{1}{K_j} \left[ \sum_{k=1}^{K_j} \left\{ \frac{1}{N_L}\left(\sum_{i \in t_L} Z_{i(k),j}\right)^{2} + \frac{1}{N_R}\left(\sum_{i \in t_R} Z_{i(k),j}\right)^{2} \right\} \right], \tag{2.5}
\]
where $Z_{i(k),j} = 1_{\{Y_{i,j} = k\}}$. Notice that (2.5) is equivalent to (2.3), but with an additional summation over the class labels $k = 1, \ldots, K_j$ for each $j$. The normalization $1/K_j$ employed for coordinate $j$ is required to standardize the contribution of the Gini split from that coordinate.

Multivariate mixed outcome splitting

The equivalence between Gini splitting for categorical responses and weighted variance splitting for continuous responses now points to a means of splitting multivariate mixed outcomes. Let $Q_R \subseteq \{1, \ldots, q\}$ denote the coordinates of the continuous outcomes in $Y$ and let $Q_C$ denote the coordinates having categorical outcomes. Notice that $Q_R \cup Q_C = \{1, \ldots, q\}$. The mixed outcome splitting rule is
\[
\hat{\Theta}(s, t) = \sum_{j \in Q_C} W_j \frac{1}{K_j}\sum_{k=1}^{K_j} \left\{ \frac{1}{N_L}\left(\sum_{i \in t_L} Z_{i(k),j}\right)^{2} + \frac{1}{N_R}\left(\sum_{i \in t_R} Z_{i(k),j}\right)^{2} \right\}
+ \sum_{j \in Q_R} W_j \left\{ \frac{1}{N_L}\left(\sum_{i \in t_L} Y_{i,j}\right)^{2} + \frac{1}{N_R}\left(\sum_{i \in t_R} Y_{i,j}\right)^{2} \right\}.
\]
The standard practice is to use uniform weighting $W_1 = \cdots = W_q = 1$. Also recall that the $Y_{i,j}$ for $j \in Q_R$ are standardized as in (2.2) prior to splitting $t$.

2.2.3 Survival outcome splitting

Segal, Intrator, and LeBlanc and Crowley use as the prediction rule the Kaplan-Meier estimate of the survival distribution, and as the splitting rule a test for measuring differences between distributions adapted to censored data, such as the log-rank test or the Wilcoxon test. These statistics are weighted versions of the log-rank statistic, where the weights allow flexibility in emphasizing differences between two survival curves at early times (the left tail of the distribution), middle times, or late times (the right tail of the distribution). In particular, an observation at time $t$ is weighted by
\[
Q(t) = \hat{S}(t)^{\rho}\,\bigl(1 - \hat{S}(t)\bigr)^{\gamma},
\]
where $\hat{S}$ is the Kaplan-Meier estimate of the survival curve for both samples combined. Thus we can obtain sensitivity to early occurring differences by taking $\rho > 0$ and $\gamma \approx 0$, we emphasize differences in the middle by taking $\rho \approx 1$ and $\gamma \approx 1$, and we emphasize late differences by taking $\rho \approx 0$ and $\gamma > 0$.

Three different survival splitting rules can be used: (i) a log-rank splitting rule, which is the default; (ii) a conservation-of-events splitting rule; and (iii) a log-rank score rule.

Notation

Assume we are at node $h$ of a tree during its growth and that we seek to split $h$ into two daughter nodes. We introduce some notation to help discuss how the various splitting rules work to determine the best split. Assume that within $h$ there are $n$ individuals. Denote their survival times and 0-1 censoring information by $(T_1, \delta_1), \ldots, (T_n, \delta_n)$. An individual $l$ will be said to be right censored at time $T_l$ if $\delta_l = 0$; otherwise the individual is said to have died at $T_l$ if $\delta_l = 1$. In the case of death, $T_l$ will be referred to as an event time, and the death as an event. An individual $l$ who is right censored at $T_l$ is simply known to have been alive at $T_l$, with the exact time of death unknown. A proposed split at node $h$ on a given predictor $x$ is always of the form $x \le c$ and $x > c$. Such a split forms two daughter nodes (a left and a right daughter) and two new sets of survival data. A good split maximizes survival differences across the two sets of data. Let $t_1 < t_2 < \cdots < t_N$ be the distinct death times in the parent node $h$, and let $d_{i,j}$ and $Y_{i,j}$ equal the number of deaths and individuals at risk at time $t_i$ in the daughter nodes $j = 1, 2$. Note that $Y_{i,j}$ is the number of individuals in daughter $j$ who are alive at time $t_i$, or who have an event (death) at time $t_i$. More precisely,
\[
Y_{i,1} = \#\{T_l \ge t_i,\; x_l \le c\}, \qquad Y_{i,2} = \#\{T_l \ge t_i,\; x_l > c\},
\]
where $x_l$ is the value of $x$ for individual $l = 1, \ldots, n$. Finally, define $Y_i = Y_{i,1} + Y_{i,2}$ and $d_i = d_{i,1} + d_{i,2}$. Let $n_j$ be the total number of observations in daughter $j$, so that $n = n_1 + n_2$. Note that $n_1 = \#\{l : x_l \le c\}$ and $n_2 = \#\{l : x_l > c\}$.

Log-rank splitting

The log-rank test for a split at the value $c$ for predictor $x$ is
\[
L(x, c) = \frac{\displaystyle\sum_{i=1}^{N} \left( d_{i,1} - Y_{i,1}\,\frac{d_i}{Y_i} \right)}
{\sqrt{\displaystyle\sum_{i=1}^{N} \frac{Y_{i,1}}{Y_i}\left(1 - \frac{Y_{i,1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}.
\]
The value $|L(x, c)|$ is the measure of node separation. The larger the value of $|L(x, c)|$, the greater the difference between the two groups, and the better the split. In particular, the best split at node $h$ is determined by finding the predictor $x^{*}$ and split value $c^{*}$ such that $|L(x^{*}, c^{*})| \ge |L(x, c)|$ for all $x$ and $c$.
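A direct base-R transcription of this statistic is sketched below; the inputs (time, status, x, c) are illustrative placeholders and no handling beyond the formula itself is attempted.

## Log-rank split statistic L(x, c) for the split x <= c versus x > c.
logrank_split <- function(time, status, x, c) {
  g1  <- x <= c                                  # membership in the left daughter
  dtm <- sort(unique(time[status == 1]))         # distinct death times in the parent node
  num <- den <- 0
  for (ti in dtm) {
    Yi  <- sum(time >= ti)                       # at risk in the parent
    Yi1 <- sum(time >= ti & g1)                  # at risk in the left daughter
    di  <- sum(time == ti & status == 1)         # deaths at ti
    di1 <- sum(time == ti & status == 1 & g1)
    num <- num + (di1 - Yi1 * di / Yi)
    if (Yi > 1)
      den <- den + (Yi1 / Yi) * (1 - Yi1 / Yi) * ((Yi - di) / (Yi - 1)) * di
  }
  num / sqrt(den)                                # |L(x, c)| measures node separation
}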

Log-rank score splitting

Another useful splitting rule is the log-rank score test introduced by Hothorn and Lausen (Hothorn et al., 2003). To describe this rule, assume the predictor $x$ has been ordered so that $x_1 \le x_2 \le \cdots \le x_n$. Now, compute the ranks for each survival time $T_l$,
\[
a_l = \delta_l - \sum_{k=1}^{\Gamma_l} \frac{\delta_k}{n - \Gamma_k + 1},
\]
where $\Gamma_k = \#\{t : T_t \le T_k\}$. The log-rank score test is defined as
\[
S(x, c) = \frac{\sum_{x_l \le c} a_l - n_1 \bar{a}}{\sqrt{n_1\left(1 - \frac{n_1}{n}\right) s_a^2}},
\]
where $\bar{a}$ and $s_a^2$ are the sample mean and sample variance of $\{a_l : l = 1, \ldots, n\}$. Log-rank score splitting defines the measure of node separation by $|S(x, c)|$; maximizing this value over $x$ and $c$ yields the best split.

2.3 Random forest

Random forest, introduced by Leo Breiman (Breiman, 2001), can be loosely viewed as an ensemble of CART trees. In random forests, the base learner is a binary recursive tree grown using random input selection. Its randomness comes from (1) selecting a small group of input variables at random to split on at each node, and (2) bootstrapping the original data. It is similar to bagging in that the terminal nodes of the trees contain the predicted values, which are tree-aggregated to obtain the ensemble predictor. In random forest, the bootstrapped sample for each tree is called the in-bag data, while the data that were not sampled are called the out-of-bag (OOB) data. The prediction accuracy of a random forest is assessed using the OOB data, the observations that were not used in constructing the tree. As a result, a more realistic estimate of prediction performance is provided. Unlike a single CART tree, the trees in a random forest usually are not pruned. Breiman (2001) showed that random forests do not overfit as the number of trees increases.
The random forest algorithm is as follows.

Algorithm 1 Random Forest algorithm

1: Draw a bootstrap sample of the original data.
2: Grow a tree using the data from Step 1. When growing the tree, at each node in the tree, determine the optimal split for the node using $m < p$ randomly selected variables. Grow the tree so that each terminal node contains no fewer than $n_0 \ge 1$ cases.
3: Repeat Steps 1-2, $B > 1$ times independently.
4: Combine the $B$ trees to form the ensemble predictor.
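As a minimal illustration of Algorithm 1 with off-the-shelf software, the sketch below fits a forest with the randomForest package; the built-in iris data and the particular tuning values are used purely for illustration.

library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris,
                   ntree    = 500,   # B: number of bootstrapped trees
                   mtry     = 2,     # m < p variables tried at each split
                   nodesize = 1)     # minimum terminal node size
print(rf)                            # reports the OOB error of the ensemble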

2.4 VIMP

Random forest is appreciated not only for its superior prediction accuracy, but also for its intrinsic ability to deal with complicated interactions. Multiple methods have been developed to perform variable selection using random forest. These methods are based on two measures: the variable importance measure (VIMP) and the minimal depth measure.

Variable importance, or VIMP, equals the change in prediction error, which can be either an increase or a decrease, when a particular variable is noised up by permutation, or by random assignment of observations to the daughter nodes when the parent node is split on the given variable.
Permutation importance is assessed by comparing the prediction accuracy, in terms of correct classification rate or mean squared error (MSE), of a tree before and after random permutation of a predictor variable. Permutation destroys the original association with the response, so the accuracy is expected to drop for a relevant predictor. Therefore, when the difference in accuracy before and after permutation is small, it indicates that $X_j$ is of little importance in predicting the response variable. In contrast, if the prediction accuracy drops substantially after the permutation, it indicates a strong association between $X_j$ and the response variable. One algorithm for computing the importance score by permutation is as follows.

Algorithm 2 Importance score calculation by permutation (by forest)

1: for i = 1, ..., n do
2: Compute the OOB accuracy of an RF
3: Permute the predictor variable of interest in the OOB observations
4: Compute the OOB accuracy of the RF
5: end for
6: The importance score is the average difference
22

Another strategy for calculating VIMP is to calculate the change in prediction accuracy by permuting within each tree (Ishwaran et al., 2007). A given variable $x_v$ is randomly permuted in the out-of-bag (OOB) data and the noised-up OOB data is dropped down the tree grown from the in-bag data. This is done for each tree in the forest and an out-of-bag estimate of prediction error is computed. The difference between this new prediction error and the original out-of-bag prediction error without the permutation is the VIMP of $x_v$. The R package randomForestSRC that we use in this study uses this strategy. The algorithm based on this strategy is as follows.

Algorithm 3 Importance score calculation by permutation(by Trees)


1: for k = 1, ..., N do
2: Compute the OOB accuracy of the kth tree
3: Permute the predictor variable of interest (xv ) in the OOB observations
4: Compute the OOB accuracy of the kth tree using the permuted data
5: Compute the change of OOB accuracy for the kth tree
6: end for
7: The importance score is the average change of OOB accuracy over all N trees
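Permutation VIMP is available in existing R packages; the following minimal sketch shows how it can be requested with two packages (both follow a tree-level permute-and-compare strategy by default, and exact details may differ from the pseudocode above; the iris data are used only for illustration).

library(randomForest)
rf1 <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf1, type = 1)      # mean decrease in accuracy (permutation importance)

library(randomForestSRC)
rf2 <- rfsrc(Species ~ ., data = iris, importance = TRUE)
rf2$importance                 # permutation (Breiman-Cutler) VIMP per variable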

As indicated by its name, the variable importance measure rates a variable's relevance for prediction. Not surprisingly, it has been used for variable selection as well. Variable selection methods using VIMP fall into one of two categories: performance-based methods or test-based methods. One example of a performance-based method is the algorithm proposed by Diaz-Uriarte and Alvarez de Andres (2006), which uses random forest and VIMP to select genes by iteratively fitting random forests. At each iteration, a new forest is built after discarding the variables (genes) with the smallest variable importances; the selected set of genes is the one that yields the same OOB error rate, within a predefined standard error, as the minimum error rate over all forests. In another example, Genuer et al. (2010) proposed a strategy involving a ranking of explanatory variables using VIMP and a stepwise ascending variable introduction strategy. Test-based variable selection methods apply a permutation test framework to estimate the significance of a variable's importance. For instance, in a method suggested by Hapfelmeier and Ulm, the VIMP for a variable is recomputed after the variable is permuted. This procedure is repeated many times to assess the empirical distribution of importances under the null hypothesis that the variable is not predictive of the outcome variable. A p-value, reflecting how extreme the original VIMP is within this empirical distribution, can then be calculated. Variables with p-values less than a predefined threshold are selected.

2.5 Minimal depth

Minimal depth, defined as the distance from the root node to the root of the closest maximal $v$-subtree for variable $v$, is another measure that can be used for variable selection (Ishwaran et al., 2010). It measures how far a case travels down a tree before encountering the first split on variable $v$. A small minimal depth for a variable implies high predictive ability of the variable. The smallest possible minimal depth is 0, which means the variable splits the root node.
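A hedged usage sketch for extracting forest-averaged minimal depth with randomForestSRC is shown below; the function and component names follow the package documentation but may differ across versions, and the iris data are again only illustrative.

library(randomForestSRC)

rf <- rfsrc(Species ~ ., data = iris)
md <- max.subtree(rf)
md$order[, 1]    # first-order minimal depth of each variable (smaller = more predictive)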
Chapter 3

Existing RF Missing Data


Approaches

The existing RF missing data approaches are the proximity approach, the "on-the-fly-imputation" (OTFI) approach, and the missForest approach. A disadvantage of the proximity approach is that OOB (out-of-bag) estimates of prediction error are biased. MissForest appears to be superior in imputation accuracy; however, it is computationally costly. The KNN imputation algorithm is also described here because trees are adaptive nearest neighbors.

3.1 Missing data approaches in CART

One commonly used algorithms for treating missing data in CART was based on
the idea of a surrogate split [Chapter 5.3, Breiman et al. (1984)]. If s is the best
split for a variable x, the surrogate split for s is the split s? using some other variable
x? such that s? and s are closest to one another in terms of predictive association
[Breiman et al. (1984)]. The CART algorithm uses the best surrogate split among
those variables not missing for the case to assign one that have a missing value for
the variable used to split a node.

Surrogate splitting is not suited for random forests, since RF randomly selects variables when splitting a node. A reasonable surrogate split may not exist within a node, as the randomly selected candidate splitting variables may be uncorrelated. Speed is another issue: finding a surrogate split is computationally intensive and may become infeasible when growing a large number of trees for random forests. A third concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as VIMP. For these reasons, a different strategy is required for RF.

3.2 Overall strategies of missing data approaches in RF

Three strategies can be and have been used for random forest missing value imputation.

1. Preimpute the data, then grow the forest, and update the original missing
values using certain criteria (such as proximity), based on the grown forest;
iterate for improved results.

2. Impute as the trees are grown; update by summary imputation; iterate for
improved results.

3. Preimpute, grow forest for each variable that has missing values, predict the
missing values using the grown forest, update the missing values with the
predicted values; iterate for improved results.

3.3 Proximity imputation

The proximity approach (Breiman, 2003; Liaw and Wiener, 2002) uses strategy one, while the adaptive tree method (Ishwaran et al., 2008) uses strategy two. A newer imputation method, missForest (Stekhoven and Buhlmann, 2012), which predicts the missing values of a variable using a forest grown with that variable as the response, was proposed in 2011 and falls into the third strategy.

The proximity approach works as follows. First, the data are roughly imputed by replacing missing values for continuous variables with the median of the non-missing values, and by replacing missing values for categorical variables with the most frequently occurring non-missing value. An RF is then fit to the roughly imputed data, and a 'proximity matrix' is calculated from the fitted RF. The proximity matrix, an $n \times n$ symmetric matrix whose $(i, j)$ entry records the frequency with which cases $i$ and $j$ occur within the same terminal node, is then used for imputing the data. For continuous variables, the missing values are imputed with the proximity-weighted average of the non-missing data; for integer (categorical) variables, the missing values are imputed with the integer value having the largest average proximity over the non-missing data. The updated data are then used as input to a new RF, and the procedure is iterated. The iterations end when a stable solution is reached.
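A simplified sketch of one imputation cycle for a single continuous variable is given below, using the proximity matrix returned by the randomForest package; the variable names are placeholders, the predictors are assumed to have been rough-imputed already, and the iteration is omitted.

library(randomForest)

proximity_impute <- function(x, predictors, y) {
  ## x: the variable being imputed (contains NAs); predictors: rough-imputed covariates.
  rf   <- randomForest(x = predictors, y = y, proximity = TRUE)
  prox <- rf$proximity                        # n x n terminal-node co-occurrence frequencies
  mis  <- which(is.na(x))
  obs  <- which(!is.na(x))
  for (i in mis) {
    w    <- prox[i, obs]
    x[i] <- sum(w * x[obs]) / sum(w)          # proximity-weighted average of non-missing values
  }
  x
}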
The disadvantage of the proximity approach is that OOB estimates of prediction error are biased, generally on the order of 10-20% (Breiman, 2003). Further, because prediction error is biased, so are other measures based on it, such as VIMP. In addition, the proximity approach does not work for predicting test data with missing values. The adaptive tree imputation method addresses these issues by adaptively imputing missing data as a tree is grown, drawing randomly from the set of non-missing in-bag data within the working node. The imputation procedure is summarized as follows:

1. For each node $h$, impute missing data by drawing a random value from the in-bag non-missing values prior to splitting.

2. After the splitting, reset the imputed data in the daughter nodes to missing.
Proceed as in Step 1 until the tree can no longer be split.

3. The final summary imputed value is the average of the case’s imputed in-bag
values for a continuous variable, or the most frequent in-bag imputed value
for a categorical variable.

Different splitting rules, such as outcome splitting, random splitting, and unsupervised splitting, can be used for random forest, and accordingly these different splitting rules can be used in the process of random forest based missing data imputation. In random splitting, nsplit, a non-zero positive integer, needs to be specified by the user. A maximum of nsplit split points are chosen randomly for each of the potential splitting variables within a node. This is in contrast to non-random splitting, where all possible split points for each of the potential splitting variables are considered. The splitting rule is applied to the nsplit randomly selected split points, and the node is split on the variable whose random split point yields the best value, as measured by splitting rules such as weighted mean-squared error or the Gini index. Depending on the splitting rule, an outcome variable may or may not be required: in outcome splitting, an outcome variable is required, while in completely random splitting and unsupervised splitting, no outcome variable is required. A comparison of different splitting rules, supervised versus unsupervised, for random forest based missing data imputation was carried out in this study.

MissForest uses an iterative imputation scheme: an RF is trained on the observed values of each variable in turn, the missing values are predicted from the fitted forest, and the process is repeated until a stopping criterion is met (Stekhoven and Buhlmann, 2012).
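A minimal usage sketch of the missForest package is shown below; NAs are introduced into the built-in iris data purely for illustration.

library(missForest)

set.seed(3)
iris_mis <- prodNA(iris, noNA = 0.1)   # prodNA() from missForest adds 10% missing values
fit      <- missForest(iris_mis)
head(fit$ximp)                         # the imputed data set
fit$OOBerror                           # OOB imputation error (NRMSE for continuous, PFC for categorical)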

3.4 RF imputation

RF imputation refers to the "on-the-fly-imputation" (OTFI), or adaptive tree, imputation method (Ishwaran et al., 2008) that is implemented in the R package randomForestSRC. In this method, missing data are replaced with values drawn randomly from the non-missing in-bag data (the bootstrap samples used to grow the trees, in contrast to the out-of-bag data, which are only used for model validation) within the splitting node. That is, missing data are imputed prior to splitting at each node, so the daughter nodes contain no missing data because the parent node is imputed before splitting. Imputation is therefore carried out adaptively as a tree is grown, and all the missing values are imputed at the end of each iteration. Weighted mean-squared error splitting is used for continuous outcomes, and Gini index splitting is used for categorical outcomes (Breiman et al., 1984).

3.5 KNN imputation algorithm

The KNN-based imputation method was proposed in 2001 and works as follows. If we consider a row A that has one missing value, in column 1, the KNN imputation method finds the K other rows that do not have a missing value in column 1 and have the smallest Euclidean distance from row A. A weighted average of the values in column 1 from these K closest rows is then used as the estimate for the missing value in row A. See Algorithm 4 for more details.

Algorithm 4 KNNimpute algorithm


Require: X, an m-row (genes) by n-column (experiments) matrix; rowmax and colmax
1: Check if there exists any column that has more than colmax missing values
2: if Yes then
3: Halt and report error
4: else
5: log-transform the data
6: end if
7: for i = 1, ..., m do
8: if row i has more than rowmax missing values then
9: Use the column mean as the imputation of the missing values
10: else
11: if row i has at least one but no more than rowmax missing values then
12: posi <- the positions where row i has missing values
13: Find the rows that have no missing values in columns posi
14: Calculate the average Euclidean metric of row i with these rows
15: k nearest neighbors for row i <- the k rows with the smallest average Euclidean distance from row i
16: Use the corresponding average value of these k nearest neighbors as the imputed value
17: end if
18: end if
19: end for
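The following simplified base-R sketch captures the core of this procedure for a numeric matrix; it skips the rowmax/colmax checks and log-transform, restricts donors to fully observed rows, and uses an unweighted average of the neighbors.

knn_impute <- function(X, k = 10) {
  donors <- which(rowSums(is.na(X)) == 0)             # candidate rows with no missing values
  for (i in which(rowSums(is.na(X)) > 0)) {
    pos <- which(is.na(X[i, ]))                       # columns missing in row i
    obs <- setdiff(seq_len(ncol(X)), pos)             # columns observed in row i
    ## Euclidean distance from row i to each donor, over the observed columns.
    d   <- sqrt(colSums((t(X[donors, obs, drop = FALSE]) - X[i, obs])^2))
    nn  <- donors[order(d)][seq_len(min(k, length(donors)))]
    X[i, pos] <- colMeans(X[nn, pos, drop = FALSE])   # average of the k nearest donors
  }
  X
}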
Chapter 4

Novel Enhancements of RF Missing


Data Approaches

Three general strategies have been used for RF missing data imputation:

(A) Preimpute the data; grow the forest; update the original missing values using
proximity of the data. Iterate for improved results.

(B) Simultaneously impute data while growing the forest; iterate for improved
results.

(C) Preimpute the data; grow a forest using in turn each variable that has missing
values; predict the missing values using the grown forest. Iterate for improved
results.

Proximity imputation (Breiman, 2003) uses strategy A, on-the-fly-imputation (OTFI) (Ishwaran et al., 2008) uses strategy B, and missForest (Stekhoven and Buhlmann, 2012) uses strategy C. Below we detail each of these strategies and describe various algorithms which utilize one of these three approaches. These new algorithms take advantage of new splitting rules, including random splitting, unsupervised splitting, and multivariate splitting (Ishwaran et al., 2016).


4.0.1 Strawman imputation

We begin by describing a “strawman imputation” which will be used throughout as our baseline reference value. While this algorithm is rough, it is also very rapid, and for this reason it was also used to initialize some of our RF procedures. Strawman imputation is defined as follows: missing values for continuous variables are imputed using the median of the non-missing values, and for missing categorical variables, the most frequently occurring non-missing value is used (ties are broken at random).
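A base-R sketch of this baseline is given below; it operates on a data frame with numeric and factor columns and mirrors the median/most-frequent rule just described.

strawman_impute <- function(dat) {
  for (j in seq_along(dat)) {
    mis <- is.na(dat[[j]])
    if (!any(mis)) next
    if (is.numeric(dat[[j]])) {
      dat[[j]][mis] <- median(dat[[j]], na.rm = TRUE)   # median for continuous variables
    } else {
      tab <- table(dat[[j]])                            # most frequent non-missing value
      top <- names(tab)[tab == max(tab)]
      dat[[j]][mis] <- sample(top, 1)                   # ties broken at random
    }
  }
  dat
}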

4.0.2 Proximity imputation: RFprx and RFprxR

Here we describe proximity imputation (strategy A). In this procedure the data are first roughly imputed using strawman imputation. An RF is fit using this imputed data. Using the resulting forest, the $n \times n$ symmetric proximity matrix ($n$ equals the sample size) is determined, where the $(i, j)$ entry records the in-bag frequency with which cases $i$ and $j$ share the same terminal node. The proximity matrix is used to impute the original missing values. For continuous variables, the proximity-weighted average of the non-missing data is used; for categorical variables, the value with the largest average proximity over the non-missing data is used. The updated data are used to grow a new RF, and the procedure is iterated.

We use RFprx to refer to proximity imputation as described above. However, when implementing RFprx we use a slightly modified version that makes use of random splitting in order to increase computational speed. In random splitting, nsplit, a non-zero positive integer, is specified by the user. A maximum of nsplit split points are chosen randomly for each of the randomly selected mtry splitting variables. This is in contrast to the non-random (deterministic) splitting typically used by RF, where all possible split points for each of the potential mtry splitting variables are considered. The splitting rule is applied to the nsplit randomly selected split points, and the tree node is split on the variable whose random split point yields the best value, as measured by the splitting criterion. Random splitting evaluates the splitting rule over a much smaller number of split points and is therefore considerably faster than deterministic splitting.

The limiting case of random splitting is pure random splitting. The tree node is split by selecting a variable and the split-point completely at random; no splitting rule is applied, i.e., splitting is completely non-adaptive to the data. Pure random splitting is generally the fastest type of random splitting. We also apply RFprx using pure random splitting; this algorithm is denoted by RFprxR.

As an extension to the above methods, we implement iterated versions of RFprx and RFprxR. To distinguish between the different algorithms, we write RFprx.k and RFprxR.k when they are iterated $k \ge 1$ times. Thus, RFprx.5 and RFprxR.5 indicate that the algorithms were iterated 5 times, while RFprx.1 and RFprxR.1 indicate that the algorithms were not iterated. As this latter notation is somewhat cumbersome, for notational simplicity we will simply use RFprx to denote RFprx.1 and RFprxR to denote RFprxR.1.

4.0.3 On-the-fly-imputation (OTFI): RFotf and RFotfR

A disadvantage of the proximity approach is that OOB (out-of-bag) estimates for


prediction error are biased (Breiman, 2003). Further, because prediction error
is biased, so are other measures based on it, such as variable importance. The
method is also awkward to implement on test data with missing values. The OTFI
method (Ishwaran et al., 2008) (strategy B) was devised to address these issues.

Specific details of OTFI can be found in (Ishwaran et al., 2008, 2016), but for
convenience we summarize the key aspects of OTFI below:

1. Only non-missing data is used to calculate the split-statistic for splitting a tree
node.

2. When assigning left and right daughter node membership if the variable used
to split the node has missing data, missing data for that variable is “imputed”
by drawing a random value from the inbag non-missing data.

3. Following a node split, imputed data are reset to missing and the process
is repeated until terminal nodes are reached. Note that after terminal node
assignment, imputed data are reset back to missing, just as was done for all
nodes.

4. Missing data in terminal nodes are then imputed using OOB non-missing
terminal node data from all the trees. For integer valued variables, a maximal
class rule is used; a mean rule is used for continuous variables.

It should be emphasized that the purpose of the “imputed data” in Step 2 is only
to make it possible to assign cases to daughter nodes—imputed data is not used to
calculate the split-statistic, and imputed data is only temporary and reset to missing
after node assignment. Thus, at the completion of growing the forest, the resulting
forest contains missing values in its terminal nodes and no imputation has been
done up to this point. Step 4 is added as a means for imputing the data, but this step
could be skipped if the goal is to use the forest in analysis situations. In particular,
step 4 is not required if the goal is to use the forest for prediction. This applies
even when test data used for prediction has missing values. In such a scenario, test
data assignment works in the same way as in step 2. That is, for missing test values,
values are imputed as in step 2 using the original grow distribution from the training forest, and the test case assigned to its daughter node. Following this, the missing
test data is reset back to missing as in step 3, and the process repeated.
This method of assigning cases with missing data, which is well suited for
forests, is in contrast to surrogate splitting utilized by CART (Breiman et al., 1984).
To assign a case having a missing value for the variable used to split a node, CART
uses the best surrogate split among those variables not missing for the case. This
ensures every case can be classified optimally, whether the case has missing values
or not. However, while surrogate splitting works well for CART, the method is not
well suited for forests. Computational burden is one issue. Finding a surrogate
split is computationally expensive even for one tree, let alone for a large number
of trees. Another concern is that surrogate splitting works tangentially to random
feature selection used by forests. In RF, variables used to split a node are selected
randomly, and as such they may be uncorrelated, and a reasonable surrogate split
may not exist. Another concern is that surrogate splitting alters the interpretation of
a variable, which affects measures such as variable importance measures (Ishwaran
et al., 2008).
To denote the OTFI missing data algorithm, we will use the abbreviation RFotf .
As in proximity imputation, to increase computational speed, RFotf is implemented
using nsplit random splitting. We also consider OTFI under pure random split-
ting and denote this algorithm by RFotfR . Both algorithms are iterated in our studies.
RFotf , RFotfR will be used to denote a single iteration, while RFotf.5 , RFotfR.5 denotes
five iterations. Note that when OTFI algorithms are iterated, the terminal node im-
putation executed in step 4 uses inbag data and not OOB data after the first cycle.
This is because after the first cycle of the algorithm, no coherent OOB sample ex-
ists.

Remark 1. As noted by one of our referees, missingness incorporated in attributes


(MIA) is another tree splitting method which bypasses the need to impute data Twala
et al. (2008); Twala and Cartwright (2010). Again, this only applies if the user is
interested in a forest analysis. MIA accomplishes this by treating missing values
as a category which is incorporated into the splitting rule. Let X be an ordered or
numeric feature being used to split a node. The MIA splitting rule searches over all
possible split values s of X of the following form:

Split A: {X  s or X = missing} versus {X > s}.

Split B: {X  s} versus {X > s or X = missing}.

Split C: {X = missing} versus {X = not missing}.

Thus, like OTF splitting, one can see that MIA results in a forest ensemble con-
structed without having to impute data.
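To make the MIA search concrete, the following R sketch evaluates splits A, B, and C for one candidate split point of a numeric feature. It is our own illustration, not the splitting code used by any package: the helper name mia_split_stat and the choice of within-node sum of squares as the impurity measure are illustrative assumptions.

## Sketch of evaluating the three MIA split forms for a numeric feature x with
## missing values and a continuous outcome y. Smaller impurity is better.
mia_split_stat <- function(x, y, s) {
  miss <- is.na(x)
  sse  <- function(v) if (length(v) == 0) 0 else sum((v - mean(v))^2)
  c(A = sse(y[(!miss & x <= s) | miss]) + sse(y[!miss & x > s]),  # {X <= s or missing} vs {X > s}
    B = sse(y[!miss & x <= s]) + sse(y[(!miss & x > s) | miss]),  # {X <= s} vs {X > s or missing}
    C = sse(y[miss]) + sse(y[!miss]))                             # {missing} vs {not missing}
}

## Example: evaluate several candidate split values.
set.seed(1)
x <- rnorm(100); y <- x + rnorm(100); x[sample(100, 20)] <- NA
cand  <- quantile(x, probs = seq(0.1, 0.9, 0.1), na.rm = TRUE)
sapply(cand, function(s) mia_split_stat(x, y, s))  # rows = forms A/B/C, columns = candidates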

4.0.4 Unsupervised imputation: RFunsv

RFunsv refers to OTFI using multivariate unsupervised splitting. However unlike


the OTFI algorithm, RFotf , the RFunsv algorithm is unsupervised and assumes there
is no response (outcome) variable. Instead a multivariate unsupervised splitting
rule (Ishwaran et al., 2016) is implemented. As in the original RF algorithm, at
each tree node t, a set of mtry variables are selected as potential splitting variables.
However, for each of these, as there is no outcome variable, a random set of ytry
variables is selected and defined to be the multivariate response (pseudo-outcomes).
A multivariate composite splitting rule of dimension ytry (see below) is applied
to each of the mtry multivariate regression problems and the node t split on the
variable leading to the best split. Missing values in the response are excluded when
computing the composite multivariate splitting rule: the split-rule is averaged over
non-missing responses only (Ishwaran et al., 2016). We also consider an iterated


RFunsv algorithm (e.g. RFunsv.5 implies five iterations, RFunsv implies no iterations).
Here is the description of the multivariate composite splitting rule. We begin by
considering univariate regression. For notational simplicity, let us suppose the node
t we are splitting is the root node based on the full sample size n. Let X be the
feature used to split t, where for simplicity we assume X is ordered or numeric. Let
s be a proposed split for X that splits t into left and right daughter nodes t_L := t_L(s)
and t_R := t_R(s), where t_L = {X_i <= s} and t_R = {X_i > s}. Let n_L = #t_L and
n_R = #t_R denote the sample sizes for the two daughter nodes. If Y_i denotes the
outcome, the squared-error split-statistic for the proposed split is

\[
D(s,t) = \frac{1}{n}\sum_{i\in t_L}\bigl(Y_i - \bar{Y}_{t_L}\bigr)^2
       + \frac{1}{n}\sum_{i\in t_R}\bigl(Y_i - \bar{Y}_{t_R}\bigr)^2
\]

where \bar{Y}_{t_L} and \bar{Y}_{t_R} are the sample means for t_L and t_R respectively. The best split
for X is the split-point s minimizing D(s,t). To extend the squared-error splitting
rule to the multivariate case q > 1, we apply univariate splitting to each response
coordinate separately. Let Y_i = (Y_{i,1}, ..., Y_{i,q})^T denote the q x 1 dimensional
outcome. For multivariate regression analysis, an averaged standardized variance
splitting rule is used. The goal is to minimize

\[
D_q(s,t) = \sum_{j=1}^{q}\left\{ \sum_{i\in t_L}\bigl(Y_{i,j} - \bar{Y}_{t_L j}\bigr)^2
                               + \sum_{i\in t_R}\bigl(Y_{i,j} - \bar{Y}_{t_R j}\bigr)^2 \right\}
\]

where \bar{Y}_{t_L j} and \bar{Y}_{t_R j} are the sample means of the j-th response coordinate in the
left and right daughter nodes. Notice that such a splitting rule can only be effective
if each of the coordinates of the outcome are measured on the same scale, otherwise
we could have a coordinate j, with say very large values, and its contribution would
dominate D_q(s,t). We therefore calibrate D_q(s,t) by assuming that each coordinate
has been standardized according to

\[
\frac{1}{n}\sum_{i\in t} Y_{i,j} = 0, \qquad \frac{1}{n}\sum_{i\in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q.
\]

The standardization is applied prior to splitting a node. To make this standardization
clear, we denote the standardized responses by Y^*_{i,j}. With some elementary
manipulations, it can be verified that minimizing D_q(s,t) is equivalent to maximizing

\[
D_q^*(s,t) = \sum_{j=1}^{q}\left\{ \frac{1}{n_L}\Bigl(\sum_{i\in t_L} Y_{i,j}^*\Bigr)^2
                                 + \frac{1}{n_R}\Bigl(\sum_{i\in t_R} Y_{i,j}^*\Bigr)^2 \right\}. \qquad (4.1)
\]

For multivariate classification, an averaged standardized Gini splitting rule is used.
First consider the univariate case (i.e., the multiclass problem) where the outcome
Y_i is a class label from the set {1, ..., K} where K >= 2. The best split s for X is
obtained by maximizing

\[
G(s,t) = \sum_{k=1}^{K}\left[ \frac{1}{n_L}\Bigl(\sum_{i\in t_L} Z_{i(k)}\Bigr)^2
                            + \frac{1}{n_R}\Bigl(\sum_{i\in t_R} Z_{i(k)}\Bigr)^2 \right]
\]

where Z_{i(k)} = 1\{Y_i = k\}. Now consider the multivariate classification scenario r > 1,
where each outcome coordinate Y_{i,j} for 1 <= j <= r is a class label from {1, ..., K_j}
for K_j >= 2. We apply Gini splitting to each coordinate yielding the extended Gini
splitting rule

\[
G_r^*(s,t) = \sum_{j=1}^{r}\left[ \frac{1}{K_j}\sum_{k=1}^{K_j}\left\{ \frac{1}{n_L}\Bigl(\sum_{i\in t_L} Z_{i(k),j}\Bigr)^2
                                 + \frac{1}{n_R}\Bigl(\sum_{i\in t_R} Z_{i(k),j}\Bigr)^2 \right\} \right] \qquad (4.2)
\]

where Z_{i(k),j} = 1\{Y_{i,j} = k\}. Note that the normalization 1/K_j employed for a coordinate
j is required to standardize the contribution of the Gini split from that
coordinate.
Observe that (4.1) and (4.2) are equivalent optimization problems, with optimization
over Y^*_{i,j} for regression and Z_{i(k),j} for classification. As shown in (Ishwaran,
2015) this leads to similar theoretical splitting properties in regression and
classification settings. Given this equivalence, we can combine the two splitting
rules to form a composite splitting rule. The mixed outcome splitting rule Theta(s,t)
is a composite standardized split rule of mean-squared error (4.1) and Gini index
splitting (4.2); i.e.,

\[
\Theta(s,t) = D_q^*(s,t) + G_r^*(s,t),
\]

where p = q + r. The best split for X is the value of s maximizing Theta(s,t).
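As an illustration of how the composite rule combines the two pieces, here is a minimal R sketch (our own simplified illustration, not the randomForestSRC source code; the function name composite_split_stat is hypothetical) that computes D*_q(s,t) + G*_r(s,t) for one candidate split of a node, given node-standardized continuous pseudo-outcomes and factor pseudo-outcomes.

## ycont: n x q matrix of node-standardized continuous responses; yfact: data.frame
## of factor responses; left: logical vector marking the left daughter node.
composite_split_stat <- function(ycont, yfact, left) {
  nL <- sum(left); nR <- sum(!left)
  ## D*_q: for each continuous coordinate, (1/nL)(sum_L y*)^2 + (1/nR)(sum_R y*)^2
  Dq <- sum(apply(ycont, 2, function(y)
    sum(y[left])^2 / nL + sum(y[!left])^2 / nR))
  ## G*_r: for each factor coordinate, average the class-indicator terms over 1/Kj
  Gr <- sum(sapply(yfact, function(y) {
    mean(sapply(levels(y), function(k) {
      z <- as.numeric(y == k)
      sum(z[left])^2 / nL + sum(z[!left])^2 / nR
    }))
  }))
  Dq + Gr
}

## Example: standardize two continuous responses within the node, then score a split.
set.seed(1)
n <- 50
ycont <- scale(cbind(rnorm(n), rnorm(n)))  # node-level standardization (scale() uses the n-1 variance; close enough for illustration)
yfact <- data.frame(a = factor(sample(1:3, n, TRUE)))
x <- rnorm(n)
composite_split_stat(ycont, yfact, left = (x <= median(x)))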

Remark 2. As discussed in (Segal and Yuanyuan, 2011), multivariate regression


splitting rules patterned after the Mahalanobis distance can be used to incorpo-
rate correlation between response coordinates. Let \bar{Y}_L and \bar{Y}_R be the multivariate
means for Y in the left and right daughter nodes, respectively. The following Mahalanobis
distance splitting rule was discussed in (Segal and Yuanyuan, 2011)

\[
M_q(s,t) = \sum_{i\in t_L}\bigl(Y_i - \bar{Y}_L\bigr)^T V_L^{-1}\bigl(Y_i - \bar{Y}_L\bigr)
         + \sum_{i\in t_R}\bigl(Y_i - \bar{Y}_R\bigr)^T V_R^{-1}\bigl(Y_i - \bar{Y}_R\bigr)
\]

where VL and VR are the estimated covariance matrices for the left and right
daughter nodes. While this is a reasonable approach in low dimensional problems,
recall that we are applying Dq (s, t) to ytry of the feature variables which could be
large if the feature space dimension p is large. Also, because of missing data in the
features, it may be difficult to derive estimators for VL and VR , which is further
complicated if their dimensions are high. This problem becomes worse as the tree
is grown because the number of observations decreases rapidly making estimation


unstable. For these reasons, we use the splitting rule Dq (s, t) rather than Mq (s, t)
when implementing imputation.

4.0.5 mForest imputation: mRFa and mRF

The missForest algorithm recasts the missing data problem as a prediction problem.
Data is imputed by regressing each variable in turn against all other variables and
then predicting missing data for the dependent variable using the fitted forest. With
p variables, this means that p forests must be fit for each iteration, which can be
slow in certain problems. Therefore, we introduce a computationally faster version
of missForest, which we call mForest. The new algorithm is described as follows.
Do a quick strawman imputation of the data. The p variables in the data set are
randomly assigned into mutually exclusive groups of approximate size ap where
0 < a < 1. Each group in turn acts as the multivariate response to be regressed
on the remaining variables (of approximate size (1 a)p). Over the multivariate
responses, set imputed values back to missing. Grow a forest using composite mul-
tivariate splitting. As in RFunsv , missing values in the response are excluded when
using multivariate splitting: the split-rule is averaged over non-missing responses
only. Upon completion of the forest, the missing response values are imputed using
prediction. Cycle over all of the (approximately) 1/a multivariate regressions in turn, thus completing
one iteration. Check whether the imputed data has converged, i.e., whether the
difference between the current and the previously imputed data (measured as described
below) is within an epsilon-tolerance. Stop if it is; otherwise repeat
until convergence.
To distinguish between mForest under different a values we use the notation


mRFa to indicate mForest fit under the specified a. Note that the limiting case a =
1/p corresponds to missForest. Although technically the two algorithms missForest
and mRFa for a = 1/p are slightly different, we did not find significant differences
between them during informal experimentation. Therefore for computational speed,
missForest was taken to be mRFa for a = 1/p and denoted simply as mRF.
The key steps of mRFa are sketched in Algorithm 5 below.
Let X be the n x p matrix with missing values, and let X* and X** be two possible different
imputed matrices for X. We define E(X*, X**), the difference of imputation
accuracy, as follows. As described above, values of X_j were made missing under
various missing data assumptions. Let I_j = (I_{1,j}, ..., I_{n,j}) be a vector of zeroes and
ones indicating which values of X_j were artificially made missing. Define I_{i,j} = 1
if X_{i,j} is artificially missing; otherwise I_{i,j} = 0. Let n_j = \sum_{i=1}^{n} I_{i,j} be the number
of artificially induced missing values for X_j.

Let N and C be the set of nominal (continuous) and categorical (factor) variables
with more than one artificially induced missing value. That is,

N = {j : X_j is nominal and n_j > 1}

C = {j : X_j is categorical and n_j > 1}.

The difference between the two imputations is then defined as

\[
E(X^*, X^{**}) = \frac{1}{\#N}\sum_{j\in N}
\sqrt{\frac{\sum_{i=1}^{n} I_{i,j}\bigl(X^{**}_{i,j}-X^{*}_{i,j}\bigr)^2/n_j}
           {\sum_{i=1}^{n} I_{i,j}\bigl(X^{*}_{i,j}-\bar{X}^{*}_{j}\bigr)^2/n_j}}
\;+\;
\frac{1}{\#C}\sum_{j\in C}\frac{\sum_{i=1}^{n} I_{i,j}\,1\{X^{**}_{i,j}\ne X^{*}_{i,j}\}}{n_j},
\]

where \bar{X}^{*}_{j} = \sum_{i=1}^{n} I_{i,j}X^{*}_{i,j}/n_j. Note that a standardized root-mean-squared error
(RMSE) was used to assess imputation difference for nominal variables, and
misclassification error for factors.

Algorithm 5 multivariate missForest (mForest) algorithm

Require: X an n x p matrix, stopping criterion epsilon, grouping factor alpha, 0 < alpha <= 1
1: Record which variables and which positions have missing values in X
2: p' <- number of variables that have missing values
3: Ximp <- quick and rough (strawman) imputation of X
4: Set diff = 1
5: while diff >= epsilon do
6:   Xold.imp <- Ximp
7:   Randomly separate the p' variables into K = K(alpha) groups of approximately
     the same size (in the limiting case alpha = 1/p', K = p', i.e., missForest)
8:   for i = 1, ..., K do
9:     Let Xi be the columns of X corresponding to group i, and X(-i) the columns
       of X excluding group i
10:    Set the values of Ximp in group i that were originally missing back to NA
11:    Fit a multivariate random forest using the variables in group i as response
       variables and the rest as predictor variables. Note: ONLY the non-missing
       values of Xi are used in calculating the composite split rule
12:    Ximp <- the final summary imputed values, using the terminal node average
       for continuous variables and the maximal terminal node class rule
       for categorical variables
13:  end for
14:  Set diff = E(Xold.imp, Ximp)
15: end while
16: Return the imputed matrix Ximp
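For readers who prefer working code, here is a minimal R sketch of the outer mForest loop. It is our own illustration, not the randomForestSRC implementation: the helper fit_group_and_impute() is a placeholder for step 11 (a multivariate forest fit on group i used to predict its missing entries), and here it simply refills with column medians/modes so that the sketch runs end to end; the convergence check is likewise a simplified stand-in for E(X*, X**).

fit_group_and_impute <- function(Ximp, group_cols, miss_idx) {
  for (j in group_cols) {
    rows <- miss_idx[[j]]
    if (length(rows) == 0) next
    obs <- Ximp[-rows, j]
    Ximp[rows, j] <- if (is.numeric(obs)) median(obs) else names(which.max(table(obs)))
  }
  Ximp
}

mforest_sketch <- function(X, alpha = 0.25, eps = 0.01, max.iter = 10) {
  miss_idx <- lapply(X, function(x) which(is.na(x)))        # step 1: missing positions
  pprime   <- names(X)[sapply(miss_idx, length) > 0]        # step 2: variables with NAs
  Ximp     <- fit_group_and_impute(X, names(X), miss_idx)   # step 3: strawman fill
  K        <- min(length(pprime), max(1, round(1 / alpha))) # roughly 1/alpha groups
  diff <- Inf; iter <- 0
  while (diff >= eps && iter < max.iter) {
    Xold <- Ximp; iter <- iter + 1
    groups <- split(sample(pprime), rep(seq_len(K), length.out = length(pprime)))
    for (g in groups) {
      for (j in g) Ximp[miss_idx[[j]], j] <- NA             # step 10: reset group to NA
      Ximp <- fit_group_and_impute(Ximp, g, miss_idx)       # steps 11-12 (placeholder)
    }
    num <- pprime[sapply(X[pprime], is.numeric)]            # simplified stand-in for
    if (length(num) == 0) break                             # E(Xold, Ximp): mean change
    diff <- mean(unlist(lapply(num, function(j)             # over imputed numeric cells
      abs(Ximp[miss_idx[[j]], j] - Xold[miss_idx[[j]], j]))))
  }
  Ximp
}

## Example on a built-in data set with missing values:
head(mforest_sketch(airquality, alpha = 0.5))

In practice the placeholder would be replaced by the multivariate forest fit (for example through the randomForestSRC package used throughout this work); the sketch is only meant to make the grouping and convergence logic of Algorithm 5 concrete.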
Chapter 5

Imputation Performance

Performance of RF algorithms, including proximity imputation, on the fly algorithms,
unsupervised imputation, missForest, and multivariate missForest (mForest),
was assessed using a large scale experiment of 60 data sets under different
missing data mechanisms and levels of missingness. RF algorithms were
found to be robust overall, performing well in the presence of moderate to high
missingness, and sometimes even with missing not at random data. Performance was
especially good for correlated data. In such cases, the most accurate algorithm
was missForest; however, multivariate analogs of the algorithm performed nearly
as well while reducing computational time by a factor of up to 10. In low to medium
correlation settings, less computationally intensive RF algorithms were found to be
as effective, some reducing compute times by another factor of up to 10.

5.1 Methods

5.1.1 Experimental design and data

Table 5.1 lists the nine experiments that were carried out. In each experiment, a
pre-specified target percentage of missing values was induced using one of three
different missing mechanisms (Rubin, 1976):


1. Missing completely at random (MCAR). This means the probability that an


observation is missing does not depend on the observed value or the missing
ones.

2. Missing at random (MAR). This means that the probability of missingness


may depend upon observed values, but not missing ones.

3. Not missing at random (NMAR). This means the probability of missingness


depends on both observed and missing values.

Table 5.1: Experimental design used for large scale study of RF missing data algo-
rithms.
              Missing Mechanism    Percent Missing
EXPT-A MCAR 25
EXPT-B MCAR 50
EXPT-C MCAR 75
EXPT-D MAR 25
EXPT-E MAR 50
EXPT-F MAR 75
EXPT-G NMAR 25
EXPT-H NMAR 50
EXPT-I NMAR 75

Sixty data sets were used, including both real and synthetic data. Figure 5.1 illustrates
the diversity of the data. Displayed are data sets in terms of correlation (rho),
sample size (n), number of variables (p), and the amount of information contained
in the data (I = log10(n/p)). The correlation, rho, was defined as the L2-norm of the
correlation matrix. If R = (rho_{i,j}) denotes the p x p correlation matrix for a data set,
rho was defined to equal

\[
\rho = \left( \binom{p}{2}^{-1} \sum_{j=1}^{p}\sum_{k<j} |\rho_{j,k}|^2 \right)^{1/2}. \qquad (5.1)
\]

This is similar to the usual definition of the L2-norm for a matrix, but where we
have modified the definition to remove the diagonal elements of R, which equal 1,
as well as the contribution from the symmetric lower diagonal values.
Note that in the plot for p there are 10 data sets with p in the thousands; these
are a collection of well known gene expression data sets. The right-most plot
displays the log-information of a data set, I = log10(n/p). The range of values on
the log-scale varies from -2 to 2; thus the information contained in a data set can
differ by as much as approximately 10^4.
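As a concrete check of definition (5.1), a small R function along the following lines (our own helper; the name rho_l2 and the use of pairwise-complete observations are illustrative assumptions) computes rho for a numeric data matrix.

## rho of (5.1): square root of the average squared off-diagonal correlation.
rho_l2 <- function(X) {
  R <- cor(as.matrix(X), use = "pairwise.complete.obs")
  sqrt(mean(abs(R[upper.tri(R)])^2, na.rm = TRUE))
}

rho_l2(mtcars)   # example: correlation summary for a built-in data set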

Figure 5.1: Summary values for the 60 data sets used in the large scale RF missing
data experiment. The last panel displays the log-information, I = log10(n/p), for
each data set.

5.1.2 Inducing MCAR, MAR and NMAR missingness

The following procedures were used to induce missingness in the data. Let the target
missingness fraction be 0 < g < 1. For MCAR, data was set to missing randomly
without imposing column or row constraints on the data matrix. Specifically, the
data matrix was made into a long vector and a fraction g of the entries were selected
at random and set to missing.
For MAR, missing values were assigned by column. Let X_j = (X_{1,j}, ..., X_{n,j})
be the n-dimensional vector containing the original values of the jth variable, 1 <= j <= p.
Each coordinate of X_j was made missing according to the tail behavior
of a randomly selected covariate X_k, where k != j. The probability of selecting
coordinate X_{i,j} was

\[
P\{\text{selecting } X_{i,j} \mid B_j\} \propto
\begin{cases}
F(X_{i,k}) & \text{if } B_j = 1\\
1 - F(X_{i,k}) & \text{if } B_j = 0
\end{cases}
\]

where F(x) = (1 + exp(-3x))^{-1} and the B_j were i.i.d. symmetric 0-1 Bernoulli random
variables. With this method, about half of the variables will have higher missingness
in those coordinates corresponding to the right tail of a randomly selected
variable (the other half will have higher missingness depending on the left tail of
a randomly selected variable). A total of ng coordinates were selected from X_j
and set to missing. This induces MAR, as missing values for X_j depend only on
observed values of another variable X_k.

For NMAR, each coordinate of X_j was made missing according to its own tail
behavior. A total of ng values were selected according to

\[
P\{\text{selecting } X_{i,j}\} \propto
\begin{cases}
F(X_{i,j}) & \text{with probability } 1/2\\
1 - F(X_{i,j}) & \text{with probability } 1/2.
\end{cases}
\]

Notice that missingness in X_{i,j} depends on both observed and missing values. In
particular, missing values occur with higher probability in the right and left tails of
the empirical distribution. Therefore, this induces NMAR.
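The three mechanisms can be induced with a few lines of R. The sketch below (our own helpers make_mcar, make_mar, and make_nmar, written to follow the description above rather than reproduce the experiment scripts) operates on a numeric matrix X with target fraction g.

Fdist <- function(x) 1 / (1 + exp(-3 * x))

make_mcar <- function(X, g) {
  X[sample(length(X), round(length(X) * g))] <- NA    # flatten and pick a fraction g
  X
}

make_mar <- function(X, g) {
  X0 <- X; n <- nrow(X); p <- ncol(X)
  for (j in 1:p) {
    k  <- sample(setdiff(1:p, j), 1)                  # tail behavior of another covariate
    Bj <- rbinom(1, 1, 0.5)
    w  <- if (Bj == 1) Fdist(X0[, k]) else 1 - Fdist(X0[, k])
    X[sample(n, round(n * g), prob = w), j] <- NA
  }
  X
}

make_nmar <- function(X, g) {
  X0 <- X; n <- nrow(X); p <- ncol(X)
  for (j in 1:p) {
    w <- if (rbinom(1, 1, 0.5) == 1) Fdist(X0[, j]) else 1 - Fdist(X0[, j])
    X[sample(n, round(n * g), prob = w), j] <- NA     # depends on the (possibly missing) value itself
  }
  X
}

## Example: 25% MAR missingness in a standardized numeric matrix.
X.mar <- make_mar(scale(as.matrix(mtcars)), g = 0.25)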
5.1.3 Measuring imputation accuracy

Accuracy of imputation was assessed using the following metric. As described
above, values of X_j were made missing under various missing data assumptions.
Let (1_{1,j}, ..., 1_{n,j}) be a vector of zeroes and ones indicating which values of X_j
were artificially made missing. Define 1_{i,j} = 1 if X_{i,j} is artificially missing; otherwise
1_{i,j} = 0. Let n_j = \sum_{i=1}^{n} 1_{i,j} be the number of artificially induced missing
values for X_j.

Let N and C be the set of nominal (continuous) and categorical (factor) variables
with more than one artificially induced missing value. That is,

N = {j : X_j is nominal and n_j > 1}

C = {j : X_j is categorical and n_j > 1}.

Standardized root-mean-squared error (RMSE) was used to assess performance
for nominal variables, and misclassification error for factors. Let X*_j be the n-dimensional
vector of imputed values for X_j using procedure I. Imputation error
for I was measured using

\[
E(I) = \frac{1}{\#N}\sum_{j\in N}
\sqrt{\frac{\sum_{i=1}^{n} 1_{i,j}\bigl(X_{i,j}-X^{*}_{i,j}\bigr)^2/n_j}
           {\sum_{i=1}^{n} 1_{i,j}\bigl(X_{i,j}-\bar{X}_{j}\bigr)^2/n_j}}
\;+\;
\frac{1}{\#C}\sum_{j\in C}\frac{\sum_{i=1}^{n} 1_{i,j}\,1\{X_{i,j}\ne X^{*}_{i,j}\}}{n_j},
\]

where \bar{X}_j = \sum_{i=1}^{n} (1_{i,j}X_{i,j})/n_j. To be clear regarding the standardized RMSE,
observe that the denominator in the first term is the variance of X_j over the artificially
induced missing values, while the numerator is the MSE difference of X_j
and X*_j over the induced missing values.
As a benchmark for assessing imputation accuracy we used the strawman imputation
described earlier, which we denote by s. Imputation error for a procedure I
was compared to s using the relative imputation error, defined as

\[
E_R(I) = 100 \times \frac{E(I)}{E(s)}.
\]

A value of less than 100 indicates that procedure I performs better than the strawman.
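The following R sketch (our own illustration with hypothetical helper names, not the experiment scripts) computes E(I) and E_R(I) from the original data, an imputed copy, a strawman-imputed copy, and the indicator matrix of artificially induced missing values.

## X.true: original data.frame; X.imp / X.strawman: imputed copies;
## M: logical matrix marking the artificially induced missing entries.
imputation_error <- function(X.true, X.imp, M) {
  nrmse <- mc <- c()
  for (j in seq_along(X.true)) {
    idx <- which(M[, j])
    if (length(idx) <= 1) next                         # require nj > 1
    x <- X.true[idx, j]; xhat <- X.imp[idx, j]
    if (is.numeric(x)) {
      nrmse <- c(nrmse, sqrt(mean((x - xhat)^2) / mean((x - mean(x))^2)))
    } else {
      mc <- c(mc, mean(as.character(x) != as.character(xhat)))   # misclassification
    }
  }
  (if (length(nrmse)) mean(nrmse) else 0) + (if (length(mc)) mean(mc) else 0)
}

relative_error <- function(X.true, X.imp, X.strawman, M) {
  100 * imputation_error(X.true, X.imp, M) / imputation_error(X.true, X.strawman, M)
}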

5.1.4 Experimental settings for procedures

Randomized splitting was invoked with an nsplit value of 10. For random feature
selection, mtry was set to the square root of p. For random outcome selection for RFunsv, we
set ytry to equal the square root of p. Algorithms RFotf, RFunsv and RFprx were iterated 5 times in
addition to being run for a single iteration. For mForest, the percentage of variables
used as responses was a = .05, .25. This implies that mRF0.05 used up to 20 regressions
per cycle, while mRF0.25 used 4. Forests for all procedures were grown using
a nodesize value of 1. Number of trees was set at ntree = 500. Each experimental
setting (Table 5.1) was run 100 times independently and results averaged.
For comparison, k-nearest neighbors imputation (hereafter denoted as KNN)
was applied using the impute.knn function from the R-package impute (Hastie
et al., 2015). For each data point with missing values, the algorithm determines the
k-nearest neighbors using a Euclidean metric, confined to the columns for which
that data point is not missing. The missing elements for the data point are then
imputed by averaging the non-missing elements of its neighbors. The number of
neighbors k was set at the default value k = 10. In experimentation we found


the method robust to the value of k and therefore opted to use the default setting.
Much more important were the parameters rowmax and colmax which control
the maximum percent missing data allowed in each row and column of the data
matrix before a rough overall mean is used to impute the row/column. The default
values of 0.5 and 0.8, respectively, were too low and led to poor performance in the
heavy missing data experiments. Therefore, these values were set to their maximum
of 1.0, which greatly improved performance. Our rationale for selecting KNN as
a comparison procedure was its speed, given the large scale nature of the
experiments (a total of 100 x 60 x 9 = 54,000 runs for each method). Another
reason was its close relationship to forests: RF is also
a type of nearest neighbor procedure, although an adaptive one.
We comment later on how adaptivity may give RF a distinct advantage over KNN.
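For reference, the KNN settings described above correspond to a call of the following form (impute.knn and its k, rowmax, and colmax arguments are from the impute package; the example data and the cases-in-rows orientation are our own simplifications).

## KNN imputation with k = 10 and rowmax/colmax raised to 1.0 so that no row or
## column falls back to the rough overall mean. impute.knn expects a numeric matrix.
library(impute)
X <- as.matrix(airquality)                       # numeric example matrix with NAs
X.knn <- impute.knn(X, k = 10, rowmax = 1.0, colmax = 1.0)$data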

5.2 Results

Section 5.2.1 presents results on the performance of each procedure as measured by
relative imputation error, E_R(I), and in Section 5.2.2 we discuss computational
speed.

5.2.1 Imputation Accuracy

In reporting the values for imputation accuracy, we have stratified data sets into low,
medium and high-correlation groups, where correlation, ⇢, was defined as in (5.1).
Low, medium and high-correlation groups were defined as groups whose ⇢ value
fell into the [0, 50], [50, 75] and [75, 100] percentile for correlations. Results were
stratified by ⇢ because we found it played a very heavy role in imputation per-
49

formance and was much more informative than other quantities measuring infor-
mation about a data set. Consider for example the log-information for a data set,
I = log10 (n/p), which reports the information of a data set by adjusting its sample
size by the number of features. While this is a reasonable measure, Figure 5.2 shows
that I is not nearly as effective as ⇢ in predicting imputation accuracy. The figure
displays the ANOVA effect sizes for ⇢ and I from a linear regression in which log
relative imputation error was used as the response. In addition to ⇢ and I, depen-
dent variables in the regression also included the type of RF procedure. The effect
size was defined as the estimated coefficients for the standardized values of ⇢ and
I. The two dependent variables ⇢ and I were standardized to have a mean of zero
and variance of one which makes it possible to directly compare their estimated
coefficients. The figure shows that both values are important for understanding im-
putation accuracy and that both exhibit the same pattern. Within a specific type of
missing data mechanism, say MCAR, importance of each variable decreases with
missingness of data (MCAR 0.25, MCAR 0.5, MCAR 0.75). However, while the
pattern of the two measures is similar, the effect size of ⇢ is generally much larger
than I. The only exceptions being the MAR 0.75 and NMAR 0.75 experiments, but
these two experiments are the least interesting. As will be discussed below, nearly
all methods performed poorly here.

Correlation

The imputation accuracy results are summarized in Figure 5.3 and Table 5.2; Figure 5.4
presents the same results as Figure 5.3 in a more compact format.
Figure 5.3 and Table 5.2, which have been stratified by correlation group, show
the importance of correlation for RF imputation procedures: imputation
accuracy generally improves with correlation.
Figure 5.2: ANOVA effect size for the log-information, I = log10(n/p), and correlation,
rho (defined as in (5.1)), from a linear regression using log relative imputation
error, log10(E_R(I)), as the response. In addition to I and rho, covariates
in the regression included the type of RF procedure used. ANOVA effect sizes are the
estimated coefficients of the standardized variables (standardized to have mean zero
and variance 1).
Figure 5.3: Relative imputation error, E_R(I), stratified and averaged by level
of correlation of a data set. Procedures are: RFotf, RFotf.5 (on the fly imputation
with 1 and 5 iterations); RFotfR, RFotfR.5 (similar to RFotf and RFotf.5 but using
pure random splitting); RFunsv, RFunsv.5 (multivariate unsupervised splitting with
1 and 5 iterations); RFprx, RFprx.5 (proximity imputation with 1 and 5 iterations);
RFprxR, RFprxR.5 (same as RFprx and RFprx.5 but using pure random splitting); mRF0.25,
mRF0.05, mRF (mForest imputation, with 25%, 5% and 1 variable(s) used as the response);
KNN (k-nearest neighbor imputation).

Figure 5.4: Relative imputation error for a procedure over 60 data sets averaged
over 100 runs, stratified and averaged by level of correlation of a data set. Top row
displays values by procedure; bottom row displays values in terms of the 9 different
experiments (Table 5.1). Procedures were: RFo.1, RFo.5 (on the fly imputation with
1 and 5 iterations); RFor.1, RFor.5 (similar to RFo.1 and RFo.5, but using pure
random splitting); RFu.1, RFu.5 (multivariate unsupervised splitting with 1 and 5
iterations); RFp.1, RFp.5 (proximity imputation with 1 and 5 iterations); RFpr.1,
RFpr.5 (same as RFp.1 and RFp.5, but using pure random splitting); mRF0.25,
mRF0.05, mRF (mForest imputation, with 25%, 5% and 1 variable(s) used as the
response); KNN (k-nearest neighbor imputation). See Section 5.1.4 for more details
regarding parameter settings of procedures.

Table 5.2: Relative imputation error ER (I ).


Low Correlation
MCAR MAR NMAR
.25 .50 .75 .25 .50 .75 .25 .50 .75
RFotf 89.0 93.9 96.2 89.5 94.5 97.2 96.5 97.2 100.9
RFotf.5 88.7 91.0 95.9 89.5 88.6 93.5 96.0 92.6 98.8
RFotfR 89.9 94.1 96.8 89.8 94.7 97.8 96.6 97.6 101.7
RFotfR.5 92.3 95.8 95.8 96.5 93.7 94.2 103.2 97.0 102.9
RFunsv 88.3 92.8 96.2 87.9 93.0 97.3 95.4 97.4 101.6
RFunsv.5 85.4 90.1 94.7 85.7 88.6 92.2 97.7 93.0 100.8
RFprx 91.1 92.8 96.7 89.9 88.5 90.4 91.5 92.7 99.2
RFprx.5 90.6 93.9 101.8 89.8 88.7 93.9 95.7 91.1 99.3
RFprxR 90.2 92.6 96.2 89.4 88.8 90.7 94.6 97.5 100.5
RFprxR.5 86.9 92.4 100.0 88.1 88.8 94.8 96.2 94.3 102.8
mRF0.25 87.4 92.9 103.1 88.8 89.3 99.8 96.9 92.3 98.7
mRF0.05 86.1 94.3 105.3 86.0 88.7 102.7 96.8 92.6 99.0
mRF 86.3 94.7 105.6 84.4 88.6 103.3 96.7 92.5 98.8
KNN 91.1 97.4 111.5 94.4 100.9 106.1 100.9 100.0 101.7

Medium Correlation
MCAR MAR NMAR
.25 .50 .75 .25 .50 .75 .25 .50 .75
RFotf 82.3 89.9 95.6 78.8 88.6 97.0 92.7 92.6 102.2
RFotf.5 76.2 82.1 90.0 83.4 79.1 93.4 99.6 89.1 100.8
RFotfR 83.1 91.4 96.0 80.3 90.3 97.4 92.2 96.1 105.3
RFotfR.5 82.4 84.1 93.1 88.2 84.2 95.1 112.0 97.1 104.5
RFunsv 80.4 88.4 95.9 76.1 87.7 97.5 87.3 92.7 104.7
RFunsv.5 73.2 78.9 89.3 78.8 79.0 92.4 98.8 92.8 104.2
RFprx 82.6 86.3 93.1 80.7 80.5 97.7 88.6 93.8 99.5
RFprx.5 77.1 84.1 93.3 86.5 77.0 92.1 98.1 93.7 101.0
RFprxR 81.2 85.4 93.1 80.4 82.4 96.3 89.2 97.2 101.3
RFprxR.5 76.1 80.8 92.0 82.1 77.7 95.1 102.1 96.6 105.1
mRF0.25 73.8 80.2 91.6 75.3 75.6 90.2 97.6 87.5 102.1
mRF0.05 70.9 80.1 95.2 70.1 76.6 93.0 87.4 87.9 103.4
mRF 69.6 80.1 95.0 71.3 74.6 92.4 86.9 87.8 103.1
KNN 79.8 93.5 105.3 80.2 96.0 98.7 93.9 98.3 102.1

High Correlation
MCAR MAR NMAR
.25 .50 .75 .25 .50 .75 .25 .50 .75
RFotf 72.3 83.7 94.6 65.5 83.3 98.4 66.5 84.8 100.4
RFotf.5 70.9 72.1 80.9 69.5 70.9 91.0 70.1 70.8 97.3
RFotfR 68.6 81.0 93.6 59.5 87.1 98.9 61.2 88.2 100.3
RFotfR.5 58.4 58.9 64.6 56.7 55.1 88.4 58.4 60.9 97.3
RFunsv 62.1 75.1 91.3 56.8 70.8 97.8 58.1 73.3 100.6
RFunsv.5 54.2 57.5 65.4 54.0 49.4 80.0 55.4 51.7 90.7
RFprx 75.5 82.0 88.5 70.7 72.8 94.3 70.9 74.3 102.0
RFprx.5 70.4 72.0 78.6 69.7 71.2 90.3 70.0 72.2 98.2
RFprxR 61.9 68.1 76.6 58.7 64.1 79.5 60.4 74.6 97.5
RFprxR.5 57.3 58.1 61.9 55.9 54.1 71.9 57.8 60.2 93.7
mRF0.25 57.0 57.9 63.3 55.5 50.4 70.5 56.7 50.7 87.3
mRF0.05 50.7 54.0 61.7 48.3 48.4 74.9 49.9 48.6 85.9
mRF 48.2 49.8 61.3 47.0 47.5 70.2 46.6 47.6 82.9
KNN 52.7 63.2 83.2 52.0 71.1 96.4 53.2 74.9 99.2

Over the high correlation data, the mForest algorithms were by far the best. In some cases, they achieved a relative imputation error of 50, which means their imputation error was half of the strawman's
value. Generally there are no noticeable differences between mRF (missForest) and
mRF0.05 . Performance of mRF0.25 , which uses only 4 regressions per cycle (as op-
posed to p for mRF), is also very good. Other algorithms that performed well in
high correlation settings were RFprxR.5 (proximity imputation with random splitting,
iterated 5 times) and RFunsv.5 (unsupervised multivariate splitting, iterated 5 times).
Of these, RFunsv.5 tended to perform slightly better in the medium and low correla-
tion settings. We also note that while mForest also performed well over medium
correlation settings, performance was not superior to other RF procedures in low
correlation settings, and sometimes was worse than procedures like RFunsv.5 . Re-
garding the comparison procedure KNN, while its performance also improved with
increasing correlation, performance in the medium and low correlation settings was
generally much worse than RF methods.

Missing data mechanism

The missing data mechanism also plays an important role in accuracy of RF pro-
cedures. Accuracy decreased systematically when going from MCAR to MAR and
NMAR. Except for heavy missingness (75%), all RF procedures under MCAR and
MAR were more accurate than strawman imputation. Performance in NMAR was
generally poor unless correlation was high.

Heavy missingness

Accuracy degraded with increasing missingness. This was especially true when
missingness was high (75%). For NMAR data with heavy missingness, procedures
were not much better than strawman (and sometimes worse), regardless of corre-
lation. However, even with missingness of up to 50%, if correlation was high, RF


procedures could still reduce the strawman’s error by one-half.

Iterating RF algorithms

Iterating generally improved accuracy for RF algorithms, except in the case of


NMAR data, where in low and medium correlation settings, performance some-
times degraded.

5.2.2 Computational speed

Figure 5.5 displays the log of total elapsed time of a procedure averaged over all
experimental conditions and runs, with results ordered by the log-computational
complexity of a data set, c = log10(np). The same results are also displayed in
Figure 5.7 in a more compact manner. The fastest algorithm is KNN, which is generally
3 log10 units faster, or 1000 times faster, than the slowest algorithm,
mRF (missForest). To improve clarity of these differences, Figure 5.6 displays the
computational time of each procedure relative to KNN (obtained by subtracting
the KNN log-time from each procedure's log-time). This new figure shows that
while mRF is 1000 times slower than KNN, the multivariate mForest algorithms,
mRF0.05 and mRF0.25, improve speeds by about a factor of 10. After this, the next
slowest procedures are the iterated algorithms, followed by the non-iterated
algorithms. Some of these latter algorithms, such as RFotf, are 100 times faster than
missForest, or only 10 times slower than KNN. These kinds of differences can have
a real effect when dealing with big data. We have experienced settings where OTF
algorithms can take hours to run; this means that the same data would take
missForest hundreds of hours to run, which makes its use questionable in such
settings.

5.3 Simulations

In this section we used simulations to study the performance of RF imputation as the sample
size n was varied. We wanted to answer two questions: (1) Does the relative imputation
error improve with sample size? (2) Do these values converge to the same or
to different values for the different RF imputation algorithms?

For our simulations, there were 10 variables X1, ..., X10 where the true model
was

Y = X1 + X2 + X3 + X4 + epsilon,

where epsilon was simulated independently from a N(0, 0.5) distribution. Variables X1
and X2 were correlated with a correlation coefficient of 0.96, and X5 and X6 were
correlated with value 0.96. The remaining variables were not correlated. Variables
X1, X2, X5, X6 were N(3, 3), variables X3, X10 were N(1, 1), variable X8
was N(3, 4), and variables X4, X7, X9 were exponentially distributed with mean
0.5.
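A data-generating sketch consistent with this description is given below. It is our own code, not the original simulation script: we read N(a, b) as mean a and standard deviation b, and use MASS::mvrnorm for the correlated pairs, both of which are assumptions.

## Generate one simulated data set of size n following the description above.
library(MASS)
sim_data <- function(n) {
  corr_pair <- function(mu, sd, rho) {
    Sigma <- matrix(c(sd^2, rho * sd^2, rho * sd^2, sd^2), 2)
    mvrnorm(n, mu = c(mu, mu), Sigma = Sigma)
  }
  X12 <- corr_pair(3, 3, 0.96)   # X1, X2 correlated at 0.96
  X56 <- corr_pair(3, 3, 0.96)   # X5, X6 correlated at 0.96
  X <- data.frame(
    X1 = X12[, 1], X2 = X12[, 2],
    X3 = rnorm(n, 1, 1), X4 = rexp(n, rate = 1 / 0.5),
    X5 = X56[, 1], X6 = X56[, 2],
    X7 = rexp(n, rate = 1 / 0.5), X8 = rnorm(n, 3, 4),
    X9 = rexp(n, rate = 1 / 0.5), X10 = rnorm(n, 1, 1)
  )
  X$Y <- with(X, X1 + X2 + X3 + X4 + rnorm(n, 0, 0.5))
  X
}

dat <- sim_data(500)   # example draw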
The sample size (n) was chosen to be 100, 200, 500, 1000, and 2000. Data was
made missing using the MCAR, MAR, and NMAR missing data procedures described
earlier. The percentage of missing data was set at 25%. All imputation parameters
were set to the same values as used in our previous experiments (Section 5.1.4).
Each experiment was repeated 500 times and the relative imputation
error, E_R(I), recorded in each instance. Figure 5.8 displays the mean relative
imputation error for a RF procedure and its standard deviation for each sample size setting.
Figure 5.5: Log of computing time for a procedure versus log-computational complexity
of a data set, c = log10(np).
Figure 5.6: Relative log-computing time (relative to KNN) versus log-computational
complexity of a data set, c = log10(np).
Figure 5.7: Log of computing time for a procedure versus log of dimension of a
data set, with compute times averaged over runs and experimental conditions.

As can be seen, values improve with increasing n. It is also noticeable that
performance depends upon the RF imputation method. In these simulations, the
missForest algorithm mRF0.1 appears to be best (note that p = 10, so mRF0.1 corresponds
to the limiting case missForest). Also, it should be noted that performance
of RF procedures decreases systematically as the missing data mechanism becomes
more complex. This mirrors our previous findings.

5.4 Conclusions

Being able to deal with missing data effectively is of great importance to scientists
working with real world data today. A machine learning method such as RF,
known for its excellent prediction performance and ability to handle all forms of
data, represents a potentially attractive solution to this challenging problem. However,
because no systematic study of RF procedures had been attempted in missing
Figure 5.8: Mean relative imputation error ± standard deviation from simulations
under different sample size values n = 100, 200, 500, 1000, 2000.

data settings, we undertook a large scale experimental study of various RF proce-


dures to determine which methods performed best, and under what types of settings.
What we found was that correlation played a very strong role in performance of RF
procedures. Imputation performance of all RF procedures improved with increas-
ing correlation of features. This held even with heavy levels of missing data and
in all but the most complex missing data scenarios. When there is high correlation
we recommend using a method like missForest which performed the best in such
settings. Although it might seem obvious that increasing feature correlation should
improve imputation, we found that in low to medium correlation, RF algorithms did
noticeably better than the popular KNN imputation method. This is interesting be-
cause KNN is related to RF. Both methods are a type of nearest neighbor method,
although RF is more adaptive than KNN and in fact can be more accurately de-
scribed as an adaptive nearest neighbor method. This adaptivity of RF may play a
special role in harnessing correlation in the data that may not necessarily be present
in other methods, even methods that have similarity to RF. Thus, we feel it is worth
emphasizing that correlation is extremely important to RF imputation methods.
In big data settings, computational speed will play a key role. Thus, practically
speaking, users might not be able to implement the best method possible because
computational times will simply be too long. This is the downside of a method
like missForest, which was the slowest of all the procedures considered. As one
solution, we proposed mForest (mRFa), which is a computationally more efficient
implementation of missForest. Our results showed mForest could achieve up to a
10-fold reduction in compute time relative to missForest. We believe these computational
times can be improved further by incorporating mForest directly into the
native C-library of randomForestSRC (RF-SRC). Currently mForest is run as
an external R-loop that makes repeated calls to the impute.rfsrc function in
RF-SRC. Incorporating mForest into the native library, combined with the openMP
parallel processing of RF-SRC, could make it much more attractive. However, even
with all of this, we still recommend some of the more basic OTFI algorithms, like
the unsupervised RF imputation procedures, for big data. These algorithms perform
solidly in terms of imputation and are hundreds of times faster than missForest.

5.5 Discussion

MissForest outperforms OTF imputation for the following reason. When there are
two variables (say x1 and x2, where x1 has missing values) that are highly correlated
but not important for predicting Y, x2 will not be split on early. OTF imputation will
therefore not be accurate in imputing x1; missForest will be very accurate in imputing x1,
and unsupervised imputation will be in between. This is where missForest gains its advantage.
Although the three random forest imputation algorithms (OTF, unsupervised, and
missForest) represent the three different imputation strategies mentioned above,
they essentially all partition the cases in a way that allows the correlation between two
variables to be taken advantage of. For a variable with missing values, the earlier
its correlated covariate gets split on, the better the imputation accuracy will be.
Since missForest always splits on the most correlated covariate first, its imputation
accuracy is always as good as or superior to that of OTF imputation. With OTF imputation, the
response variable determines when the most correlated covariate gets split on. The
unsupervised algorithm is in between, as the (pseudo-)response variables are selected at random.
Although it was shown that correlation can be a general guide to imputation
accuracy, the real determining factor for the imputation accuracy of a specific
variable is how well the rest of the covariates can collectively predict its missing
values. For instance, if a continuous variable needs to be imputed, the 'error rate', or
'percent of variation explained', from a regression tree with this variable as the
outcome variable is the true indicator of the imputation accuracy. In some cases, the
imputation accuracy may be high for a particular variable even when its correlation
with the rest of the covariates is low.
A user always needs to look at the pattern of the missing values before any
imputation algorithm is applied, as some pitfalls may exist. For example, if a variable
that is highly predictive of the response variable has high missingness, and it is
independent of the other covariates (meaning it cannot be predicted from them),
its imputed values will be far from accurate. As a result, inference made on this
variable with the imputed data set will be biased.
In the next chapter, we will look at some of these pitfalls in detail and offer some
possible solutions.
Chapter 6

RF Imputation on VIMP and


Minimal Depth

6.1 Introduction

Random Forest, as a machine learning method, has gained popularity in many research
fields. It is appreciated for its improvement in prediction accuracy over a
single CART tree. In addition, it is a good method for high dimensional data, especially
when complex relations exist between the predictor variables. Because of its
intrinsic ability to deal with complicated interactions, multiple methods have been
developed to perform variable selection using Random Forest. These methods are
based on two measures: the variable importance measure (VIMP) and the minimal depth
measure.
Missing values in predictor variables are often encountered in data analysis.
Although some empirical studies have been carried out to compare different imputation
methods in terms of imputation accuracy (Stekhoven and Buhlmann, 2012)
or variable selection (Genuer et al., 2010), a direct investigation of how the variable
importance and minimal depth measures are affected by different ways of
handling missing values may provide guidance on which method of handling the missing
data problem should be chosen, and shed light on how these different methods
affect the result of variable selection. In this study, we introduced missing values in
a particular variable of simulated or real-life datasets and looked at how VIMP
and minimal depth were affected by different RF imputation methods.

6.2 Approach

When an investigator intends to analyze a data set containing missing values for
the purpose of variable selection using random forest, one of three approaches
to handling the missing values can be chosen: (1) use the complete case method,
which means that any observation containing a missing value is deleted; (2)
use the built-in missing data options in the software, if available; or (3) impute
the data ahead of the analysis. The complete case method may not be a good choice,
especially when the number of predictors that contain missing values is high, since a high
percentage of observations would be deleted as a result. The built-in methods for
handling missing values are OTF imputation and proximity imputation. We compared
these with the other approaches, namely the complete case method, proximity imputation,
OTF imputation, multivariate missForest (mForest) imputation, missForest imputation, and
KNN imputation, with respect to correctly estimating the variable importance and minimal
depth. The ideal imputation method is one for which the importance scores or minimal depth
calculated from the imputed data are identical to those that could have been
calculated from the data without missing values. Therefore, we defined measures
called the relative importance score (RelImp) and the relative minimal depth score (RelMdp)
in this study, which are calculated as
\[
\text{RelImp} = \frac{\text{importance score with imputed data}}{\text{importance score with original full data}}, \qquad
\text{RelMdp} = \frac{\text{minimal depth with imputed data}}{\text{minimal depth with original full data}}.
\]

The imputation performance is considered to be good if RelImp or RelMdp is
equal to or close to 1. If RelImp is much greater than 1, or if RelMdp is much
less than 1, it implies that some variables may be falsely selected because the missing
value imputation exaggerates their importance. If RelImp is much smaller than 1,
or if RelMdp is much greater than 1, it indicates that some important variables
may be missed in variable selection because of the missing value imputation.
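A minimal R sketch of how RelImp might be computed is given below. It assumes a data frame with outcome column Y (as in the simulations that follow) and uses rfsrc() from randomForestSRC with importance = TRUE, whose importance component holds the VIMP values; the helper name rel_imp and the inline MCAR step are our own, and minimal depth for RelMdp would be extracted analogously (for example via the package's max.subtree or var.select utilities; see the package documentation).

## Relative importance score (RelImp) for a target variable when q% of its
## values are made missing (MCAR) and then imputed. Illustrative only.
library(randomForestSRC)

rel_imp <- function(dat, target = "x1", q = 0.5, impute_fun) {
  full <- rfsrc(Y ~ ., data = dat, importance = TRUE)
  vimp.full <- full$importance[target]

  dat.miss <- dat
  dat.miss[sample(nrow(dat), round(q * nrow(dat))), target] <- NA   # MCAR in target

  dat.imp <- impute_fun(dat.miss)                  # any imputation routine
  imp <- rfsrc(Y ~ ., data = dat.imp, importance = TRUE)

  unname(imp$importance[target] / vimp.full)       # RelImp
}

## Example plug-in routine (see the randomForestSRC documentation for arguments):
## rel_imp(dat, "x1", 0.5, impute_fun = function(d) impute.rfsrc(data = d))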

6.2.1 Simulation Experiment Design

Five simulations with synthesized data were carried out to study the effect of different
imputation methods on the importance measure and minimal depth. We
started with the simplest scenario:

Simulation 1: Y = x1 + 0.2 x x2 + 0.1 x x3 + 0.1 x x4

where x1, x2, x3, and e (the error term) are normally distributed with means of 2, 2, 1, 0 and standard
deviations of 2, 2, 1, 0.5, respectively, and x4 follows an exponential distribution with rate 0.5.

There are two important variables in the second simulation, and these two important
variables are not correlated and are independent of each other:

Simulation 2: Y = x1 + x2 + 0.1 x x3 + 0.1 x x4

where x1, x2, x3, x4 are defined as in Simulation 1.

In the 3rd, 4th, and 5th simulations, x1 and x2 have the same variance and mean
as in Simulations 1 and 2, but instead of being independent of each other, they are
correlated with a correlation coefficient of 0.75:

Simulation 3: Y = x1 + 0.2 x x2 + 0.1 x x3 + 0.1 x x4

Simulation 4: Y = x1 + x2 + 0.1 x x3 + 0.1 x x4

Simulation 5: Y = x1 + x2 + x3 + 0.5 x x4

Table 6.1: Summary characteristics for the Simulated models.


Correlation between x1 and x2 Important variables
Simulation 1 0 x1
Simulation 2 0 x1 , x2
Simulation 3 0.75 x1
Simulation 4 0.75 x1 , x2
Simulation 5 0.75 x1 , x2 , x3

Five thousand observations were created for each of the simulated datasets. We
first created q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent missing
values in x1 using the MCAR missing data mechanism, followed by either complete
case deletion or one of the six imputation methods.
The dataset was then analyzed using regular RF. VIMP and minimal depth for
x1 were recorded, and the relative importance score (RelImp) and relative minimal depth
score (RelMdp) were calculated and plotted.
For OTF imputation, one iteration was used. For missForest or mForest imputation,
a maximum of five iterations was used if the algorithm did not converge before
the fifth iteration. For all the imputation methods, nodesize, nsplit, and ntree (the number
of trees grown for each RF) were chosen to be 5, 20, and 250, respectively. The alpha
value was chosen to be 0.1 in the mRFa imputation method as an equivalent of
missForest imputation. As there are only four predictors in the simulation, alpha was set
to 0.5 in mRFa for mForest imputation.

6.3 Simulation results

6.3.1 Complete case analysis

As shown in Figures 6.1 to 6.5, the complete case method resulted in the same VIMP
and minimal depth as the original data with no missingness. Although some efficiency
can be lost, the complete case method gives the same random forest when
the missing mechanism is MCAR. As the percentage of missingness increases, the
standard deviation of the calculated VIMP increases.

6.3.2 When predictive variables are not correlated

In the case of Simulations 1 and 2, when the predictor variables are not correlated,
the VIMP for x1 decreases regardless of the imputation method used. As
shown in Figures 6.1 and 6.2, the relative importance score (RelImp) for x1 decreases
as the percentage of missingness in x1 increases. The magnitude of the decrease
was identical for all random forest imputation methods.

6.3.3 When the predictive variables are correlated

In Simulations 3, 4 and 5, x1, the variable containing missing values, correlates with
x2. Similar to Simulations 1 and 2, the relative importance score (RelImp) for x1
decreases as the percentage of missingness increases. However, as shown in Figures
6.3, 6.4 and 6.5, the magnitude of the decrease was smaller for the OTF and mForest
imputation methods than for the unsupervised and proximity imputation methods. KNN
imputation had the worst performance in terms of preservation of VIMP and minimal
depth. Generally speaking, better preservation of VIMP and minimal depth
corresponds to better imputation accuracy, as shown in Figures 6.3, 6.4 and 6.5.

6.3.4 VIMP change of all the variables

We then looked at the VIMP and minimal depth change for all four variables in model
5 when the missingness was in x1. That is, x1 and x2 are correlated with r = 0.75,
and Y = x1 + x2 + x3 + 0.5 x x4.
As shown in Figure 6.6, when x1 has missing values, analysis after imputation will result
in the VIMP of x1 decreasing, the VIMP of x2 increasing, and the VIMP of x3 and x4
remaining unchanged.
The adaptive method does not behave like this; the only loss is the sample size for
x1.

6.3.5 Correlation change due to imputation

To better understand the imputation effect, we further looked at the correlation
change when mForest imputation was performed, as mForest has the highest imputation
accuracy among all the RF imputation methods. The model from Simulation
5 was used.
As shown in Figure 6.8, the correlation of x1 and x2 increases, while the correlation
of x1 and Y decreases. This explains why the VIMP of x1 decreases. Although
the correlation of x2 and Y stays the same, x2 comes into the trees more to compensate for the
decrease of VIMP in x1.
6.3.6 The effect of nodesize

Noticing the change in correlations between the variables, we suspected that choosing
a larger nodesize might be beneficial. We repeated Simulation 5 with nodesize set
to 200 instead of 5. As shown in Figure 6.9, the bias in estimating VIMP was
much reduced with the adaptive and mForest algorithms. The OTF algorithm performed
especially well, resulting in very little bias even when the percent of missingness was
high.

6.3.7 How coefficient estimation is affected

We further studied how the coefficient estimates in regression are affected when the
missing values are imputed with RF or KNN imputation. We used Simulation
5 described above, that is, Y = x1 + x2 + x3 + 0.5 x x4. Missing values of
q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent were created by the
MCAR mechanism. The missing values were then imputed with the RF or KNN
imputation methods, and a regular GLM was then fit on the imputed dataset
to estimate beta1, beta2, beta3 and beta4. The ratio of the estimated to the true beta was
recorded and plotted in Figure 6.10.
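A sketch of this coefficient-ratio calculation is given below. It is our own illustration: a simple mean-imputation step stands in for the RF or KNN imputation, the error term is assumed to follow the Simulation 1 definition, and lm is used for the GLM fit.

## Ratio of estimated to true regression coefficients after imputation
## (Simulation 5: Y = x1 + x2 + x3 + 0.5*x4, with x1 and x2 correlated at 0.75).
library(MASS)
set.seed(1)
n <- 5000
X12 <- mvrnorm(n, mu = c(2, 2), Sigma = matrix(c(4, 3, 3, 4), 2))   # sd 2, corr 0.75
dat <- data.frame(x1 = X12[, 1], x2 = X12[, 2],
                  x3 = rnorm(n, 1, 1), x4 = rexp(n, rate = 0.5))
dat$Y <- with(dat, x1 + x2 + x3 + 0.5 * x4 + rnorm(n, 0, 0.5))      # error term assumed
beta.true <- c(x1 = 1, x2 = 1, x3 = 1, x4 = 0.5)

q <- 0.5                                          # 50% MCAR missingness in x1
dat.miss <- dat
dat.miss$x1[sample(n, round(q * n))] <- NA

dat.imp <- dat.miss
dat.imp$x1[is.na(dat.imp$x1)] <- mean(dat.miss$x1, na.rm = TRUE)    # imputation stand-in

beta.hat <- coef(lm(Y ~ x1 + x2 + x3 + x4, data = dat.imp))[names(beta.true)]
beta.hat / beta.true                              # coefficient ratios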

6.4 A real life data example


We used a real-life dataset on diabetes to look at how different imputation methods affect the calculation of VIMP and minimal depth. Missingness of q = (15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85) percent was created in the predictive variable BMI. The missing values were imputed using the RF imputation methods or KNN imputation. As shown in figure 6.10, the VIMP for BMI decreased, and its minimal depth increased, as the percentage of missingness increased. Among all the imputation methods, the adaptive method had only a 10 to 15 percent decrease in the calculated VIMP. missForest and mForest imputation resulted in a larger decrease in the calculated VIMP compared to the adaptive method, but a much smaller decrease compared to the unsupervised and proximity imputation methods.
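The bookkeeping for this example mirrors the simulations. A sketch is given below; `diabetes` is a placeholder for the diabetes data frame with outcome Y (treated here as continuous for simplicity), and the extraction of minimal depth through max.subtree() follows our understanding of the randomForestSRC interface.

```r
## Sketch: VIMP and minimal depth of BMI after imputing MCAR missingness in BMI.
## make.mcar() is defined in an earlier sketch; `diabetes` is a placeholder data frame.
library(randomForestSRC)
library(missForest)

db     <- diabetes
db$BMI <- make.mcar(db$BMI, p = 0.50)
db.imp <- missForest(db)$ximp

fit <- rfsrc(Y ~ ., data = db.imp, importance = TRUE)
vimp.bmi  <- fit$importance["BMI"]                # permutation VIMP of BMI
depth.bmi <- max.subtree(fit)$order[, 1]["BMI"]   # first-order minimal depth of BMI
c(VIMP = vimp.bmi, minimal.depth = depth.bmi)
```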

6.5. Conclusion

1. Generally speaking, better imputation accuracy results in better preservation of VIMP and minimal depth. In the simulation examples in this study, OTF imputation and mForest imputation had better imputation accuracy, as well as better preservation of VIMP and minimal depth, compared with unsupervised imputation or proximity imputation. All the RF imputation methods performed better than KNN imputation in terms of accuracy and inference.

2. In the simulations in this study, OTF imputation was as good as, or even slightly better than, the more computationally expensive imputation methods such as missForest or mForest at preserving the original VIMP and minimal depth.

3. Minimal depth is preserved especially well for important predictors, since such predictors split close to the root node of the tree, and we do not need to worry about cases being assigned to the wrong daughter node.

4. In some specific settings, mForest imputation may have certain unintended consequences, as it can alter the correlation among variables. When the percentage of missingness for a variable is low, the alteration of the correlation is not significant; when the percentage of missingness is high, the alteration can be significant. By choosing a larger node size, the change in correlation can be reduced. We therefore suggest that a large node size be chosen for imputation when the percentage of missingness is high.

6.6 Discussion

Although RF imputation algorithms are able to impute missing values with superior accuracy, caution is needed if the purpose of the imputation is further inference. We can foresee understandable pitfalls in two scenarios. Scenario 1: the variable containing missing values is not correlated with any other predictor. The imputation is then equivalent to a random draw from the observed values. There will be bias in the inference, and the magnitude of the bias depends on the percentage of missingness. Scenario 2: the variable containing missing values is correlated with only one other predictor. The imputation will then change the correlation (and hence the covariance matrix) between the variables. Not surprisingly, there will therefore be bias in the inference. This bias can be reduced with a larger node size.
When the variable with missing values is correlated with more than one predictor, we did not observe much correlation change between the variables in our simulations. This is because the imputed values do not lie on any one particular regression line. In this case, imputation is beneficial to the analysis.
Since the inference can be affected differently depending on the data structure and the percentage of missingness, we suggest that users of the random forest imputation algorithms study the structure of the data before imputation. We also suggest that a relatively large node size be used in the imputation. Since a large node size also improves computational speed, this is a desirable characteristic of the random forest imputation algorithms.
Figure 6.1: Simulation 1: the effect of different imputation methods on the importance measurement and the minimal depth. (Panels: standardized importance, standardized minimal depth, and standardized imputation error versus percent missing.)

Figure 6.2: Simulation 2: the effect of different imputation methods on the importance measurement and the minimal depth. (Panels: standardized importance, standardized minimal depth, and standardized imputation error versus percent missing.)

Figure 6.3: Simulation 3: the effect of different imputation methods on the importance measurement and the minimal depth. (Panels: standardized importance, standardized minimal depth, and standardized imputation error versus percent missing.)

Figure 6.4: Simulation 4: the effect of different imputation methods on the importance measurement and the minimal depth. (Panels: standardized importance, standardized minimal depth, and standardized imputation error versus percent missing.)

Figure 6.5: Simulation 5: the effect of different imputation methods on the importance measurement and the minimal depth. (Panels: standardized importance, standardized minimal depth, and standardized imputation error versus percent missing.)

Figure 6.6: The effect of different imputation methods on the importance measurement for all four variables, with missingness in x1. (Panels: standardized importance of x1, x2, x3 and x4 versus percent missing.)

Figure 6.7: The effect of different imputation methods on the minimal depth measurement for all four variables, with missingness in x1. (Panels: standardized minimal depth of x1, x2, x3 and x4 versus percent missing.)

Figure 6.8: The correlation change due to missForest imputation. (Panels: correlation of x1 with x2; correlations of x1 and x2 with Y; correlations of x3 and x4 with Y; all versus percent missing.)

Figure 6.9: Simulation 5 with nodesize of 200. (Standardized importance versus percent missing.)

Figure 6.10: The effect of different imputation methods on the importance measurement and the minimal depth with the Diabetes data. (Panels: BMI importance and BMI minimal depth relative to their original values, MCAR missingness, 500 replications, versus percent missing.)

Figure 6.11: The effect of different imputation methods on the parameter estimation in GLM. (Panels: estimates of beta1, beta2, beta3 and beta4 with missingness in x1, versus percent missing.)
Chapter 7

MESA Data Analysis

The MESA (Multi-Ethnic Study of Atherosclerosis) study was designed to study the prevalence, risk factors and sub-clinical cardiovascular disease (CVD) in a multiethnic cohort, with 10 years of follow-up for incident CHD events. A total of 6814 subjects were enrolled in the MESA study. 711 variables, including recorded variables and computed variables, such as different versions of the Framingham score, are available as potential predictors for the outcome. Although multiple outcomes are available in the MESA dataset, we focus on coronary heart disease (CHD) as the outcome in this analysis. Thirty-five observations with missing CHD outcome were removed, resulting in 6779 observations for the analysis.

7.1 How different imputation affects variable selection in MESA data analysis

7.1.1 Variables Included

In the MESA data, there are 711 variables with at least one observation (8 variables have 0 observations and were therefore deleted), including variables from clinic procedures, questionnaires, and created analytic variables (e.g. body mass index, hypertension stage, ankle-brachial index), and key Reading Center variables (e.g. total calcium score, aortic distensibility by MRI, average carotid wall thickness by ultrasound). These variables cover multiple aspects of the patients, such as medical history, anthropometry, health and life, medication information, neighborhood information, personal history, CT, ECG, lipids (blood group or NMR lipoprofile-II spectral analysis), MRI, pulse wave, ultrasound distensibility, ultrasound IMT, and physical activity.
Out of the 711 variables, 134 have more than 50 percent missing values, 114 have more than 75 percent missing values, and 58 have more than 90 percent missing values. The percent of missingness for each variable is plotted in figure 7.1.

7.1.2 Relative VIMP by different RF imputation method

To select variables that are important in predicting 10-year CHD, we first used unsupervised imputation, mForest imputation, and OTF imputation to impute the dataset. Random Survival Forest (RSF) analysis was then conducted on the imputed datasets. The top important variables in predicting 10-year CHD and their relative VIMPs are reported in table 7.1. We also compared these results with the OTF no-imputation method.
As shown in table 7.1, the total calcium volume score and the different versions of the Framingham CHD risk score were the top selected predictors in all the imputed datasets. Interestingly, with OTF imputation, three variables ranked ahead of the Framingham scores in importance for predicting the outcome, and with OTF no-imputation, six variables ranked ahead of them.
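For reference, a sketch of the RSF fit and relative-VIMP computation behind table 7.1 is shown below. The survival outcome names (chdtime, chd) and the object mesa.imp (one of the imputed datasets) are placeholders; the OTF variant is obtained by running rfsrc() directly on the unimputed data with na.action = "na.impute".

```r
## Sketch: RSF on an imputed MESA dataset and relative VIMP ranking.
library(randomForestSRC)
library(survival)

## RSF on a dataset imputed beforehand (unsupervised, mForest, ...).
fit.imp <- rfsrc(Surv(chdtime, chd) ~ ., data = mesa.imp,
                 ntree = 600, nodesize = 100, importance = TRUE)

## OTF variant: impute on the fly while growing the survival forest.
fit.otf <- rfsrc(Surv(chdtime, chd) ~ ., data = mesa,
                 na.action = "na.impute",
                 ntree = 600, nodesize = 100, importance = TRUE)

## Relative VIMP: scale each importance by the largest one (lncac in our analysis).
rel.vimp <- sort(fit.imp$importance / max(fit.imp$importance), decreasing = TRUE)
head(rel.vimp, 10)
```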

7.1.3 VIMP of TNF-alpha

We were particularly interested in how important TNF-alpha is in predicting 10-year CHD. Although there has been increasing evidence that cytokines in general, and TNF-alpha in particular, play an important role in cardiovascular disease, it is not clear whether the levels of these cytokines are predictive of long-term CHD. Out of the 134 variables with more than 50 percent missing values, some are serum cytokines such as the TNF-alpha soluble receptors, IL2, etc. For TNF-alpha, 57.7 percent of the measurements were missing. We first carried out a Random Survival Forest (RSF) analysis using only the observations with non-missing TNF-alpha measurements, and concluded that TNF-alpha was not important in predicting 10-year CHD. To rule out the possibility that the VIMP of TNF-alpha was affected by an NMAR missingness mechanism, we first imputed the dataset using mForest imputation, then coded all the missing values of TNF-alpha as 0 and carried out the RSF analysis. This is essentially missingness incorporated in attributes (MIA). As shown in table 7.1, the relative VIMP of TNF-alpha compared to the CAC score was 0.01, indicating that it is not important in predicting the outcome. This is consistent with the result from the analysis using only observations with TNF-alpha measurements. It also indicates that the missingness mechanism for TNF-alpha is not NMAR. These results were compared with the results from RSF on the imputed datasets in table 7.1. From this analysis, we are confident in concluding that TNF-alpha is not predictive of 10-year CHD risk.
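A sketch of this MIA-style check is given below. The variable names (tnfri1, chdtime, chd), the data frame mesa, and the use of missForest in place of mForest are placeholders for the actual MESA workflow.

```r
## Sketch: missingness incorporated in attributes (MIA) for tnfri1.
## Impute the other variables, then code the missing tnfri1 values as 0 so that
## "missing" is carried along as an attribute value rather than being filled in.
library(randomForestSRC)
library(survival)
library(missForest)

miss.idx <- is.na(mesa$tnfri1)
xvars    <- setdiff(names(mesa), c("chdtime", "chd"))      # predictors only
mesa.mia <- missForest(mesa[, xvars])$ximp                 # impute predictors
mesa.mia$tnfri1[miss.idx] <- 0                             # MIA coding for tnfri1
mesa.mia$chdtime <- mesa$chdtime
mesa.mia$chd     <- mesa$chd

fit <- rfsrc(Surv(chdtime, chd) ~ ., data = mesa.mia,
             ntree = 600, nodesize = 100, importance = TRUE)
fit$importance["tnfri1"] / max(fit$importance)             # relative VIMP of tnfri1
```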
Figure 7.1: The percent of missingness for all the variables in MESA data.

Table 7.1: Top importance variables and the rank of importance of tnfri1 with different imputation methods
Use Observed MIA for tnfri1 mForest Imputation
Variable Importance Variable Importance Variable Importance
lncac 1.00 lncac 1.00 lncac 1.00
Fram. Scores 0.20-0.11 Fram. Scores 0.17-0.07 Fram. Scores 0.16-0.08
maxstn1c 0.089 zmximt1c 0.051 maxint1c 0.057
youngm1 0.072 stm1 0.002 zmximt1c 0.051
totvip1 0.070 age1c 0.027 rdpedis1 0.039
pkyrs1c 0.051 pkyrs1c 0.026 lptib1 0.038
chwyrs1c 0.035 maxint1c 0.021 pkyrs1c 0.029
spp1c 0.034 lptib1 0.020 pprescp1 0.025
...(27 variables) ... abi1c 0.019 age1c 0.023
tnfri1 0.01 (10 variables) ... ...(6 variables) ...
... ... tnfri1 0.010 tnfri1 0.015
Unsupervised New OTF with Imputation OTF No Imputation
Variable Importance Variable Importance Variable Importance
lncac 1.00 lncac 1.00 lncac 1.00
Fram. Scores 0.16-0.09 Chwday1 0.225 Chwday1 0.197
lolucdp1 0.087 tfpi1 0.165 vwf1 0.180
zmximt1c 0.063 vwf1 0.156 aspeage1 0.131
rdpedis1 0.043 Fram. Scores 0.110-0.094 hrmage1c 0.103
maxint1c 0.038 pkyrs1c 0.086 tfpi1 0.094
lptib1 0.035 aspeage 0.078 pkyrs1c 0.079
pkyrx1c 0.028 stm1 0.064 Fram. Scores 0.078-0.074
...(2 variables) ... ... ... aspsage1 0.062
tnfri1 0.023 ... ... lptib1 0.062
... ... ... ... mmp31 0.057

Table 7.2: Top importance variables explanation


Variable Names Meaning Percent missing
tnfri1 Tumor Necrosis Factor-alpha soluble receptors 57.7
maxstn1c MAXIMUM CAROTID STENOSIS 1.4
youngm1 YOUNGS MODULUS 4.2
pkyrs1c PACK-YEARS OF CIGARETTE SMOKING 1.2
chwyrs1c CHEWING TOBACCO AMOUNT 0.4
spp1c SEATED PULSE PRESSURE (mmHg) 0.04
totvip1 Total Vascular Impedance(pulse wave) 7.0
lptib1 LEFT POSTERIOR TIBIAL BP (mmHg) 0.9
rdpedis1 RIGHT DORSALIS PEDIS BP (mmHg) 1.0
abi1c ANKLE-BRACHIAL INDEX (BP) 1.1
maxint1c INTERNAL CAROTID INTIMAL-MEDIAL THICKNESS (mm) 2.7
zmximt1c Z SCORE MAXIMUM IMT 1.1
lulocdp1 REASON AABP INCOMPLETE (L): UNABLE TO LOCATE DP 98.9
cgrday1 CIGARS AVERAGE number SMOKED PER DAY 90.8

7.2 Ten-year CHD prediction in the absence of the CAC score

7.2.1 CHD risk engine

Personalized care based on a patient's risk of developing atherosclerotic cardiovascular disease (ASCVD) is the safest and most effective way of treating a patient. ASCVD risk engines provide a promising way for clinicians to tailor treatment to risk by identifying patients who are likely to benefit from, or be harmed by, a particular treatment. Current guidelines on ASCVD prevention match the intensity of risk-reducing therapy to the patient's absolute risk of new or recurrent ASCVD events using ASCVD risk engines (Stone et al., 2014). For instance, the 2013 American College of Cardiology/American Heart Association (ACC/AHA) guidelines recommend that patients with a calculated pooled cohort risk score of 7.5% or greater be eligible for statin therapy for primary prevention.
A number of multivariable risk models have been developed for estimating the risk of initial cardiovascular events in healthy individuals. The original Framingham risk score, published in 1998, was derived from a largely Caucasian population of European descent (Wilson et al., 1998) using the endpoints of CHD death, nonfatal MI, unstable angina, and stable angina. The prediction variables used in the Framingham CHD risk score were age, gender, total or LDL cholesterol, HDL cholesterol, systolic blood pressure, diabetes mellitus, and current smoking status. The Framingham General CVD risk score (2008) extended the original Framingham risk score to include all of the potential manifestations and adverse consequences of atherosclerosis, such as stroke, transient ischemic attack, claudication and heart failure (HF) (D'Agostino et al., 2008). The ACC/AHA pooled cohort hard CVD risk calculator (2013) was the first risk model to include data from large populations of both Caucasian and African-American patients (Goff et al., 2014), and was developed from several cohorts of patients. The model includes the same parameters as the 2008 Framingham General CVD model but, in contrast to the 2008 Framingham model, includes only hard endpoints (fatal and nonfatal MI and stroke). However, while the calculator appears to be well calibrated in some populations similar to those for which it was developed (REGARDS), it has not been as accurate in other populations (Rotterdam) (Kavousi et al., 2014). The prediction variables used in the ACC/AHA pooled cohort hard CVD risk calculator (2013) were age, gender, total cholesterol, HDL cholesterol, systolic blood pressure, blood pressure treatment, diabetes mellitus, and current smoking; the endpoints assessed were CHD death, nonfatal MI, fatal stroke, and nonfatal stroke. Another well-known risk score is the JBS risk score (2014), which is based on the QRISK lifetime cardiovascular risk calculator and extends the assessment of risk beyond the 10-year window, allowing for the estimation of heart age and assessment of risk over longer intervals.
The MESA risk score (2015) improved the accuracy of 10-year CHD risk estimation by incorporating the CAC (coronary artery calcium) score in the algorithm, together with the traditional risk factors. Inclusion of CAC in the MESA risk score resulted in significant improvements in risk prediction (C-statistic 0.80 vs. 0.75; p < 0.0001) compared to using only the traditional risk factors. In addition, external validation in both the HNR (German Heinz Nixdorf Recall) and DHS (Dallas Heart Study) studies provided evidence of very good discrimination and calibration. The prediction variables used in the MESA risk score (2015) were age, gender, ethnicity (non-Hispanic white, Chinese American, African American, Hispanic), total cholesterol, HDL cholesterol, lipid-lowering treatment, systolic blood pressure, blood pressure treatment (yes or no), diabetes mellitus (yes or no), current smoking (yes or no), family history of myocardial infarction at any age (yes or no), and coronary artery calcium score. The endpoints assessed in the MESA risk score (2015) were CHD death, nonfatal MI, resuscitated cardiac arrest, and coronary revascularization in patients with angina. Although the MESA risk score with CAC incorporated appears to be superior in risk prediction, one problem is that the CAC score may not be available for many individuals. In this study, we examine different strategies for building a risk engine when the CAC score is not available in the testing data.

7.2.2 Methods

When the CAC score is available in the training data set but not in the testing data set, four strategies may be applied:

1. Use the Framingham variables only.

2. Use the full model (all the variables available).

3. Fit the Framingham model with the true CAC score in training; in testing, predict the CAC score from the testing data and use the predicted value in the model.

4. New method: replace the true CAC score in the training data with the predicted CAC score and proceed as in strategy 3.

The detailed steps of this new strategy are as follows (an R sketch is given after the list).

1. Build a predictive model for the CAC score using the training data, with the outcome variables removed. (Recall that the CAC score is available in the training dataset.)

2. Obtain the fitted CAC score in the training data.

3. Replace the true CAC score with the fitted CAC score in the training data.

4. Build a predictive model for the outcome (10-year CHD) using the updated training data from step 3.

5. Predict the CAC score in the testing dataset using the model from step 1.

6. Predict the outcome (CHD) using the model from step 4.
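A sketch of this strategy using randomForestSRC is shown below. The variable names (lncac, chdtime, chd), the train/test data frames, and the tuning values are placeholders; the use of out-of-bag fitted values in step 2 is our own choice.

```r
## Sketch: strategy 4 -- replace the true CAC score in training with its fitted
## value so that training and testing use the CAC score on the same footing.
library(randomForestSRC)
library(survival)

## Step 1: model (log) CAC from the other predictors, outcome removed.
xvars   <- setdiff(names(train), c("chdtime", "chd"))
cac.fit <- rfsrc(lncac ~ ., data = train[, xvars],
                 ntree = 600, nodesize = 100)

## Steps 2-3: overwrite the true CAC in training with its (OOB) fitted value.
train2       <- train
train2$lncac <- cac.fit$predicted.oob

## Step 4: build the outcome model on the updated training data.
chd.fit <- rfsrc(Surv(chdtime, chd) ~ ., data = train2,
                 ntree = 600, nodesize = 100)

## Step 5: predict the CAC score in the testing data (where lncac is unavailable).
test$lncac <- predict(cac.fit, newdata = test)$predicted

## Step 6: predict 10-year CHD in the testing data.
chd.pred <- predict(chd.fit, newdata = test)
```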

7.2.3 Results

Traditional Framingham predictors with the CAC score predict 10-year CHD better than RSF with all 696 available predictors

Using the 12 traditional Framingham predictors plus the CAC score, the error rate for 10-year CHD was found to be 22.6%, using a node size of 100. We first investigated whether the error rate could be improved by simply using all 696 available predictors in the MESA dataset. The resulting RSF had an error rate of 0.239, showing no improvement in prediction. This is not surprising, as we showed that the only strong predictors of 10-year CHD were the CAC score and the Framingham predictors, the rest being very weak predictors.
When the CAC score is not available, the error rate increased to 29.0% (0.290) using only the Framingham predictors.

When CAC score is not available

One natural scenario is that the CAC score is available in the training dataset but not in the testing dataset. Accordingly, one strategy is to include CAC in the training dataset and to predict CAC in the testing dataset when it is not available; this is a common situation in the clinic, as not all patients will undergo the CT scan needed to obtain a CAC score. Using this strategy (strategy 3), the prediction error rate was found to be 0.324, which is worse than simply ignoring CAC in both the training and testing datasets. This is not surprising, as the percentage of variance explained when predicting CAC was only 32 percent. We used a novel strategy to utilize the known CAC scores in the training dataset, which resulted in improved prediction compared to simply ignoring the CAC score.
To compare the four strategies for predicting 10-year CHD with the MESA data, the following experiment was carried out. The observations in the MESA data were assigned randomly to a training or testing set, and the CAC scores in the testing set were then removed. Each of the four strategies for predicting 10-year CHD was carried out. The experiment was repeated 100 times for each strategy, and the average of the 100 error rates was recorded. The RSF parameters used were 100 for mtry, 600 for ntree, 20 or 100 for nsplit, and 100 for nodesize. 696 variables, instead of 711, were used because we deleted all the variables created from the CAC score except one. The natural log of the CAC score was calculated and used, to be consistent with the literature. The results are shown in table 7.3.

7.2.4 Conclusions

The MESA dataset we used is unique in that it essentially has two very important variables for predicting the outcome (CHD): the CAC score and the Framingham score. In reality, the Framingham score is almost always available for a patient, while the CAC score is often not available, since it requires a CT scan, which is expensive and involves radiation. This results in a scenario where a very important variable is present in the training dataset but not in the testing dataset. We showed that simply discarding this variable is not a good strategy. We proposed a new strategy to utilize the information in this variable in the training dataset even when it is not available in the testing dataset, and we showed that the prediction error rate was improved with this new strategy.

Table 7.3: Prediction error rates using four strategies when CAC score is available
in the training, but not test set
Strategy p(training) p(testing) Error rate Error rate
(nsplit=20) (nsplit=100)
Framingham and CAC 13 13 0.223 0.228
Full model and CAC 696 696 0.234 0.247
Strategy 1 12 12 0.283 0.290
Strategy 2 695 695 0.276 0.293
Strategy 3 13 13 0.298 0.324
Strategy 4 13 13 0.263 0.263
Bibliography

Abdella, M. and Marwala, T. (2015). The use of genetic algorithms and neural networks to approximate missing data in database. International Conference on Computational Cybernetics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa, 13-16 April, pp. 207–212.

Aittokallio, T. (2009) Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform, 2(2):253–264.

Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. (1984) Classification and
Regression Trees, Belmont, California.

Breiman L (2001) Random forests. Machine Learning, 45: 5–32.

Breiman L. (2003) Manual – setting up, using, and understanding random forests
V4.0. .

Bartlett, J.W., Seaman, S.R., White, I.R., and Carpenter, J.R. (2015). Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 24(4):462–487.

D’Agostino, R.B. Sr, Vasan, R.S., Pencina, M.J., Wolf, P.A., Cobain, M., Massaro,
J.M., Kannel, W.B. (2008) General cardiovascular risk profile for use in primary
care: the Framingham Heart Study. Circulation. 117(6):743.

Devroye L., Gyorfi L., and Lugosi G. (1996) Probabilistic Theory of Pattern Recog-
nition, Springer-Verlag.

Diaz-Uriarte R., Alvarez de Andres S. (2006) Gene Selection and classification of


microarray data using random forest. BMC Bioinformatics.


Doove, L.L., Van Buuren, S., and Dusseldorp, E. (2014) Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72:92–104.

Enders C. K. (2010) Applied missing data analysis, Guilford Publications, New


York.

Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232.

Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010) Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.

Goff, D.C. Jr, Lloyd-Jones, D.M., Bennett, G., Coady, S., D’Agostino, R.B., Gib-
bons, R., Greenland, P., Lackland, D.T., Levy, D., O’Donnell, C.J., Robinson,
J.G., Schwartz, J.S., Shero, S.T., Smith, S.C. Jr, Sorlie, P., Stone, N.J., Wil-
son, P.W., Jordan, H.S., Nevo, L., Wnek, J., Anderson, J.L., Halperin, J.L.,
Albert, N.M., Bozkurt, B., Brindis, R.G., Curtis, L.H., DeMets, D., Hochman,
J.S., Kovacs, R.J., Ohman, E.M., Pressler, S.J., Sellke, F.W., Shen, W.K., Smith,
S.C. Jr, Tomaselli, G.F. (2014) 2013 ACC/AHA guideline on the assessment of
cardiovascular risk: a report of the American College of Cardiology/American
Heart Association Task Force on Practice Guidelines. Circulation. 129(25 Suppl
2):S49.

Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002) A Distribution-Free
Theory of Nonparametric Regression, Springer.

Hastie, T., Tibshirani, R., Narasimhan, B., and Chu, G. (2015) impute: Imputation for microarray data. R package version 1.34.0, http://bioconductor.org.

Hothorn, T. and Lausen, B. (2003) On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis, 43:121–137.

Ishwaran, H. (2007) Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008) Random
survival forests. Ann. Appl. Stat., 2:841–860.

Ishwaran, H., Kogalur, U.B., Gorodeski, E.Z., Minn, A.J. and Lauer, M.S. (2010).
High-dimensional variable selection for survival data. J. Amer. Stat. Assoc, 105,
205-217.

Ishwaran, H. (2015) The effect of splitting on random forests. Machine Learning


99(1):75–118.

Ishwaran, H. and Kogalur, U.B. (2016). randomForestSRC: Random Forests


for Survival, Regression and Classification (RF-SRC). R package version 2.0.5
http://cran.r-project.org.

Kavousi, M., Leening, M.J., Nanchen, D., Greenland, P., Graham, I.M., Steyerberg,
E.W., Ikram, M.A., Stricker, B.H., Hofman, A., Franco, O.H. (2014) Compari-
son of application of the ACC/AHA guidelines, Adult Treatment Panel III guide-
lines, and European Society of Cardiology guidelines for cardiovascular disease
prevention in a European cohort. JAMA. 311(14):1416.

Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3):18–22.

Liao, S.G. et al.. (2014) Missing value imputation in high-dimensional phenomic


data: imputable or not, and how? BMC Bioinformatics, 15:346.

Little, R.J.A. (1992). Regression with missing X's: a review. Journal of the American Statistical Association, 87(420):1227–1237.

Leisch, F. and Dimitriadou, E. (2009) mlbench: Machine Learning Benchmark


Problems, R package version 1.1-6

Loh, P.L. and Wainwright, M.J.. (2011) High-dimensional regression with noisy
and missing data: provable guarantees with non-convexity. Advances in Neural
Information Processing Systems, pp. 2726–2734.

Mendez, G. and Lohr, S. (2011) Estimating residual variance in random forest regression. Computational Statistics & Data Analysis, 55(11):2937–2950.

Meng, X.L. (1995). Multiple-imputation inferences with uncongenial sources of


input (with discussion), Statistical Science, 10, 538-573.

Pantanowitz, A. and Marwala, T. (2008) Evaluating the impact of missing data imputation through the use of the random forest algorithm. http://arxiv.org/ftp/arxiv/papers/0812/0812.2412.pdf. School of Electrical and Information Engineering, University of the Witwatersrand, Private Bag x3, Wits, 2050, Republic of South Africa.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63(3):581–592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.

Rubin, D.B. (1996) Multiple Imputation after 18+ Years (with discussion). Journal
of the American Statistical Association, 91:473-489.

Segal, M. and Xiao, Y. (2011) Multivariate random forests. Wiley Interdisciplinary


Reviews: Data Mining and Knowledge Discovery 1(1):80–87.

Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., and Hemingway, H. (2014) Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology, 179(6):764–774.

Stekhoven, D.J. and Buhlmann, P. (2012) MissForest—non-parametric missing


value imputation for mixed-type data. Bioinformatics, 28(1):112–118.

Stone, C.J. (1977). Consistent nonparametric regression. Ann. Stat., 8:1348–1360.

Stone, N.J., Robinson, J.G., Lichtenstein, A.H., et al. (2014) ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol, 63:2889–2934.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525.

Twala,B., Jones, M.C. and Hand, D.J. (2008) Good methods for coping with miss-
ing data in decision trees. Pattern Recognition Letters, 29(7):950–956.

Twala,B., Cartwright, M.C. (2010) Ensemble missing data techniques for software
effort prediction. Intelligent Data Analysis. 14(3):299-331.

Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3):219–242.

Waljee, A.K. et al. (2013). Comparison of imputation methods for missing labora-
tory data in medicine. BMJ Open, 3(8):e002847.

Wilson, P.W., D'Agostino, R.B., Levy, D., Belanger, A.M., Silbershatz, H., and Kannel, W.B. (1998) Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837.
