Chapter 2

Review of Related Literature

Much research effort has been devoted to the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, two simulation studies using data from the 1978 Income Survey Development Program Research Panel were carried out to compare several imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay, while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values in the original data set, and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, three measures of their effectiveness, namely the Mean Deviation, the Mean Absolute Deviation and the Root Mean Square Deviation, were obtained and averaged across the ten trials. (Kalton, 1983)
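
The evaluation procedure described above can be sketched in a few lines. This is only an illustration, assuming each deviation is the imputed value minus the actual value (so that a negative Mean Deviation indicates underestimation); the function names, the `impute` callback and the pay rates are hypothetical and are not taken from Kalton's report.

```python
import math
import random

def mean_deviation(imputed, actual):
    """Average signed error; negative values mean the imputations run low."""
    return sum(i - a for i, a in zip(imputed, actual)) / len(actual)

def mean_absolute_deviation(imputed, actual):
    """Average size of the error, ignoring its sign."""
    return sum(abs(i - a) for i, a in zip(imputed, actual)) / len(actual)

def root_mean_square_deviation(imputed, actual):
    """Like the mean absolute deviation, but large errors weigh more."""
    return math.sqrt(sum((i - a) ** 2 for i, a in zip(imputed, actual)) / len(actual))

def run_trials(values, impute, missing_rate=0.2, trials=10, seed=0):
    """Delete values at random, impute them, and average the measures over the trials.

    `impute` is a hypothetical callback: it receives the remaining observed
    values and returns one imputed value.
    """
    rng = random.Random(seed)
    n_missing = max(1, round(missing_rate * len(values)))
    measures = (mean_deviation, mean_absolute_deviation, root_mean_square_deviation)
    totals = [0.0, 0.0, 0.0]
    for _ in range(trials):
        missing = set(rng.sample(range(len(values)), n_missing))
        observed = [v for i, v in enumerate(values) if i not in missing]
        actual = [values[i] for i in sorted(missing)]
        imputed = [impute(observed) for _ in actual]
        for k, measure in enumerate(measures):
            totals[k] += measure(imputed, actual)
    return tuple(total / trials for total in totals)

# Made-up pay rates, imputed with the grand mean of the observed values.
rates = [3.5, 4.0, 4.2, 5.0, 5.5, 6.1, 7.0, 8.4, 9.9, 12.0]
md, mad, rmsd = run_trials(rates, lambda observed: sum(observed) / len(observed))
```

Averaging over the trials, as in the report, smooths out the effect of which particular values happened to be deleted in any one replication.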

For the first study, on imputing the variable Hourly Rate of Pay, eight methods were used: Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using ten imputation classes (CM10), Random Imputation with eight imputation classes (RC8), Random Imputation with ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Under the Mean Deviation criterion, all mean deviations were negative, indicating that the imputed values underestimated the actual values. Grand Mean Imputation (GM) showed the greatest underestimation among the eight procedures. For the Mean Absolute Deviation and the Root Mean Square Deviation, which measure the ability to reconstruct the deleted value, Grand Mean Imputation fared the worst on both criteria. In addition, Multiple Regression Imputation (MI) obtained the best values on the two criteria, and the procedures with the greater number of imputation classes (CM10 versus CM8, and RC10 versus RC8) yielded slightly better results on both. (Kalton, 1983)
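
To make the difference between the grand mean, class mean and random within-class procedures concrete, here is a minimal sketch. The class labels, the pay values and the function names are hypothetical; the actual imputation classes used in the report are not reproduced here.

```python
import random
from collections import defaultdict

def grand_mean(observed):
    """One value for every missing case: the mean of all respondents."""
    values = [y for _, y in observed]
    return sum(values) / len(values)

def class_means(observed):
    """One value per imputation class: the mean of that class's respondents."""
    groups = defaultdict(list)
    for cls, y in observed:
        groups[cls].append(y)
    return {cls: sum(ys) / len(ys) for cls, ys in groups.items()}

def random_within_class(observed, cls, rng):
    """The value of a randomly chosen respondent from the same class."""
    donors = [y for c, y in observed if c == cls]
    return rng.choice(donors)

# Hypothetical respondents as (imputation class, hourly rate of pay).
observed = [("A", 4.0), ("A", 6.0), ("B", 10.0), ("B", 14.0)]

gm = grand_mean(observed)             # 8.5 for every missing case
cm = class_means(observed)            # {"A": 5.0, "B": 12.0}
rc = random_within_class(observed, "A", random.Random(0))  # either 4.0 or 6.0
```

With these made-up numbers, a missing case in class A would receive 8.5 under the grand mean but 5.0 under the class mean, which shows why finer classes can bring the imputed value closer to the deleted one.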

For the second study, on the imputation of Quarterly Earnings, ten imputation procedures were used: Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using twelve imputation classes (CM12), Random Imputation with eight imputation classes (RC8), Random Imputation with twelve imputation classes (RC12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8), and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Under the first criterion, the Mean Deviation, Grand Mean Imputation (GM) showed a positive bias, which implied that it is not an effective imputation method for this study. The regression imputation procedures gave nearly identical results, producing almost unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12) had measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except the last two, DI8 and DI12. Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, Grand Mean Imputation obtained values similar to those of the regression procedures with residuals (MN and MR). The values for the RC8, RC12, MN and MR procedures were also over one third larger than those for deterministic procedures such as CM8, CM12 and MI. (Kalton, 1983)
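
The contrast between the deterministic regression procedure (MI) and the two procedures that add a residual (MN and MR) can be sketched as follows. For brevity the sketch assumes a single predictor and ordinary least squares, whereas the report used multiple regression; the data and the function names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def impute_regression(xs, ys, x_missing, mode="MI", rng=None):
    """Impute y for a case with predictor x_missing.

    MI: the predicted value only (deterministic).
    MN: prediction plus a residual drawn from a normal distribution.
    MR: prediction plus the residual of a randomly chosen respondent.
    """
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    prediction = intercept + slope * x_missing
    if mode == "MI":
        return prediction
    rng = rng or random.Random()
    if mode == "MN":
        sigma = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
        return prediction + rng.gauss(0.0, sigma)
    if mode == "MR":
        return prediction + rng.choice(residuals)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical respondents (predictor, earnings) and one nonrespondent at x = 5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
mi_value = impute_regression(xs, ys, 5.0)                         # about 9.85
mr_value = impute_regression(xs, ys, 5.0, "MR", random.Random(1))
```

Adding a residual makes each imputed value noisier, which is consistent with the larger Mean Absolute and Root Mean Square Deviations reported for MN and MR, although it preserves more of the spread of the original data.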

To investigate the relatively larger biases of the DI8 and DI12 procedures, the author further divided the data into deductive and non-deductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. The mean deviations were found to be positive in the deductive cases and negative in the non-deductive cases for all of the procedures. This explains the relatively small deviations in the previous results, since the measures for the two kinds of cases tend to cancel out. The DI8 and DI12 results were also similar to those of RC8, RC12, CM8 and CM12 in the non-deductive cases but differed greatly in the deductive cases, which explains the larger values for DI8 and DI12 in the previous results. (Kalton, 1983)
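
A small numerical illustration of this cancellation may help. The deviations below are invented for the purpose and are not taken from Kalton's tables; they are assumed to be imputed minus actual values.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical deviations (imputed minus actual) for the two kinds of cases.
deductive = [3.0, 2.0, 4.0]           # positive: overestimates
non_deductive = [-2.0, -4.0, -3.0]    # negative: underestimates
pooled = deductive + non_deductive

pooled_md = mean(pooled)                          # signed errors cancel
pooled_mad = mean([abs(d) for d in pooled])       # absolute errors do not
```

Here the pooled Mean Deviation is 0.0 while the pooled Mean Absolute Deviation is 3.0, so a small overall Mean Deviation can hide sizeable errors of opposite sign within the subgroups.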

Taken together, the two studies showed that the imputation procedures tend to overestimate the Hourly Rate of Pay and underestimate the Quarterly Earnings. They also showed that mean imputation appears to be the weakest imputation method among those studied, since it distorted the distribution of the original data. Lastly, Kalton’s study showed that increasing the number of imputation classes gives better values on the three criteria.

In contrast to Kalton’s criteria for measuring the performance of imputation procedures, the paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones presented a much simpler approach to evaluating imputation techniques: the means, standard deviations and correlation coefficients of the original data were compared with the same statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates values for the missing data by alternating an expectation step (E-step) and a maximization step (M-step). The E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the E-step generated values and then recomputes new expected values. (Musil, Warner, Yobas & Jones, 2002)
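
The alternation between the two steps can be sketched for a simple case. The sketch assumes a bivariate normal model in which a predictor x is fully observed and only y has missing values; this model, the function name and the data are assumptions made for illustration and do not describe the SPSS Missing Value Analysis implementation used by the authors.

```python
def em_impute(xs, ys, iterations=50):
    """Fill missing y values (None) by EM under a bivariate normal model."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n

    # Starting values from the complete cases only.
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    m = len(complete)
    mu_y = sum(y for _, y in complete) / m
    cx = sum(x for x, _ in complete) / m
    s_yy = sum((y - mu_y) ** 2 for _, y in complete) / m
    s_xy = sum((x - cx) * (y - mu_y) for x, y in complete) / m

    for _ in range(iterations):
        beta = s_xy / s_xx
        resid_var = max(s_yy - beta * s_xy, 0.0)  # guard against rounding below zero
        # E-step: expected y and y^2 for each case given x and current estimates.
        ey = [y if y is not None else mu_y + beta * (x - mu_x)
              for x, y in zip(xs, ys)]
        ey2 = [e * e + (0.0 if y is not None else resid_var)
               for e, y in zip(ey, ys)]
        # M-step: re-estimate the parameters from the expected values.
        mu_y = sum(ey) / n
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y
        s_yy = sum(ey2) / n - mu_y ** 2

    beta = s_xy / s_xx
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]

# Hypothetical data lying exactly on y = 2x + 1, with one rating missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, None, 11.0]
filled = em_impute(xs, ys)
```

Because the made-up data lie exactly on a line, the imputed value settles on that line after a few iterations; with real data the estimates converge more slowly and the imputed values carry the usual uncertainty.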

Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. Except for Listwise Deletion and Mean Imputation, the researchers used the SPSS Missing Value Analysis function, which handled Deterministic Regression, Stochastic Regression and the EM Method. For the correlations, the researchers obtained the correlations of the imputed variable, in the original data and under each of the five methods, with the variables age, gender and self-assessed health rating. (Musil, Warner, Yobas & Jones, 2002) Comparing the mean of the original data with the means from the five methods, the results show that all methods underestimated the mean. The closest to the original data was Stochastic Regression, followed very closely by the EM Method, Deterministic Regression, Listwise Deletion and Mean Imputation. The same ordering holds for the standard deviations. For the correlations, however, the EM Method produced the values closest to those of the original data, followed closely by Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that Stochastic Regression and the EM Method performed better, while Mean Imputation is the least effective. (Musil, Warner, Yobas & Jones, 2002)
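
The comparison of means, standard deviations and correlations can be sketched for two of the five methods. The ages and ratings below are invented, and the helper names are hypothetical; the sketch only shows how the summary statistics would be lined up against those of the original data.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def summarize(xs, ys):
    """Mean and SD of the rating, and its correlation with age."""
    return {"mean": mean(ys), "sd": sample_sd(ys), "r": correlation(xs, ys)}

# Hypothetical ages and health ratings; None marks a deleted rating.
age = [65, 68, 70, 72, 75, 78, 80, 83]
rating_full = [9.0, 8.5, 8.0, 7.0, 7.5, 6.0, 5.5, 5.0]
rating_miss = [9.0, None, 8.0, 7.0, None, 6.0, 5.5, 5.0]

pairs = [(a, r) for a, r in zip(age, rating_miss) if r is not None]
observed_mean = mean([r for _, r in pairs])
mean_imputed = [observed_mean if r is None else r for r in rating_miss]

original = summarize(age, rating_full)
listwise = summarize([a for a, _ in pairs], [r for _, r in pairs])
mean_imp = summarize(age, mean_imputed)
```

Mean Imputation reproduces the mean of the observed cases but always gives a smaller standard deviation than Listwise Deletion, since the filled-in values add no spread. This is one reason it tends to rank last in comparisons of this kind.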

In another study, entitled Imputation Methods, Simulation, Experiments and Practical Examples, Nordholt described two simulation experiments on the Hot Deck Method. The first study examined whether the Hot Deck Method performs better than leaving the records with nonresponse out of the data set when analyzing the variable, an approach known as the Available Case Method. This was done by constructing a fictitious data set of four variables, two of which were used for the imputation. Nonresponse rates of 5%, 10% and 20% were then set, and the simulation process was replicated 50 times. The data set containing the missing values was analyzed first using the Available Case Method and then using Hot Deck Imputation. As in the methodology of Musil et al., descriptive statistics such as the mean, variance and correlation were computed. In addition, the absolute differences between the original data and the Available Case results, and between the original data and the Hot Deck results, were computed. Based on these criteria, the results show that the Hot Deck Method performs better than the Available Case Method. The Hot Deck Method, while giving results closer to the original data, also tends to underestimate the values. The absolute differences were observed to increase as the percentage of missing values increases. (Nordholt, 1998)
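
The two approaches being compared can be sketched as follows. This is a simplified illustration that draws donors from all respondents rather than from imputation classes; the data, the positions of the missing values and the function names are hypothetical.

```python
import random

def available_case_mean(values):
    """Analyze only the records that have a value."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def hot_deck(values, rng):
    """Replace each missing value with the value of a randomly chosen donor."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]

# Hypothetical complete data, then the same data with 20% nonresponse.
original = [12.0, 15.0, 11.0, 18.0, 14.0, 16.0, 13.0, 17.0, 10.0, 19.0]
with_gaps = [12.0, 15.0, 11.0, None, 14.0, 16.0, 13.0, 17.0, None, 19.0]

true_mean = sum(original) / len(original)
ac_mean = available_case_mean(with_gaps)
completed = hot_deck(with_gaps, random.Random(42))
hd_mean = sum(completed) / len(completed)

# Absolute differences from the original data, the comparison used in the study.
ac_diff = abs(ac_mean - true_mean)
hd_diff = abs(hd_mean - true_mean)
```

A single run such as this says little about which method is better. The study's conclusion rests on repeating the deletion and imputation 50 times at each nonresponse rate.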

Nordholt’s second simulation study focused on the effects of covariates, otherwise known as imputation classes, on the quality of Hot Deck Imputation. Using data from the Dutch Housing Demand Survey of Statistics Netherlands, the value of the house was chosen as the variable to be imputed because of its importance and the frequency of nonresponse in that variable. For this study, the observations under category 13 (value worth at least 150,000) and category 22 (value worth at 300,000) were changed into missing values. The rationale for this choice was to ensure that the original values from these categories would not be used as replacements for the variable to be imputed, since they were no longer in the file. Imputation classes were then created once the missing values had been identified. A table of the number of respondents before and after imputation showed that in every category except 13 and 22, which had been set to missing, the number of respondents increased after the imputation. This showed that the remaining records had an equal probability of becoming a donor record for an imputation and that not all imputations give values near category 13 or 22. Nordholt also compared the Available Case Method and the Hot Deck Method on these real-life data. As in the first study, the Hot Deck Method fared better than the Available Case Method. (Nordholt, 1998)

Lastly, Nordholt addressed several questions regarding imputation. Using examples of how imputation is applied in real-life surveys such as the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the Dutch Structure of Earnings Survey, he outlined four criteria for deciding which variables to impute: the importance of a variable, the percentage of nonresponse, the predictability of the missing values and the cost of imputation. He also noted that it is important to estimate the duration of the imputation process, because the study needs to be timely. The duration, according to Nordholt, depends on the number of variables to be imputed, the available capacity, the user-friendliness of the imputation package and the desired imputation quality. These issues must be settled before conducting any imputation process and choosing the appropriate imputation strategy. (Nordholt, 1998)

Two undergraduate theses conducted similar studies on imputation. The first was by Salvino and Yu, who assessed the efficiency of Mean Imputation versus the Hot Deck Imputation Technique by applying both to the 1991 Census on Agriculture and Fisheries (CAF) data. They generated an incomplete data set using the Gauss software for the imputed variables, which were the counts of cattle, hogs and chickens. To determine which of the two techniques is better, the variances were compared, and on this basis the Hot Deck Imputation Technique was judged better. The design effect was also considered by dividing the variance under Hot Deck Imputation by the variance under Mean Imputation. Since the resulting ratio was less than one, they again concluded that the Hot Deck Imputation Technique is the better option. (Salvino and Yu, 1996)


Another undergraduate thesis, by Cheng and Sy, focused on assessing imputation techniques on clinical data. The authors employed four methods of imputation, namely Mean Imputation, Hot Deck Imputation, Linear Regression and Multiple Linear Regression. They assessed the efficacy of the imputation techniques by looking at the accuracy and precision of the estimates. Accuracy was measured by the percentage error, and the variance of these percentage errors was the basis for the precision of the estimates. The results show that Linear Regression was the best method, followed closely by Multiple Regression, then Hot Deck and finally Mean Imputation. (Cheng and Sy, 1999)
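
The accuracy and precision measures can be sketched as below. The thesis does not give the exact formulas here, so the sketch assumes that the percentage error is 100 × (imputed − actual) / actual and that precision is the sample variance of those errors; the values and the function name are hypothetical.

```python
from statistics import mean, variance

def percentage_errors(imputed, actual):
    """Signed percentage error of each imputed value (assumed definition)."""
    return [100.0 * (i - a) / a for i, a in zip(imputed, actual)]

# Hypothetical clinical measurements and their imputed replacements.
actual = [100.0, 200.0, 50.0]
imputed = [110.0, 190.0, 55.0]

errors = percentage_errors(imputed, actual)
accuracy = mean(errors)        # average percentage error
precision = variance(errors)   # spread of the percentage errors
```

A method whose errors average near zero and vary little would be judged both accurate and precise under these assumed definitions.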
