
IMPUTATION PROCEDURES FOR PARTIAL NONRESPONSE
The Case of the Family Income and Expenditure Survey (FIES)

Diana Camille B. Cortes
James Edison T. Pangan

THE PROBLEM AND ITS BACKGROUND
Statement of the Problem
• Which imputation technique is the most
appropriate for the FIES data?

• Will the imputation methods generate less
biased estimates than ignoring
nonresponse?

• How do varying nonresponse rates affect the
results for each imputation method?
Objectives of the Study
• To compare the imputation techniques, namely Overall
Mean Imputation, Hot Deck Imputation, Deterministic
Regression and Stochastic Regression, based on their
efficiency and ability to recapture the deleted values by
generating the missing values on the FIES 1997 second
visit data using the first visit data of the same survey.

• To investigate the effect of the varying rates of missing
observations, particularly the effect of 10%, 20% and
30% nonresponse rates, on the precision of the
estimates.
Scope and Limitations

• The Family Income and Expenditure Survey (FIES)
1997 was used to tackle the problem of
nonresponse and to examine the impact of the
different imputation methods.

• The researchers focused on using the FIES 1997
data on the first visit to impute the partial
nonresponse that is present on the second visit.
Scope and Limitations
• This paper also assumes that the first visit data is
complete and the pattern of nonresponse follows the
Missing Completely at Random (MCAR) case.

• Only four imputation methods will be applied for this
paper, namely: Overall Mean Imputation (OMI), Hot
Deck Imputation (HDI), Deterministic Regression
Imputation (DRI) and Stochastic Regression
Imputation (SRI).
Scope and Limitations

• This paper only covered the partial nonresponse
occurring in the National Capital Region (NCR).

• The variables imputed for this study are the
Total Income (TOTIN2) and Total Expenditures
(TOTEX2) in the second visit of the FIES 1997 data.
Scope and Limitations
• The imputation procedures will be evaluated on the basis of
the following:

1. Nonresponse Bias and Variances of the Imputed Data

2. Assessment of the Distributions of the Imputed vs. the Actual
Data

3. The criteria set in the report entitled Compensating for Missing
Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute
Deviation and the Root Mean Square Deviation.
CONCEPTUAL FRAMEWORK
Nonresponse Bias
• Nonresponse bias becomes a burden when missing
data are ignored, deleted or discarded in the
analysis of the data.

• To keep the concept clear, this section covers only
the general idea and does not discuss the types
and patterns of nonresponse.
Nonresponse Bias
• Consider a Simple Random Sample (SRS) of size n
drawn from a population of size N, where the
variable y contains missing data.

• The population is divided into two groups, namely,
the respondents and nonrespondents.
Nonresponse Bias

• Let R and M be the numbers of respondents and
nonrespondents in the population, with N = R + M.

• Let r and m, with n = r + m, be the corresponding sample
quantities.
Nonresponse Bias
• The proportions of respondents and nonrespondents in
the population are R/N and M/N, respectively.
Nonresponse Bias
• The proportions of respondents and nonrespondents in
the sample are r/n and m/n, respectively.
Nonresponse Bias
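• The population total is the sum of the respondent and
nonrespondent totals, and the population mean is the average
of the respondent and nonrespondent means weighted by
R/N and M/N.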
Nonresponse Bias
• The corresponding sample total and mean
are defined in the same way, with weights r/n and m/n.
Nonresponse Bias

• If no compensation is made, the respondent sample mean
is used to estimate the population mean. Its bias is the
difference between its expected value and the population mean.
Nonresponse Bias
The expectation of the respondent sample mean can be
obtained in two stages, first conditional on fixed r and
then over different values of r; that is, the overall
expectation is E1 applied to E2 of the respondent sample mean,
where E2 is the conditional expectation for fixed r and E1 is
the expectation over different values of r.
Nonresponse Bias
The expectation of the respondent sample mean is the population mean of the respondents.
Nonresponse Bias
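Hence, the bias of the respondent sample mean is the population proportion of nonrespondents, M/N, multiplied by the difference between the respondent and nonrespondent population means.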
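The derivation on the preceding slides can be written out symbolically as follows. This is a reconstruction from the definitions of R, M, N, r, m and n above, not a copy of the original slide equations; the symbols \bar{Y}_r and \bar{Y}_m (respondent and nonrespondent population means), \bar{y}_r (respondent sample mean), \bar{R} and \bar{M} are assumed notation.

```latex
\begin{align*}
\bar{R} &= R/N, \qquad \bar{M} = M/N, \qquad \bar{R} + \bar{M} = 1 \\
\bar{Y} &= \bar{R}\,\bar{Y}_r + \bar{M}\,\bar{Y}_m \\
E(\bar{y}_r) &= E_1\!\left[E_2(\bar{y}_r)\right] = E_1(\bar{Y}_r) = \bar{Y}_r \\
B(\bar{y}_r) &= E(\bar{y}_r) - \bar{Y}
  = \bar{Y}_r - \left(\bar{R}\,\bar{Y}_r + \bar{M}\,\bar{Y}_m\right)
  = \bar{M}\left(\bar{Y}_r - \bar{Y}_m\right)
\end{align*}
```

Under these assumptions the bias vanishes only when there are no nonrespondents or when respondents and nonrespondents have the same mean.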
Patterns of Nonresponse
• Nonresponse patterns are essential assumptions
since they influence how missing data are handled,
particularly in the implementation of the imputation
procedures used in this study.

• There are three patterns of nonresponse: Missing
Completely At Random (MCAR), Missing At
Random (MAR) and NonIgnorable Nonresponse
(NIN).
Patterns of Nonresponse

• Missing Completely At Random (MCAR) occurs if the
probability of missing data on Y is unrelated to Y itself or
to other variables in the data. Data with this kind of
nonresponse have the highest degree of randomness and show
no underlying reasons for missing observations that can
potentially lead to biased research findings.
Patterns of Nonresponse

• Missing At Random (MAR) occurs if the probability of
missing data on Y is unrelated to Y itself after controlling
for other variables in the data. This means that the likelihood
of a case having incomplete information on a variable can be
explained by other variables in the data set. Similar to the
MCAR assumption, data of this type also show some
randomness.
Patterns of Nonresponse
• NonIgnorable Nonresponse (NIN) occurs if the
probability of missing data on Y is related to the value of
Y and possibly to some other variable Z even if other
variables are controlled in the analysis. Unlike MCAR
and MAR, this type does not exhibit randomness but
rather systematic, nonrandom factors underlying the
occurrence of the missing values that are not apparent
or measured.
Types of Nonresponse
• Unit (Total) Nonresponse (UN) occurs when no
information is collected from a sampling unit.

• Reasons include:

1. Failure to contact the respondent
2. Inability to cooperate
3. Questionnaires are lost
Types of Nonresponse
• Item Nonresponse (IN) is the failure to collect complete
information because some of the questions are left unanswered.

• Reasons include:

1. The informant lacks the necessary information
2. Failure to make the effort of retrieving it from memory or
by consulting records
3. Refusal to answer because of the sensitivity of the
questions
4. Failure to record an answer
5. Responses are subsequently rejected at an edit check
Types of Nonresponse
• Partial Nonresponse (PN) is the failure to collect large sets of items
for a responding unit.
• Reasons include:
1. Failure to provide information in one or more waves of a panel survey
or later phases of a multi-phase survey data collection procedure
2. Later items in the questionnaire are left unanswered after a telephone
interview is broken off
3. Unavailability of the data after all possible checking and follow-up
4. Inconsistency of the responses that do not satisfy natural or
reasonable constraints known as edits
5. Similar causes to total nonresponse
Imputation Procedures
• Imputation is the process of replacing a missing value
through available statistical and mathematical
techniques, with a value that is considered to be a
reasonable substitute for the missing information.

• Imputation is listed as one of the many procedures that
can be used to deal with nonresponse in order to
generate unbiased results.
Imputation Procedures
• Listed below are the advantages (benefits) and disadvantages
(dangers) of using Imputation

Advantages
• Helps reduce biases in survey estimates
• Imputation makes analysis easier and presentation simpler
• Ensures consistency of the results across analyses, a feature that an
incomplete data set cannot fully provide

Disadvantages
• Biases might be greater after using imputation
• The distribution of the data might be distorted
• Falsely treating the imputed data as if it were a complete data set
Imputation Procedures
• Imputation Procedures or Methods (IM) are techniques
applied to replace missing values.

• These techniques can either implement statistical or simply
mathematical procedures like replacing an observation by a
constant value (e.g. mean).

• There are four IMs applied in this study, namely, the Overall
(Grand) Mean Imputation (OMI), Hot Deck Imputation
(HDI), Deterministic Regression Imputation (DRI) and
Stochastic Regression Imputation (SRI).
Imputation Procedures
• Imputation Class (IC) is a stratification class that divides the
data into groups before imputation takes place.
• ICs are most useful when they divide the data into
homogeneous groups.
• Variables used to define IC are called Matching Variables
(MV).
• The group of observations with a response are called donors.
• The group of observations that will be substituted by a
response are called recipients.
Imputation Procedures
• Problems might arise if ICs are not formed with caution;
one of them is determining a definite number of ICs.

• As the number of imputation classes increases, each class
contains fewer observations, so the variances within the
classes tend to be inflated.
• As the number of imputation classes decreases, each class
contains more observations, so the aggregation bias within
the classes tends to be inflated.
Imputation Procedures
Overall Mean Imputation (OMI)

• OMI simply replaces each missing value by the overall mean of
the available (responding) units in the same population.

• The overall mean is the sum of the values of the responding
units divided by the number of responding units.


Imputation Procedures
Overall Mean Imputation (OMI)

• The IC is the entire population itself.

• In much of the related literature, an IC is not a requirement
for this method and is therefore not used in performing it.
Imputation Procedures
Overall Mean Imputation (OMI)

Advantages
• Can be applied to any data set
• Easier to use and generates results faster

Disadvantages
• Distribution of the data becomes distorted
• Produces large biases and variances because it does
not allow variability in the imputation of missing
values
Imputation Procedures
Hot Deck Imputation (HDI)

• The HDI method is the process by which the missing observations
are imputed by choosing a value from the set of available
units.

• Values are selected at random (traditional hot deck), in
some deterministic way (deterministic hot deck) or by
some function of distance (nearest-neighbor hot deck).
Imputation Procedures
Hot Deck Imputation (HDI)

• In performing this method, let Y be the variable that contains
missing observations and let Xi be the ith variable that has no
missing observations. The procedure is as follows:

1. Find a set of categorical X variables that are highly associated with
Y. The X variables selected will be the matching variables in
this imputation.
2. Form a contingency table based on the X variables.
Imputation Procedures
Hot Deck Imputation (HDI)

3. If there are cases that are missing within a particular cell in the
table, select a case from the set of available units for the Y variable
and impute the chosen Y value to the missing value.

4. The donor case and the case with the missing value
must have similar or exactly the same characteristics.
Imputation Procedures
Hot Deck Imputation (HDI)
Advantages
• The shape of the distribution is preserved
• Nonexistence of out-of-range values
• Imputed values are all actual values

Disadvantages
• All X variables must be categorical
• Distortion of the distribution of the data is possible due to the
multiple use of one observation from the donor record
• IC must be limited to ensure that all missing values will have a
donor
Imputation Procedures
General Regression Imputation
• The method of imputing missing values via least-squares
regression is known as the regression imputation (RI) method.
• This technique is seen as the generalization of the group mean
imputation (GMI), another type of mean imputation other than
OMI that uses imputation classes.
• There are many ways of creating a regression model; however,
following Kalton (1983), the variable y for which imputations are
needed is regressed on the matching variables for the units
providing a response on y.
Imputation Procedures
General Regression Imputation
• A missing value may be imputed in two basic ways:
a. Using the predicted value from the model given the values
of the matching variables for the record with a missing
response, known as Deterministic Regression
Imputation (DRI)
b. Using the predicted value plus some type of randomly
chosen residual, known as Stochastic
Regression Imputation (SRI)
Imputation Procedures
General Regression Imputation
• In comparing the accuracy and efficiency of this method, it will be
helpful if the methods to be compared have the same imputation
class.

• The general model based on imputation classes is in the form:


Imputation Procedures
General Regression Imputation
Stochastic Regression Imputation

• Since the predicted value from the model
corresponds to the mean value imputation in
the restricted model, which has undesirable
distributional properties, a good case
exists for including the estimated residual.
Imputation Procedures
General Regression Imputation (GRI)
Deterministic Regression Imputation
• Distribution becomes too peaked and the variance is
underestimated.

Stochastic Regression Imputation
• Even if the deterministic predicted value is feasible, the
stochastic value need not be.
• After adding the residual, infeasible values can be the
result.
Imputation Procedures
General Regression Imputation (GRI)
Advantages
• Has the potential to produce closer imputed values

Disadvantages
• Time-consuming operation and often unrealistic to consider
its application for all items with missing values in a survey
• A high coefficient of determination is required to make
the method effective
METHODOLOGY
The Simulation Method
• The simulation method is a procedure to
create an artificial data set with missing
observations to indicate which values will be
imputed.

• The objective of creating a simulation
method is to be able to make an empirical
comparison of the statistical properties of the
estimates with imputed values.
The Simulation Method
• The algorithm for the simulation procedure
is as follows:
1. To get the number of observations to be set
to missing for each nonresponse rate, the
total number of observations was multiplied
by the indicated nonresponse rate. The
nonresponse rates used for this study were
10%, 20% and 30%.
The Simulation Method
2. A random number from the matrix of random
numbers was assigned to each observation of
the 1997 FIES second visit variables TOTIN2
and TOTEX2.

3. The second visit observations for both variables
were sorted in ascending order of their
corresponding random number.
The Simulation Method
4. The first 10% of the sorted second visit data for both
variables were selected and set as missing
observations. The same procedure was followed for the data
sets containing 20% and 30% nonresponse rates,
respectively.

5. The missing observations were flagged. This was done
to distinguish the imputed from the actual values
during the data analysis.
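The deletion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `delete_at_random`, the use of `None` to mark a missing value, the seed, and rounding the number of deletions to the nearest integer are all hypothetical choices, not details taken from the study.

```python
import random

def delete_at_random(values, rate, seed=0):
    """Set a share `rate` of `values` to missing (None) and flag them."""
    rng = random.Random(seed)
    n = len(values)
    # Step 1: number of observations to delete (rounding is an assumption).
    n_missing = round(n * rate)
    # Step 2: attach one random number to every observation.
    draws = [rng.random() for _ in range(n)]
    # Step 3: sort the observation indices by their random number.
    order = sorted(range(n), key=draws.__getitem__)
    # Step 4: the first n_missing observations in sorted order become missing.
    to_delete = set(order[:n_missing])
    data = [None if i in to_delete else v for i, v in enumerate(values)]
    # Step 5: flag the deleted observations for later analysis.
    flags = [i in to_delete for i in range(n)]
    return data, flags

# Hypothetical second visit values, for illustration only.
totin2 = [100 + 5 * i for i in range(50)]
data, flags = delete_at_random(totin2, rate=0.20, seed=1)
print(sum(flags), "values set to missing")
```

Because every observation has the same chance of deletion, this mechanism corresponds to the MCAR assumption stated in the scope of the study.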
Formation of Imputation Classes
• The steps undertaken in the formation of
imputation classes are as follows:

1. The researchers identified the potential matching
variables, which are the candidate variables that
could have an association with the variables of
interest (i.e. TOTIN and TOTEX).
Formation of Imputation Classes

2. The categorical variables from the first visit data must meet
the following criteria in order to be selected as candidate
variables:

a. The variable must be known
b. The variable must be easy to measure
c. The probability of missing observations is small
Formation of Imputation Classes

3. For the variables that have many categories, the
researchers reduced the number of categories. The
rationale is that having too many categories can increase
heterogeneity and the bias of the estimates. This was done
with the Recode function of the software Statistica.
Formation of Imputation Classes
4. Measures of association were computed for the matching
variables. The Chi-Squared test was the first test applied to
the variables, to determine whether each candidate
variable is significantly associated with the variables of interest.

5. Other measures, namely the Phi coefficient, Cramer's V and
the contingency coefficient, were also computed. From these,
the candidate variable with the greatest degree of association
was chosen as the matching variable that groups the data into
their respective imputation classes.
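The association measures named in the steps above can be computed from a contingency table as in the following sketch. It uses the textbook definitions of the chi-squared statistic, the phi coefficient, Cramer's V and the contingency coefficient; the function name `association_measures` and the example table are hypothetical, and the p-value of the chi-squared test is not computed here.

```python
import math

def association_measures(table):
    """Chi-squared statistic and derived association measures.

    `table` is a list of rows of observed counts (a contingency table).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return {
        "chi2": chi2,
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * k)),
        "contingency_c": math.sqrt(chi2 / (chi2 + n)),
    }

# Hypothetical 2 x 2 table: candidate matching variable vs. recoded income.
print(association_measures([[10, 20], [20, 10]]))
```

All three coefficients are rescalings of the same chi-squared statistic, so they usually rank candidate variables similarly when the tables have the same dimensions.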
Overall Mean Imputation
1. The overall mean for the variables of interest, TOTIN
and TOTEX, for the first visit was computed as the sum
of the first visit values divided by the number of observations.
Overall Mean Imputation

2. Using the nonresponse data sets generated, the
missing observations for the second visit variables
TOTIN and TOTEX were replaced with the overall
means of the first visit TOTIN and TOTEX.
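A minimal sketch of the two steps above, assuming missing second visit values are marked with `None`. The function name and the sample numbers are hypothetical.

```python
from statistics import mean

def overall_mean_impute(first_visit, second_visit):
    """Replace every missing second visit value with the first visit mean."""
    fill = mean(first_visit)  # step 1: overall mean of the first visit
    # Step 2: substitute the same constant for every missing observation.
    return [fill if v is None else v for v in second_visit]

# Hypothetical values, for illustration only.
print(overall_mean_impute([10, 20, 30], [12, None, 33]))
```

Because a single constant fills every gap, all imputed values fall into one frequency class, which is the distortion noted among the disadvantages of OMI.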
Hot Deck Imputation
1. The donor and recipient records for each imputation
class and variable were first identified.

2. The missing observations of the second visit were assigned
to their respective recipient records for each imputation
class, while the first visit observations were placed in their
respective donor records for each imputation class.

3. The values that were substituted for the missing
observations were randomly chosen from the donor record
for each imputation class.
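The following sketch illustrates the random hot deck steps above, assuming one imputation class label per household and `None` for missing second visit values. The function name, class labels and sample values are hypothetical; donors are drawn with replacement, so one donor value can be used more than once.

```python
import random

def hot_deck_impute(first_visit, second_visit, classes, seed=0):
    """Fill missing second visit values with a random first visit donor
    taken from the same imputation class."""
    rng = random.Random(seed)
    # Steps 1-2: build the donor record of each imputation class.
    donors = {}
    for value, label in zip(first_visit, classes):
        donors.setdefault(label, []).append(value)
    # Step 3: draw a donor value at random for each recipient.
    imputed = []
    for value, label in zip(second_visit, classes):
        imputed.append(rng.choice(donors[label]) if value is None else value)
    return imputed

# Hypothetical data: two imputation classes.
classes = [1, 1, 2, 2]
first = [10, 11, 50, 52]
second = [None, 12, None, 55]
print(hot_deck_impute(first, second, classes, seed=3))
```

Since the result depends on the random draws, repeating the call with different seeds gives the replicated data sets over which means and relative frequencies can be averaged.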
Regression Imputation

1. A logarithmic transformation was applied to the first
and second visit values of the variables TOTIN and TOTEX.
The rationale for this transformation is that the
income and expenditure variables are not normally
distributed; moreover, logarithmic transformations help
correct the non-linearity of the regression equation.
Regression Imputation

2. The formation of the regression equation was done after
the transformation. For this study, only one predictor
variable was used and the general formula for the
regression equation is:
Regression Imputation

3. For the stochastic regression, which involves the
computation of the error term, the following steps were
taken:
a. A frequency distribution of the residuals was created.

b. The class means of the frequency distribution were used
to obtain the error terms for the regression equation.
Regression Imputation
4. Model validation of the regression equations followed. This
diagnostic checking requires the following assumptions
to be satisfied:
a. Linearity
b. Normality of Error Terms
c. Independence of Error Terms
d. Constancy of Variance

5. The missing observations were replaced by the predicted
value using the corresponding regression equation.
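The regression steps above can be sketched as follows. Several points are assumptions rather than details from the study: the single predictor is taken to be the first visit value of the same variable, one equation is fitted to all records instead of one per imputation class, and the stochastic variant draws a residual at random from the observed residuals, a simplification swapped in for the class-mean residuals described above. All names are hypothetical.

```python
import math
import random

def fit_log_linear(x, y):
    """Least-squares fit of log(y) = b0 + b1 * log(x); returns residuals too."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    residuals = [b - (b0 + b1 * a) for a, b in zip(lx, ly)]
    return b0, b1, residuals

def regression_impute(first_visit, second_visit, stochastic=False, seed=0):
    """Deterministic (DRI) or stochastic (SRI) regression imputation."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(first_visit, second_visit) if y is not None]
    b0, b1, residuals = fit_log_linear([p[0] for p in pairs],
                                       [p[1] for p in pairs])
    imputed = []
    for x, y in zip(first_visit, second_visit):
        if y is None:
            log_pred = b0 + b1 * math.log(x)
            if stochastic:
                # Assumed simplification: add a randomly drawn residual.
                log_pred += rng.choice(residuals)
            y = math.exp(log_pred)  # back-transform to the original scale
        imputed.append(y)
    return imputed

# Hypothetical data in which the second visit is exactly 1.1 times the first.
print(regression_impute([100, 200, 400, 800], [110, 220, None, 880]))
```

With `stochastic=False` every record with the same predictor value receives the same imputed value, which is why the deterministic variant tends to make the distribution too peaked.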
Comparison of Imputation Techniques

• The imputation methods were compared
using the following:
a. Bias and Variance of the Estimates
b. The Distribution of Imputed vs. Actual Data
c. Kalton’s Criteria in Assessing the
Performance of Imputation Techniques
Bias and Variance of the Estimates
• The variances of the actual data and the imputed
data were obtained.

• The variances of the imputed and actual data were
compared to assess the ability of the imputation
techniques to mirror the actual data, and to
determine the effect of the varying nonresponse
rates on the performance of the imputation
techniques.
Bias and Variance of the Estimates
1. The mean of the imputed data was computed. For
hot deck and stochastic regression imputation, the
average of the means of the 1000 simulated data
sets was computed.

2. The mean of the actual data was computed.

3. The resulting bias of the mean of the imputed data
was computed by getting the difference between
(1) and (2).
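The three steps above amount to the following computation. The function name and the numbers are hypothetical; for the single-simulation methods (OMI and DRI) the list of imputed data sets would simply contain one data set.

```python
from statistics import mean

def bias_of_imputed_mean(imputed_sets, actual):
    """Average of the imputed data set means minus the mean of the actual data."""
    set_means = [mean(data) for data in imputed_sets]  # step 1
    actual_mean = mean(actual)                          # step 2
    return mean(set_means) - actual_mean                # step 3

# Hypothetical example with two simulated data sets.
print(bias_of_imputed_mean([[1, 2, 3], [3, 4, 5]], [2, 3, 4]))
```

A positive result means that the imputed data overestimate the actual mean, and a negative result means that they underestimate it.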
Bias and Variance of the Estimates
• For the overall mean and deterministic regression
imputation, the variance is zero. On the other hand, for
hot deck and stochastic regression imputation, the
variance is given by:

and
Comparing the Distribution of the
Imputed vs. Actual Data
• A goodness-of-fit test was utilized for the comparison
of the distributions.

• The Kolmogorov-Smirnov Test was used.

• The Kolmogorov-Smirnov Test is a goodness-of-fit test
concerned with the degree of agreement between the
distribution of a set of sampled (observed) values and
some specified theoretical distribution (Siegel, 1988).
Comparing the Distribution of the
Imputed vs. Actual Data

1. Income and Expenditure deciles were created. The
creation of these deciles was based on the second
visit actual FIES 1997 data.

2. The obtained deciles were used as upper bounds
of the frequency classes.
Comparing the Distribution of the
Imputed vs. Actual Data

3. A Frequency Distribution Table (FDT) for each trial
was created.

4. The FDT includes the Relative Cumulative
Frequency (RCF) for both the imputed and actual
distribution. RCFs are computed by dividing the
cumulative frequency by the total number of
observations.
Comparing the Distribution of the
Imputed vs. Actual Data
5. The absolute value of the difference of the actual
data RCF and the imputed RCF was computed.

6. The test statistic for the Kolmogorov-Smirnov Test,
which is the maximum deviation, D, was
determined as the largest of the absolute differences
in step 5: D = max |RCF(actual) - RCF(imputed)|.
Comparing the Distribution of the
Imputed vs. Actual Data
7. Since this is a large sample case and assuming a 0.05
level of significance, the critical value for this is
computed using the formula:

8. If D is less than the critical value, then it is concluded
that the imputed data maintain the same distribution
as the actual data.
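The steps of the test can be sketched as below. The class bounds, data and function names are hypothetical. The critical value formula is not recoverable from the text, so the sketch assumes the common large-sample one-sample approximation 1.36 / sqrt(n) at the 0.05 level; the coefficient is left as a parameter in case a different formula was used.

```python
import math

def relative_cumulative_frequencies(values, upper_bounds):
    """Share of observations at or below each class upper bound."""
    n = len(values)
    return [sum(v <= bound for v in values) / n for bound in upper_bounds]

def ks_statistic(actual, imputed, upper_bounds):
    """Maximum absolute difference between the two RCF columns."""
    rcf_actual = relative_cumulative_frequencies(actual, upper_bounds)
    rcf_imputed = relative_cumulative_frequencies(imputed, upper_bounds)
    return max(abs(a - b) for a, b in zip(rcf_actual, rcf_imputed))

def maintains_distribution(d, n, coefficient=1.36):
    """Compare D with an assumed large-sample critical value."""
    return d < coefficient / math.sqrt(n)

# Hypothetical data; the bounds play the role of the decile upper bounds.
actual = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
imputed = [1, 1, 1, 1, 1, 6, 7, 8, 9, 10]
bounds = [2, 4, 6, 8, 10]
d = ks_statistic(actual, imputed, bounds)
print(d, maintains_distribution(d, len(actual)))
```

Grouping the data into decile classes before comparing the cumulative frequencies follows the procedure above and is coarser than comparing the two empirical distribution functions at every observed value.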
Comparing the Distribution of the
Imputed vs. Actual Data
• To provide additional information on the
distribution of the imputed vs. actual data, the
frequency distribution of the true (deleted) values was
compared with that of the imputed values.

1. Income and Expenditure deciles were created. The deciles
that were used in the previous test were the same deciles
used here.

2. The obtained deciles were used as upper bounds of the
frequency classes.
Comparing the Distribution of the
Imputed vs. Actual Data

3. A Frequency Distribution Table (FDT) for both the
imputed values and actual values was generated.

4. For the hot deck and stochastic regression, which
had 1000 sets, the relative frequencies (RF) for
each frequency class were averaged over 1000
RFs.
Other Measures in Assessing the
Performance of the Imputation Methods

• Kalton used three criteria in assessing
the performance of the imputation
methods, namely:
a. Mean Deviation (MD)
b. Mean Absolute Deviation (MAD)
c. Root Mean Square Deviation (RMSD)
Kalton’s Criteria in Assessing the
Performance of the Imputation Methods
• The Mean Deviation (MD) measures the bias of the
imputed values. This is represented by the
equation:
Kalton’s Criteria in Assessing the
Performance of the Imputation Methods
• The Mean Absolute Deviation (MAD) is a criterion
for measuring the closeness with which the deleted
values are reconstructed. This is represented by the
equation:
Kalton’s Criteria in Assessing the
Performance of the Imputation Methods
• The Root Mean Square Deviation (RMSD) is the
square root of the mean of the squared deviations between
the imputed and actual observations. Like the MAD, it
measures the closeness with which the deleted values
are reconstructed. This is expressed as:
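The three criteria can be computed as in this sketch. It assumes that each deviation is the imputed value minus the actual (deleted) value, taken over the deleted observations only, so that a negative MD indicates underestimation; the function name and sample values are hypothetical.

```python
import math

def kalton_criteria(imputed, actual):
    """Return (MD, MAD, RMSD) for paired imputed and actual values."""
    m = len(actual)
    deviations = [i - a for i, a in zip(imputed, actual)]
    md = sum(deviations) / m                            # signed bias
    mad = sum(abs(d) for d in deviations) / m           # closeness
    rmsd = math.sqrt(sum(d * d for d in deviations) / m)
    return md, mad, rmsd

# Hypothetical deleted values and their imputations.
print(kalton_criteria([12, 8, 10], [10, 10, 10]))
```

In the example the positive and negative deviations cancel in the MD, while the MAD and RMSD still register the errors; this is why the MD is read as a measure of bias and the other two as measures of closeness.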
Determining the Best Imputation Method
• The four imputation methods were ranked to answer the
primary objective of the study. The best method was selected
separately for each variable of interest and each
nonresponse rate. The ranking covered the following:
a. Nonresponse Bias
b. Estimated Percentage of Correct Distribution
c. Mean Deviation
d. Mean Absolute Deviation
e. Root Mean Square Deviation
Determining the Best Imputation Method

1. For each criterion mentioned above, the
imputation methods were ranked on a scale of 1 to
4, with 1 indicating the best imputation method
and 4 the worst.

2. For each variable of interest and nonresponse
rate, these rankings for each criterion were
obtained and summarized.
Determining the Best Imputation Method

3. The rankings obtained by a particular
imputation method across all criteria were added.

4. The imputation method with the lowest total
was considered the best imputation
method for the respective variable of interest
and nonresponse rate.
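The ranking procedure can be sketched as below for one variable of interest and one nonresponse rate. The scores are invented for illustration and are not results from the study. The sketch assumes that every criterion has been expressed so that a smaller value is better (for example, absolute bias, or the negative of the percentage of correct distribution) and that there are no ties.

```python
def rank_methods(scores):
    """Sum per-criterion ranks (1 = best) and return the totals and the winner.

    `scores` maps criterion -> {method: value}, where smaller is better.
    """
    totals = {}
    for by_method in scores.values():
        ordered = sorted(by_method, key=by_method.get)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    best = min(totals, key=totals.get)
    return totals, best

# Hypothetical scores for two criteria.
scores = {
    "abs_bias": {"OMI": 4.0, "HDI": 3.0, "DRI": 1.0, "SRI": 2.0},
    "mad": {"OMI": 9.0, "HDI": 8.0, "DRI": 2.0, "SRI": 3.0},
}
totals, best = rank_methods(scores)
print(totals, best)
```

Summing ranks gives every criterion equal weight, so a method that is slightly worse on several criteria can lose to one that is much worse on a single criterion.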
RESULTS AND DISCUSSION
Descriptive Statistics of Second Visit Data

• Average Total Spending in NCR: Php 102,389.80
• Average Total Earnings in NCR: Php 134,119.40
• TOTIN2 has a larger mean and standard deviation than TOTEX2
• There is greater dispersion and variability in TOTIN2 than in TOTEX2
Formation of Imputation Classes
• Three candidate matching variables were selected, namely
Provincial Area Code (PROV), Recoded Education Status
(CODES1) and Recoded Total Employed Household Members
(CODEP1).

• The Chi-Squared test of association for the candidates and
the variables of interest showed that PROV, CODES1 and
CODEP1 are associated with CODIN1 and CODEX1.

• The p-values for all the candidates were less than 0.0001,
indicating that the association is very significant.
Formation of Imputation Classes

• Only the candidate matching variable CODES1
reached a minimum of 20% on all three measures of association.
• The other candidate matching variables showed weak
association.
Descriptive Statistics of the Data Grouped
into Imputation Classes (Table 5)
• The purpose of the descriptive statistics is to tell if the selected
matching variable decreases the variability of observations.

• Variability was checked by comparing the standard deviation
within each IC with the overall standard deviation of
the variables of interest.

• The first IC produced a smaller spread than the other two ICs.
The second and third ICs had larger standard deviations;
however, this is compensated by the lower standard deviation of the
first IC.
Mean of the Simulated Data (Table 6)
• When the nonresponse rate increases, the mean of the
observations deleted for both variables increases.

• When the nonresponse rate increases, the mean of the
observations retained for both variables decreases.

• The results showed that as the number of missing values
increases, the deviation between the means of the actual
and retained data slowly increases.
Regression Model Adequacy (Table 7)
• All the regression equations used for this study were able to satisfy the
model validation assumptions of linearity, normality of error terms,
independence of error terms and constancy of variance.

• The highest r2 measured is 93.2%, from the third imputation class of
TOTEX2 under the 30% nonresponse rate.

• The lowest r2 measured is 70.3%, from the first imputation class of
TOTIN2 under the 20% nonresponse rate.

• The third imputation class generated the highest r2 while the first
imputation class generated the lowest r2 for all variables of interest and
nonresponse rates.
Results for the Overall Mean Imputation
• For the nonresponse bias and variance:

– As the nonresponse rate increases for both TOTIN2 and
TOTEX2, the bias for TOTEX2 decreases more slowly
than that for TOTIN2.
– The variances for all nonresponse rates and variables of
interest are zero because the population mean of the
imputed data set is constant.
Results for the Overall Mean Imputation
• For the distribution of the imputed data:

– The OMI method failed to maintain the distribution of the
actual data.

– Since only one value was imputed, the distribution of the
data was distorted.
Results for the Overall Mean Imputation

• For the other measures of variability (i.e. Mean Deviation, Mean
Absolute Deviation and Root Mean Square Deviation):

– In the three criteria, the values for TOTEX2 increase as
the nonresponse rate increases.

– For TOTIN2, the data with 20% NRR have the highest values
for all three criteria.
Results for the Overall Mean Imputation
• For the other measures of variability (i.e. Mean Deviation, Mean
Absolute Deviation and Root Mean Square Deviation):

– For the variable TOTEX2, the Mean Deviation shows that the OMI
underestimates the actual data for the 10% and 20% NRR, in
contrast to the bias, which indicates overestimation; for the
30% NRR the reverse holds.

– For the variable TOTIN2, the Mean Deviation shows that the OMI
overestimates the actual data for the 10% and 20% NRR, in
contrast to the bias, which indicates underestimation.
Results for the Hot Deck Imputation
• For the nonresponse bias and variance:

– The bias of the population mean of the imputed data
decreases for both variables as the NRR increases.

– For the variable TOTEX2 with 30% NRR, the bias
becomes negative.

– In terms of bias, HDI performed better than OMI for the
10% and 20% NRR.
Results for the Hot Deck Imputation
• For the nonresponse bias and variance:

– The variance of the imputed data increases by more
than one hundred percent as the nonresponse rate
increases.

– The data with 10% NRR provided the smallest spread of the
population means, and the data which contained the
largest number of imputations (30% NRR) provided the
largest spread.
Results for the Hot Deck Imputation
• For the distribution of the imputed data:

– For the variable TOTIN2 with 10% and 20% nonresponse, the
HDI was able to maintain the distribution of the actual data.

– For the variable TOTEX2, the HDI was able to maintain the
distribution of the actual data for the 10% NRR.

– For both TOTIN2 and TOTEX2 under the 30% NRR, the HDI
failed to maintain the distribution of the actual data with 1%
and 0% respectively
Results for the Hot Deck Imputation
• For the other measures of variability (i.e. Mean Deviation,
Mean Absolute Deviation and Root Mean Square
Deviation):
– For the variable TOTIN2, the results for the MD show that the
values were underestimated for all NRR. The values of the MD
decrease as the NRR increases. For the MAD and RMSD, the
values obtained were unusually large compared with OMI.

– For the variable TOTEX2, the values for the MD show that the HDI
overestimates the actual data for the 10% and 20% NRR, which is
consistent with the bias, while for the 30% NRR the reverse holds.
The MAD and RMSD showed that the HDI was better than OMI.
Results for the Deterministic
Regression Imputation (DRI)
• For the nonresponse bias and variance:

– For both variables TOTIN2 and TOTEX2, the nonresponse
bias shows that the DRI underestimated the actual values.

– Unlike the results in OMI and HDI, where the bias increases
tremendously as the nonresponse rate increases, the
increase in bias for this method is much slower.
Results for the Deterministic
Regression Imputation (DRI)
• For the nonresponse bias and variance:

– For the variable TOTEX2, DRI produced more biased estimates
for all NRR than OMI and HDI.

– As with the OMI, the variance for this method is also zero
since the population mean is constant due to a single
simulation of the missing observations.
Results for the Deterministic
Regression Imputation (DRI)
• For the distribution of the imputed data:

– The DRI was able to maintain the distribution of the actual
data for all NRR and variables of interest (TOTIN2 and
TOTEX2).

– This result is contrary to previous studies which indicate that
DRI could give the same results as that of the OMI.
Results for the Deterministic
Regression Imputation (DRI)
• For the other measures of variability (i.e. Mean Deviation, Mean
Absolute Deviation and Root Mean Square Deviation):

– The MD for both TOTIN2 and TOTEX2 shows underestimation of
the actual observations for all NRR. The underestimation
is almost stable across NRR because the rate of change is very
small compared with OMI and HDI.

– The MAD and RMSD for both TOTIN2 and TOTEX2 provided
smaller values for all NRR, which shows that this method is
better than the OMI and HDI.
Results for the Stochastic Regression
Imputation (SRI)
• For the nonresponse bias and variance:
– For both TOTIN2 and TOTEX2, SRI showed no
relationship between the nonresponse rate and the nonresponse
bias of the estimated population mean. The biases fluctuate
from one nonresponse rate to another. It also showed that this
method has the least bias for the 30% NRR.

– Across all nonresponse rates, there is a large disparity
between the variances of the SRI and the HDI. Variances from
the HDI are almost ten times larger than those from the SRI.
Results for the Stochastic Regression
Imputation (SRI)
• For the distribution of the imputed data:

– For both TOTIN2 and TOTEX2, the SRI was able to maintain
the distribution of the actual data for the 10% and 30% NRR.

– For the 20% NRR of the variable TOTEX2, the SRI was better
than the HDI in retaining the distribution of the actual data.
Results for the Stochastic Regression
Imputation (SRI)
• For the other measures of variability (i.e. Mean Deviation, Mean
Absolute Deviation and Root Mean Square Deviation):

– For the variables TOTIN2 and TOTEX2, the MD fluctuates from
one NRR to another. The SRI values rank second only to the
DRI, and the SRI performed better than OMI and HDI.

– The same holds for the MAD and RMSD: the SRI
ranked second to the DRI but outperformed the OMI and HDI.
Distribution of the True Values vs.
Imputed Values
• For the OMI, under all the nonresponse rates and variables of
interest, the tables illustrate the distortion of the distribution: the
missing values, all replaced by a single value, are concentrated in a
single frequency class.
• For the HDI method, at all nonresponse rates, most of the imputed
observations clustered in the first frequency class for both
variables TOTIN2 and TOTEX2.
• Under HDI, clustering also occurred in the last frequency class
for TOTEX2 at the 10% and 30% NRR, and in the second frequency
class for TOTIN2 at all nonresponse rates.
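The concentration in a single frequency class under OMI can be reproduced on a small example. The sketch below uses hypothetical values, hypothetical class boundaries and a hypothetical helper `class_counts`; it only illustrates the mechanism and does not use the FIES frequency classes.

```python
import statistics

def class_counts(values, edges):
    """Count values per frequency class.

    Class k covers edges[k] <= v < edges[k + 1]; the last class is open-ended.
    """
    counts = [0] * len(edges)
    for v in values:
        k = sum(1 for e in edges[1:] if v >= e)
        counts[k] += 1
    return counts

# Hypothetical complete data and class boundaries.
actual = [40, 60, 90, 120, 150, 180, 220, 260, 310, 370]
edges = [0, 100, 200, 300]
missing = {1, 4, 8}  # positions deleted to simulate 30% nonresponse

# Overall mean imputation: every missing value gets the respondent mean.
respondent_mean = statistics.fmean(
    v for i, v in enumerate(actual) if i not in missing
)
omi = [respondent_mean if i in missing else v for i, v in enumerate(actual)]

actual_counts = class_counts(actual, edges)
omi_counts = class_counts(omi, edges)
print(actual_counts)  # [3, 3, 2, 2]
print(omi_counts)     # [2, 5, 2, 1]
```

All three imputed values fall in the class that contains the respondent mean, so that class grows while the classes the deleted values came from shrink.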
Distribution of the True Values vs.
Imputed Values

• For the regression imputations, both methods, at all NRRs and for
both variables of interest, produced a more spread-out distribution,
although some areas are underrepresented.

• The failure to include a random residual term in deterministic
regression resulted in a severe underrepresentation of the data,
in particular in the first frequency class, under all NRRs and
variables of interest.
Ranking the Imputation Methods
• For TOTIN2 under all NRR:
– SRI and DRI tied for first, followed by OMI and then HDI.
• For TOTEX2 under 10% NRR:
– SRI ranked first, followed by HDI, DRI and OMI.
• For TOTEX2 under 20% NRR:
– HDI and DRI tied for first, followed by SRI and OMI.
• For TOTEX2 under 30% NRR:
– SRI ranked first, followed by DRI, OMI and HDI.
Ranking the Imputation Methods
• Overall:

– The best imputation method for this study is the Stochastic
Regression Imputation using the 1997 FIES data.

– The worst imputation method for this study is the Hot Deck
Imputation.
CONCLUSION
In Summary…

• There are many considerations to weigh before using any
imputation method, such as the type of analysis and the type of
estimator of interest that will suit the analyst's purpose.

• Practical issues such as resources available, difficulty in
programming, amount of time it takes to implement each method,
and complexity of procedures should also be taken into
consideration when selecting which imputation method to use.
In Summary…

• Results show that the choice of imputation method significantly
affected the estimates of the actual data.

• The bias and variance estimates of the imputed data varied
considerably across imputation methods, and it was unexpected that
the HDI rendered the highest estimates for the majority of the
nonresponse rates and variables.

• In terms of the distribution, both regression imputation methods
retained the distribution of the data, especially the DRI.
In Summary…

• In the other tests of accuracy and precision, namely, the Mean
Deviation, Mean Absolute Deviation and Root Mean Square
Deviation, the different methods provided mixed results in all
nonresponse rates.

• After comparing and ranking the four methods, the SRI procedure
is considered the best imputation method for this study. This can
be attributed to the random residuals added to the deterministic
imputation which helped in making the estimates less biased than
its deterministic counterpart.
In Summary…

• Surprisingly and in contrast with most previous studies, the Hot
Deck Imputation method was the least effective for this study. The
selection of donors with replacement might be the cause of its
poor performance.

• Nevertheless, anyone faced with having to make decisions about
imputation procedures will usually have to choose some
compromise between what is technically effective and what is
operationally expedient.
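The point about selecting hot deck donors with replacement can be made concrete with a small sketch. It assumes a simple random hot deck within a single imputation class; the donor labels and the helper `hot_deck_donors` are hypothetical, and the study's actual donor selection procedure may differ.

```python
import random
from collections import Counter

def hot_deck_donors(donor_ids, n_recipients, rng, with_replacement=True):
    """Pick one donor per recipient from the respondents in a class."""
    if with_replacement:
        # The same donor may be picked for several recipients.
        return [rng.choice(donor_ids) for _ in range(n_recipients)]
    # Each donor is used at most once (needs n_recipients <= len(donor_ids)).
    return rng.sample(donor_ids, n_recipients)

# Hypothetical respondents in one imputation class.
donor_ids = ["d1", "d2", "d3", "d4", "d5", "d6"]
rng = random.Random(1997)

reuse = Counter(hot_deck_donors(donor_ids, 6, rng, with_replacement=True))
no_reuse = Counter(hot_deck_donors(donor_ids, 6, rng, with_replacement=False))

max_use_with = max(reuse.values())
max_use_without = max(no_reuse.values())
print(max_use_with, max_use_without)
```

When donors are drawn with replacement, a single respondent's value can be copied into several missing records, which repeats that value in the imputed data; drawing without replacement caps each donor at one use.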
RECOMMENDATIONS
Recommendations for Further Research

• Explore the use and effectiveness of the Multiple Imputation
Method.

• Implement the use of proper variance estimation methods (e.g.
Jackknife Variance Estimation).

• For selecting a matching variable, advanced modern statistical
methods like the Chi-squared Automatic Interaction Detector
(CHAID) analysis can be utilized for further studies.
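As a pointer for the jackknife recommendation above, the plain delete-one jackknife for the variance of a sample mean can be sketched as below. The data and the function name `jackknife_variance_of_mean` are hypothetical, and this is the textbook version for complete data under simple random sampling; a version suited to imputed data would also need to adjust the imputed values within each replicate, which is not shown here.

```python
import statistics

def jackknife_variance_of_mean(values):
    """Delete-one jackknife estimate of the variance of the sample mean."""
    n = len(values)
    total = sum(values)
    # Mean of the sample with each observation left out in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    center = statistics.fmean(leave_one_out)
    return (n - 1) / n * sum((m - center) ** 2 for m in leave_one_out)

# Hypothetical complete sample.
sample = [105.0, 98.0, 190.0, 131.0, 77.0, 149.0, 118.0]

jk_variance = jackknife_variance_of_mean(sample)
classic_variance = statistics.variance(sample) / len(sample)
print(jk_variance, classic_variance)
```

For the sample mean the jackknife reproduces the usual s²/n estimate, which makes it a convenient check before applying the same leave-one-out idea to estimators that have no closed-form variance.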
Recommendations for Further Research

• For regression imputation, instead of creating models for each
imputation class, dummy variables should be inserted in the
model.

• Also, use a statistical package that can generate imputations
faster and more easily, in order to save time lost to debugging
and to computer crashes caused by memory overload.
THANK YOU VERY MUCH!!!

Diana Camille B. Cortes


James Edison T. Pangan
