Table 1 shows the descriptive statistics of the second-visit variables of interest
(VI). These were computed to give a brief idea of how much a household spends and
earns in a period of time, to measure the differences in the statistics between the two
variables, and to allow comparison with the results of other tests later on. These descriptive statistics
(DS) will be also used in comparing the results of imputation classes (IC), how well the
Table 1
Descriptive Statistics
The average total spending of a household in the National Capital Region (NCR)
is about P102,389.8, while the average total earnings amount to P134,119.4, a difference
of more than thirty thousand pesos. Observations of TOTIN2 are larger and more
dispersed than those of TOTEX2, as shown by its larger mean and standard deviation, respectively.
The dispersion can also be seen from the minimum and maximum of the two
variables. The range of TOTIN2, which is more than four million pesos, against the
range of TOTEX2, which is about one million lower, may also be a sign of the
Table 2 shows the results of the chi-square test, which was done to determine whether
the candidate matching variables (MV) are associated with the VIs. As stated in the
methodology, an MV must be highly correlated with the variables of interest. The first-visit VIs,
TOTIN1 and TOTEX1, were grouped into four categories (CODIN1 and CODEX1) in order to satisfy the
assumptions of the association tests. The first-visit VIs were used as the variables to be
tested for association rather than the second-visit VIs, since the second-visit VIs already
The candidate MVs tested were the provincial area codes
(PROV), the recoded education status (CODES1) and the recoded total number of employed household
members (CODEP1). PROV has four categories, CODES1 has three, and CODEP1
also has four. Originally, CODES1 and CODEP1 had many more categories
(more than 60 and 7 categories, respectively). Because the original MVs had
so many categories, the MVs were recoded and further
The resulting number for each candidate is the χ² test statistic, and below it is its p-value.
Table 2

            CODIN1           CODEX1
PROV        χ² = 151.78      χ² = 137.83
            (<0.0001)        (<0.0001)
CODES1      χ² = 613.859     χ² = 687.342
            (<0.0001)        (<0.0001)
CODEP1      χ² = 358.436     χ² = 193.132
            (<0.0001)        (<0.0001)
The chi-square test of association between the candidates and the variables of interest
showed that PROV, CODES1 and CODEP1 are all associated with CODIN1 and CODEX1. In
fact, the p-values for all the candidates were less than 0.0001, indicating that the
determine which of the three candidates will be chosen as the MV of the study. The chi-
Table 3 shows the other measures of association, namely the phi coefficient, Cramér's
V and the contingency coefficient. These were computed in order to assess the degree of
Table 3
Tests of Association:
Degree of association
The table above displays the degree of association between the candidates and the
variables of interest. All the measures showed weak association.
In real, complex data, the association between variables tends to be small or even
absent. Across these other tests of association, only CODES1 reached the
minimum of twenty percent required for use in dividing the data into imputation classes. This
result is sufficient to select CODES1 as the MV for this data.
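The strength measures named above can all be derived from the χ² statistic. The following is a minimal sketch, assuming the standard textbook definitions of the phi coefficient, Cramér's V and the contingency coefficient (the study's own formulas are not shown in this section). The sample size n = 4130 used in the illustration is also an assumption, taken from the valid N column of Table 4.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Assumes every expected cell count is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def phi(chi2, n):
    return math.sqrt(chi2 / n)

def cramers_v(chi2, n, n_rows, n_cols):
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

def contingency_coefficient(chi2, n):
    return math.sqrt(chi2 / (chi2 + n))

# Illustration with the Table 2 statistic for CODES1 (3 categories)
# by CODIN1 (4 categories). n = 4130 is an assumption, not a reported value.
v = cramers_v(613.859, 4130, 3, 4)
print(round(v, 3))  # about 0.27 under these assumptions
```

Because all three measures are monotone functions of χ² for a fixed table size, they rank the candidates consistently when the candidates have the same number of categories; Cramér's V additionally adjusts for the table dimensions.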
statistics for each imputation class was performed. Table 4 below presents the descriptive
statistics of each imputation class of the data. The descriptive statistics will tell whether the best
MV decreases the variability of the observations. To check the variability of each
imputation class, the standard deviation will be used and compared with the value from
Table 4

           Mean        Minimum     Maximum     Std. Dev.   Valid N
TOTIN2
  IC1       93588.32    9067.000   1340900      75619.52    2635
  IC2      186940.9    14490.00    4215480     281852.3     1434
  IC3      643191.2    54790.00    4357180     829409.3       61
TOTEX2
  IC1       74866.68    9025.000    731937.0    47517.69    2635
  IC2      135510.8    13575.00    3203978     151984.3     1434
  IC3      413184.0    40505.00    2726603     532577.1       61
The table above shows that, for both VIs, IC1, which has the largest
number of observations, produced a smaller spread than the other two ICs. IC2 and
IC3 produced large standard deviations; however, these are offset by the low value
from IC1, which holds the largest proportion of the data. The
standard deviation and the mean of IC3 may be large because the majority of the extreme values
Table 5 shows the result of the means in both VIs under the varying rates of
nonresponse rate on the population mean ignoring the missing values. More importantly,
the results below will become input in the comparison of the estimates from the imputed
Table 5

TOTIN2
        Retained                Set to nonresponse
NRR     N       Mean            N       Mean
10%     3717    134821.662       413    127799.121
20%     3304    133624.722       826    136098.155
30%     2891    130685.596      1239    142131.636
The means of the observations set to nonresponse and of the observations retained
showed contrasting results. As the nonresponse rate gets larger for both sets, the mean
possibility that large values were set to nonresponse that increased the means of the data
Comparing the means across the varying nonresponse rates under each VI, the results
show little difference between the population mean ignoring the missing
data and the population mean of the actual data. However, as in the description
above, as the number of missing values increases, the deviation between the means of the
(NRRs) that were checked for adequacy. The columns are represented as follows: (a) VI,
(b) the nonresponse rate (NRR), (c) IC, (d) the prediction model, (e) the coefficient of
data sets. The highest r² in Table 6 is 93.2%, the coefficient of determination for the
third imputation class of the TOTEX2 variable under the highest NRR, while the lowest is
70.3%, the coefficient of determination for the first imputation class of the TOTIN2 variable
under 20% NRR. It is interesting to note that for all NRRs and VIs, the third IC generated
the highest r² among the ICs. The lowest r² among the models for the third imputation
class is 88.8%, which is from the 30% NRR of the TOTEX2 variable. In contrast to the
third IC, the first IC generated the lowest r² for all NRRs and VIs. (For the other
In the evaluation of the different imputation methods (IMs), the results of each IM are
discussed independently. For each IM, the discussion of results proceeds as follows: (1)
nonresponse bias and variance of the estimates of the population mean of the imputed data, (2)
distribution of the imputed data using the Kolmogorov-Smirnov Goodness of Fit Test,
and (3) other measures of variability using the mean deviation (MD), mean absolute
The table of results will contain the following columns: (a) VI, (b) NRR, (c) the
bias of the population mean of the imputed data, Bias(y'), (d) the variance of the
population mean of the imputed data, Var(y'), (e) the estimated percentage of imputed
data sets with the correct distribution relative to the actual data set (PCD), (f) Mean Deviation
(MD), (g) Mean Absolute Deviation (MAD) and (h) Root Mean Square Deviation
(RMSD).
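The criteria in columns (c), (f), (g) and (h) can be sketched as follows. This is an illustration only: it assumes the usual definitions (bias as the mean of the completed data minus the mean of the actual data; MD, MAD and RMSD as the mean, mean absolute and root mean square of the differences between imputed and deleted true values). The function names are hypothetical and the study's methodology chapter remains the authoritative source for the exact formulas.

```python
import math

def mean(values):
    return sum(values) / len(values)

def bias_of_mean(completed, actual):
    """Mean of the completed (imputed) data set minus mean of the actual one."""
    return mean(completed) - mean(actual)

def deviation_measures(imputed, true):
    """MD, MAD and RMSD over the cases that were set to nonresponse."""
    diffs = [iv - tv for iv, tv in zip(imputed, true)]
    md = mean(diffs)                               # signed: > 0 means overestimation
    mad = mean([abs(d) for d in diffs])            # average size of the error
    rmsd = math.sqrt(mean([d * d for d in diffs])) # penalizes large errors more
    return md, mad, rmsd

# Hypothetical values, for illustration only.
true_values = [100.0, 200.0, 300.0]
imputed_values = [110.0, 190.0, 330.0]
md, mad, rmsd = deviation_measures(imputed_values, true_values)
```

Under these definitions a positive MD means the imputed values overestimate the deleted true values on average, while MAD and RMSD measure how far individual imputations fall from the truth regardless of sign.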
Overall Mean Imputation
Table 7 shows the results of the different criteria in evaluating the newly created
data with imputations using the overall mean imputation (OMI) method.
Table 7

(a) VI   (b) NRR   (c) Bias(y')   (d) Var(y')   (e) PCD   (f) MD   (g) MAD   (h) RMSD
In column (c) of Table 7, the results show that the nonresponse bias decreases in value
for both VIs as the nonresponse rate increases. The decrease in
TOTIN2 was faster and more dramatic than in TOTEX2. In TOTIN2, the
decrease in value is almost five hundred percent under the twenty percent
NRR, and under the highest NRR it is almost triple the decrease under the twenty percent
NRR. In contrast, the decrease of the bias for TOTEX2
is much slower. The biases under the twenty and thirty percent NRRs for TOTIN2 are more than 6
The variance for all NRRs and VIs is zero because the population mean of the
imputed data set is constant. Unlike hot deck imputation (HDI3) and stochastic
regression imputation (SRI3), the data were not simulated one thousand times. Because
only a single simulation was run, the OMI method did not create a sampling distribution
for the mean of the imputed data.
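A minimal sketch of overall mean imputation illustrates why repeating the procedure cannot produce any variance. It assumes that the imputed value is the mean of the responding cases of the same variable; the study may have used a different overall mean, and the function name is hypothetical.

```python
def overall_mean_impute(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    overall_mean = sum(observed) / len(observed)
    return [overall_mean if v is None else v for v in values]

# Hypothetical data: None marks a value set to nonresponse.
data = [50.0, None, 70.0, 90.0, None]
first = overall_mean_impute(data)
second = overall_mean_impute(data)
print(first == second)  # the procedure has no random component
```

Since every run yields the same completed data set, the mean of the imputed data is a constant and its variance across runs is zero.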
Results in column (e) of Table 7 show that, for all nonresponse rates and
variables, the OMI method failed to maintain the distribution of the actual data. This was
expected, primarily because in every data set with missing
data, the missing observations were replaced by a single value, which is the overall mean
Results from other studies stated that the OMI is one of the worst among all
are obviously made. Cases that differ significantly from the imputed values were the primary
cause of inaccuracy. Also, imputing only a single value for the missing
data distorts the distribution of the data. The distribution becomes too peaked,
which makes this method unsuitable for many analyses performed after imputation (Cheng, 1999).
The three criteria in Table 7 under columns (f), (g) and (h) show the other
measures of variability of the imputed data. In all three criteria, the values for TOTEX2
increase as the nonresponse rate increases. However, this is not the case for TOTIN2.
Surprisingly, the data which have twenty percent nonresponse observations that were
showed contrast with the results of the bias which focused on the population mean of the
imputed data. The mean deviation for all nonresponse rates under the TOTEX2 variable
overestimates the actual data, whereas the bias shows that the
population mean of the imputed data underestimates the actual data. Likewise, for the other
variable, when the mean deviation is an underestimate, the result from the bias is
Table 8 shows the results of the different criteria in evaluating imputed data with
imputations using the hot deck imputation (HDI3) method with three imputation classes.
Table 8
Criteria Results for the HDI3 Method
imputed data increases for both variables as the NRR increases. As seen in OMI, for the
TOTIN2 variable, the bias of the data which has twenty percent imputations is more than
four times the bias of the data which contained ten percent imputed and almost half the
bias of the data which has thirty percent imputed. The bias in the TOTIN2 variable in this
Results similar to those of OMI were seen for the other variable, TOTEX2: in the
data which contained thirty percent imputations, the bias becomes negative. The bias
seemed to decrease in value as the NRR increases. The biases for the first and second
The variance of the population mean of the imputed data increases
by more than one hundred percent as the nonresponse rate increases. The data with
the fewest imputations produced the smallest spread of the population
means, and the data with the most imputations produced the largest
spread.
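The variance in column (d) comes from repeating a random imputation many times. The sketch below shows one way this could be done for hot deck imputation within imputation classes. It assumes donors are drawn at random with replacement from the respondents of the same class and that the sample variance of the simulated means is reported; the study's actual donor selection rule is not described in this section, and all names are hypothetical.

```python
import random

def hot_deck_impute(values, classes, rng):
    """Fill each missing value (None) with a random donor from the same class."""
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    return [rng.choice(donors[c]) if v is None else v
            for v, c in zip(values, classes)]

def variance_of_imputed_mean(values, classes, n_sims=1000, seed=0):
    """Sample variance of the completed-data mean across repeated hot deck runs."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        completed = hot_deck_impute(values, classes, rng)
        means.append(sum(completed) / len(completed))
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / (len(means) - 1)

# Hypothetical data: two imputation classes, None marks nonresponse.
values = [10.0, 12.0, None, 50.0, 80.0, None]
classes = ["IC1", "IC1", "IC1", "IC2", "IC2", "IC2"]
var_mean = variance_of_imputed_mean(values, classes)
```

With more values set to missing, more of the completed mean depends on random donor draws, which is consistent with the variance growing as the nonresponse rate increases.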
Results in column (e) show that, in TOTIN2, the imputed data maintained the
distribution of the actual data for the data which contained ten and twenty percent
imputations. For the other variable, only the data which contained ten percent imputations
maintained the distribution of the actual data in all one thousand data sets. In
the data which contained twenty percent imputations, only 969 out of the 1000 data sets
In the data sets which contained the largest number of imputations, both variables
failed to maintain the distribution of the actual data. Worse, none of the simulated data
sets for TOTEX2 registered the same distribution as the actual data, while for TOTIN2 only a
lone data set did so. The researchers considered the
possibility that more than one recipient shared the same donor, or that the
majority of the imputations came from one particular area in the record.
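The PCD counts discussed above can be sketched as the share of simulated data sets that pass a Kolmogorov-Smirnov comparison with the actual data. The version below is an assumption-laden illustration: it compares cumulative relative frequencies at the frequency class cut points and uses the common large-sample 5% critical value 1.36 * sqrt((n + m) / (n * m)). The study's exact test setup is not given in this section, and the function names are hypothetical.

```python
import bisect
import math

def class_cdf(values, cut_points):
    """Share of values strictly below each class cut point."""
    ordered = sorted(values)
    return [bisect.bisect_left(ordered, c) / len(ordered) for c in cut_points]

def same_distribution(imputed, actual, cut_points, coef=1.36):
    """True if the KS distance does not exceed the approximate 5% critical value."""
    d = max(abs(a - b) for a, b in zip(class_cdf(imputed, cut_points),
                                       class_cdf(actual, cut_points)))
    n, m = len(imputed), len(actual)
    return d <= coef * math.sqrt((n + m) / (n * m))

def pcd(simulated_sets, actual, cut_points):
    """Percentage of simulated data sets judged to keep the actual distribution."""
    kept = sum(same_distribution(s, actual, cut_points) for s in simulated_sets)
    return 100.0 * kept / len(simulated_sets)

# Hypothetical data: one faithful copy and one set collapsed onto a single value.
actual = [float(v) for v in range(1, 101)]
cut_points = [25.0, 50.0, 75.0]
simulated = [list(actual), [50.0] * 100]
share = pcd(simulated, actual, cut_points)
```

In this toy example the collapsed data set, which mimics imputing one value everywhere, is rejected, so only half of the simulated sets count toward the PCD.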
For the three remaining criteria, the values generated were better than the results
of the OMI method. For both variables, the MD criterion indicated an
underestimation of the actual observations. While the OMI method overestimates the
deleted actual values for the TOTIN2 variable, HDI3 underestimates them. The
underestimation increases rapidly as the nonresponse rate increases. The magnitude of the
MD for TOTIN2 is larger for HDI3 than for OMI at all nonresponse rates.
Similar to the results in MD for TOTIN2, the MAD and RMSD were unusually
large compared to OMI. It seems that the imputation classes were not as effective for the
TOTIN2 variable as for the TOTEX2 variable, for which the majority of values
across all the nonresponse rates and criteria showed that HDI3 was better than OMI.
Table 9 shows the results of the different criteria in evaluating the imputed data
using the deterministic regression imputation method with three imputation classes
(DRI3).
Table 9

(a) VI   (b) NRR   (c) Bias(y')   (d) Var(y')   (e) PCD   (f) MD   (g) MAD   (h) RMSD
In Table 9, the bias for all NRRs and VIs is negative, which
indicates that the population mean of the imputed data is underestimated. The
nonresponse bias results from this method are similar to those of the previous two
methods in that TOTIN2 is underestimated. However, unlike in OMI and
HDI3, where the bias increases tremendously as the nonresponse rate increases, the
increase in bias for this method is much slower. The bias of the data set with twenty
percent imputations is just twice the bias of the data set with
a lower percentage of imputations. For the TOTEX2 variable, this method produces more
biased estimates at all NRRs than the two previous methods.
As in the OMI method, the variance for this method is also zero since the
In contrast to the results of the OMI method under this criterion, DRI3
maintained the distribution for all the NRRs and VIs. It is even better than HDI3,
since all of the imputed data sets under all NRRs and VIs preserved the same distribution
as the actual data. It is interesting to note that the regression models used in this
study did not follow the same format as those in the related literature and produced a distinct
result. Earlier studies that used categorical auxiliary variables, which correspond to
the matching variables in this study, concluded that deterministic regression,
like mean imputation, generates distorted and peaked distributions.
However, in this study, the independent variable was the first-visit VI, and for each
imputation class a separate model was fitted with a better r², which made the
difference.
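The idea of fitting one model per imputation class with the first-visit VI as predictor can be sketched as follows. This is a simplified illustration that assumes a straight-line model fitted by ordinary least squares; the functional form of the study's prediction models is not shown in this section, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y = a + b*x.

    Needs at least two distinct x values per class.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def regression_impute(first_visit, second_visit, classes):
    """Fill missing second-visit values (None) from a line fitted within each class."""
    models = {}
    for c in set(classes):
        pairs = [(x, y) for x, y, k in zip(first_visit, second_visit, classes)
                 if k == c and y is not None]
        models[c] = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    completed = []
    for x, y, c in zip(first_visit, second_visit, classes):
        if y is None:
            intercept, slope = models[c]
            y = intercept + slope * x
        completed.append(y)
    return completed

# Hypothetical data: None marks a second-visit value set to nonresponse.
first_visit = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0]
second_visit = [2.0, 4.0, None, 8.0, 15.0, 25.0, None]
classes = ["IC1"] * 4 + ["IC2"] * 3
completed = regression_impute(first_visit, second_visit, classes)
```

Because each missing case receives a prediction that depends on its own first-visit value, the imputed values are spread along the fitted line instead of piling up at a single class mean.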
Similar to the results for the nonresponse bias, the MD for all VIs and NRRs
underestimates the actual observations. The underestimation is almost stable across NRRs
because the rate of change is very small compared to the two previous IMs. The MAD
and RMSD show better results than OMI and HDI3, with imputed values closer
to the actual observations. As seen in OMI and HDI3, TOTIN2 has larger values for
the MAD and RMSD criteria. Fitting models with high r² was the key factor that made
Table 10 shows the results of the different criteria in evaluating the imputed data
using the stochastic regression imputation method with three imputation classes (SRI3).
Table 10
Criteria Results for SRI3
(a) VI   (b) NRR   (c) Bias(y')   (d) Var(y')   (e) PCD   (f) MD   (g) MAD   (h) RMSD
The only method that produced reasonable estimates is the SRI3 method. The
random residual added to the deterministic predicted observation made the difference.
There is no clear relationship between the nonresponse bias of the population
mean and the nonresponse rate: the biases fluctuate from one nonresponse rate to the
next. This method provided the least bias at the highest nonresponse rate for both TOTEX2
and TOTIN2. While the other methods reached a four-digit bias, SRI3 generated a
much smaller bias than the other three methods. In fact, there is a huge disparity at the
third nonresponse rate, where it produced less than twenty percent of the bias
The variances of SRI3 proved to be much better than those of its model-free
counterpart, HDI3. Across all nonresponse rates, there is a clear and
large disparity between the variances of SRI3 and HDI3: the variances
from HDI3 are almost ten times larger than those from SRI3.
SRI3 also performed better than its model-free counterpart,
HDI3, which likewise simulated the data 1000 times. Unlike hot deck imputation,
stochastic regression imputation maintained the same distribution for all imputed data
sets at the first and third nonresponse rates. It also outperformed the former at the
second nonresponse rate for the TOTEX2 variable. One possible reason why 16 out of the 1000
sets failed to maintain the distribution of the actual data set, for the imputed data which
contained twenty percent or 826 imputations, is the infeasibility of the predicted
values.
In earlier studies, stochastic regression imputation performed better than the other
methods used here. The random residual is added to the deterministic predicted
value to preserve the distribution of the data. However, even if the original deterministic
imputed values are feasible, their stochastic counterparts need not be: after adding the
residual to the deterministic imputation, infeasible values can result (Nordholt,
1998).
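The feasibility problem described above can be illustrated with a small sketch. It assumes the random residual is resampled from the residuals of the fitted model and treats negative income or expenditure as infeasible; the study may instead have drawn residuals from a fitted distribution, and the function name and numbers are hypothetical.

```python
import random

def stochastic_impute(predicted, residuals, rng, lower_bound=0.0):
    """Add a randomly drawn residual to each deterministic prediction.

    Returns the imputed values and the number that fell below lower_bound,
    i.e. values that would be infeasible for income or expenditure.
    """
    imputed = [p + rng.choice(residuals) for p in predicted]
    infeasible = sum(1 for v in imputed if v < lower_bound)
    return imputed, infeasible

# Hypothetical predictions and model residuals, for illustration only.
predicted = [5000.0, 20000.0, 90000.0]
residuals = [-12000.0, -3000.0, 4000.0, 11000.0]
imputed, infeasible = stochastic_impute(predicted, residuals, random.Random(0))
```

A small prediction combined with a large negative residual is the case that can yield an infeasible value, so counting such cases is one simple way to check whether they might explain data sets that fail the distribution test.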
Similar to the results for the nonresponse bias, the MD has no clear relationship with the
NRR, since it fluctuates from one NRR to another. On this criterion, SRI3
outperformed its regression counterpart but was itself outperformed by the two other
methods. In contrast to the MD results, on the MAD and RMSD criteria SRI3
follows closely in second place behind DRI3 and provides better values than the two other
methods.
In the review of related literature, stochastic regression performs far better
than deterministic regression. The researchers point to the same reason given for the
previous criterion: it is possible that the predicted values are unrealistic as compared
After comparing the different methods using the criteria proposed in the
methodology, the distributions of the true values (TVs) that were deleted and of the imputed
values (IVs) from each of the imputation procedures were computed for all the VIs and
nonresponse rates. Tables 11, 12 and 13 show the frequency distributions of the methods
with their corresponding relative frequencies (RFs) for the first, second and third
nonresponse rates, respectively. The RFs for the 1000 simulated data sets from HDI3 and
SRI3 were averaged. The first column gives the frequency classes of the VI. These are the
same classes that were used in the Kolmogorov-Smirnov Goodness of Fit Test to
determine the estimated percentage of similar distributions of the imputed data. The
second column gives the relative frequencies of the actual data. The succeeding columns are
Table 11
Distribution of the TVs and IVs from the imputation procedures: 10% NRR

                        Imputation Procedures
TOTIN2       TV         OMI        HDI3*      DRI3       SRI3*
<40570        9.70%      0.00%     15.10%      6.10%      9.10%
40570-       10.20%      0.00%     11.90%      8.70%      7.90%
51564-        9.40%      0.00%     10.10%     14.50%      8.30%
62006.5-     10.20%      0.00%      9.50%     10.70%     10.00%
73900.5-      9.00%      0.00%      9.60%     12.80%     12.40%
88127-       10.90%      0.00%      9.30%      9.20%      9.00%
104801-      11.90%    100.00%      9.80%      9.90%     10.50%
128000-      11.40%      0.00%      7.80%     11.10%      9.30%
161669-       7.70%      0.00%      8.00%     10.70%     11.20%
233907-       9.90%      0.00%      8.90%      6.30%     12.30%
* RFs for each class were obtained by taking the average of
the 1000 simulated data sets.
Table 12
Distribution of the TVs and IVs from the imputation procedures: 20% NRR

                        Imputation Procedures
TOTIN2       TV         OMI        HDI3*      DRI3       SRI3*
<40570       10.00%      0.00%     15.70%      4.80%     11.80%
40570-       10.30%      0.00%     12.10%     11.90%     12.20%
51564-       11.70%      0.00%     10.10%     10.20%     11.30%
62006.5-     10.20%      0.00%      9.60%     11.70%      9.90%
73900.5-      8.60%      0.00%      9.50%     11.90%      8.50%
88127-        9.40%      0.00%      9.30%      9.60%     10.10%
104801-       9.10%    100.00%      9.70%     11.70%      9.00%
128000-       9.20%      0.00%      7.60%      9.80%      8.30%
161669-      11.30%      0.00%      7.80%      9.70%      8.90%
233907-      10.20%      0.00%      8.70%      8.70%     10.10%
* RFs for each class were obtained by taking the average of
the 1000 simulated data sets.
Table 13
Distribution of the TVs and IVs from the imputation procedures: 30% NRR

                        Imputation Procedures
TOTIN2       TV         OMI        HDI3*      DRI3       SRI3*
<40570        9.40%      0.00%     15.60%      6.50%      8.90%
40570-        9.00%      0.00%     12.10%     10.40%      8.20%
51564-        9.90%      0.00%     10.10%     10.80%      8.80%
62006.5-     10.70%      0.00%      9.60%     11.50%     10.10%
73900.5-     10.20%      0.00%      9.50%     12.20%     11.00%
88127-       10.30%      0.00%      9.30%     10.70%     10.20%
104801-      10.30%    100.00%      9.70%     10.50%     10.40%
128000-       9.80%      0.00%      7.60%     11.20%     10.80%
161669-      10.70%      0.00%      7.70%      8.20%     10.30%
233907-       9.90%      0.00%      8.70%      8.00%     11.30%
The actual and imputed data with the lowest number of observations set to
missing clearly illustrate the distortion of the distribution created by the OMI method.
The OMI method assigns the mean of the first visit VI to all the missing cases; as a result,
all the imputed values concentrate in one
frequency class. The three methods which implemented imputation classes gave a better
For the HDI3 method, at all nonresponse rates, the imputed observations
clustered most heavily in the first frequency class, that is, less than 37859.5 for TOTEX2 and
less than 40570 for TOTIN2. Clustering also appeared in the last frequency class for TOTEX2
at the first and third nonresponse rates, and in the second frequency class for TOTIN2
at all nonresponse rates. The percentage of the data in the lowest class for TOTEX2 and
TOTIN2 ranges from 14% to 16% across all nonresponse rates, compared to the actual percentage
observed from the interval 86103-126254.5 for the 10% and 20% nonresponse imputed
data sets respectively and from the interval 63265-101947 for the 30% nonresponse
imputed data sets. The percentage from the interval indicated for the 10% and 20% under
the actual data totaled about 30% while the imputed data only totaled less than 30%.
The two regression imputation methods, unlike hot deck and OMI, which each had a
major cluster, produced more evenly spread distributions, although there are some areas that are
regression resulted in a severe underrepresentation of the data, in particular in the first
frequency class. On the other hand, SRI3, which adds a random residual,
provided better results than DRI3. However, there are some areas that the added random
the following IMs will be chosen as the “best” IMs for this particular study and data. The
selection of the best method will be made independently for each VI and NRR. The rankings are
based on a four-point system wherein the rank value of 4 denotes the worst IM for a
specific criterion and 1 denotes the best IM for that criterion. In case of ties, the average
ranks will be substituted. The IM with the smallest rank total will be declared the “best”
IM for the particular VI and NRR. The ranking of IMs will cover the following criteria: (a)
nonresponse bias, (b) percentage of correct distributions, and (c) the other measures of
variability. All in all, there are five criteria on which each IM will be ranked.
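The ranking scheme described above can be sketched as follows. This is an illustration under stated assumptions: each criterion is expressed as a score where smaller is better, ties receive the conventional average (midrank) of the positions they occupy, and the function names and scores are hypothetical.

```python
def average_ranks(scores):
    """Rank scores so that 1 is best (smallest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def rank_totals(criteria):
    """Sum the per-criterion ranks of each method; the smallest total is best."""
    methods = list(next(iter(criteria.values())))
    totals = dict.fromkeys(methods, 0.0)
    for scores in criteria.values():
        ranks = average_ranks([scores[m] for m in methods])
        for m, r in zip(methods, ranks):
            totals[m] += r
    return totals

# Hypothetical "smaller is better" scores for two criteria.
criteria = {
    "N.B.": {"OMI": 900.0, "HDI3": 1500.0, "DRI3": 400.0, "SRI3": 100.0},
    "PCD": {"OMI": 100.0, "HDI3": 99.9, "DRI3": 0.0, "SRI3": 0.0},
}
totals = rank_totals(criteria)
```

Under this midrank convention a two-way tie for first receives 1.5, as in Table 16, but a three-way tie for first would receive 2, whereas Tables 14 and 15 report 1.3 for such ties, so the study appears to have used a different rule in that case.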
Tables 14, 15 and 16 show the rankings of the different imputation methods for the
10%, 20% and 30% NRR, respectively. Each table is divided into six columns. The first
column gives the VI, the second the criterion, and the third to the sixth columns the
imputation methods.
Table 14

VI        Criterion        OMI     HDI3    DRI3    SRI3
TOTIN2    N.B.             1       2       4       3
          PCD              4       1.3     1.3     1.3
          MD               1       2       4       3
          MAD              3       4       1       2
          RMSD             3       4       1       2
          TOTAL            12      13.3    11.3    11.3
          Category Rank    3rd     4th     1st     1st
Table 15

VI        Criterion        OMI     HDI3    DRI3    SRI3
TOTIN2    N.B.             3       4       2       1
          PCD              4       1.3     1.3     1.3
          MD               3       4       2       1
          MAD              3       4       1       2
          RMSD             3       4       1       2
          TOTAL            16      17.3    7.3     7.3
          Category Rank    3rd     4th     1st     1st
Table 16

VI        Criterion        OMI     HDI3    DRI3    SRI3
TOTIN2    N.B.             3       4       2       1
          PCD              4       3       1.5     1.5
          MD               3       4       2       1
          MAD              3       4       1       2
          RMSD             3       4       1       2
          TOTAL            16      19      7.5     7.5
          Category Rank    3rd     4th     1st     1st
provided better results than their model-free counterparts. For all the nonresponse rates
under the TOTIN2 variable, the two RIMs tied as the best imputation method, and
surprisingly HDI3 finished as the worst imputation method, behind OMI. Under the
TOTEX2 variable, the rankings were mixed across nonresponse rates. The RIMs still
performed well. The SRI3 method finished first at the 10% and 30% NRR and
third at the 20% NRR, while the DRI3 method finished third, first and second at
the 10%, 20% and 30% NRR, respectively. While HDI3 was the worst IM for
TOTIN2, OMI was the worst IM for TOTEX2, ranking last at both the
10% and 20% NRR and third at the last NRR.
To conclude, the best imputation method in this study, which used the 1997 FIES data, is
stochastic regression imputation with three imputation classes. It is very closely
followed by deterministic regression imputation with three imputation classes.
SRI3 never ranked last in any of the criteria, NRRs and VIs, unlike DRI3, which
was the worst IM on the nonresponse bias and mean deviation criteria. The
researchers selected HDI3 as the worst IM in this study. The HDI3 method fared the
worst in TOTIN2 and majority of the results in the different criteria under each NRR and