Discussion of Results Complete

Chapter 6
Results and Discussion
A. Descriptive Statistics of the Second Visit Data Variables
Table 1 shows the descriptive statistics of the second visit variables of interest
(VIs). These were computed to give a brief idea of how much a household spends and
earns over the period, to measure the differences between the two variables, and to
serve as a baseline for the tests that follow. The descriptive statistics (DS) will also
be used to assess the imputation classes (ICs), that is, how well the observations are
grouped.
Table 1
Descriptive Statistics
Variable   Mean        Std. Dev.   Minimum     Maximum    N
TOTEX2     102389.8    129866.6    8926.000    3203978    4130
TOTIN2     134119.4    216934.9    9067.000    4357180    4130
The average total spending of a household in the National Capital Region (NCR)
is about P102389.8, while the average total earnings amount to P134119.4, a difference
of more than thirty thousand pesos. TOTIN2 observations are larger on average and more
spread out than TOTEX2 observations, as shown by the larger mean and standard
deviation, respectively. The dispersion can also be seen from the minimum and maximum
of the two variables. The range of TOTIN2 exceeds four million pesos, while the range
of TOTEX2 is about one million pesos lower; both point to extreme variability in the
observations.

B. Formation of Imputation Classes (IC)
Table 2 shows the results of the chi-square test, which was done to determine whether
the candidate matching variables (MVs) are associated with the VIs. As stated in the
methodology, an MV must be highly correlated with the variables of interest. The first
visit VIs, TOTIN1 and TOTEX1, were each grouped into four categories (CODIN1 and
CODEX1, respectively) in order to satisfy the assumptions of the association tests. The
first visit VIs were tested for association rather than the second visit VIs because the
second visit VIs already contained missing data.
The candidate MVs tested were the provincial area code (PROV), the recoded education
status (CODES1) and the recoded total number of employed household members (CODEP1).
PROV has four categories, CODES1 has three, and CODEP1 has four. The original education
status and employment variables had many more categories (more than 60 and 7,
respectively), so they were recoded into a smaller number of groups.
In Table 2, the entry for each candidate is the χ2 test statistic, with its p-value in
parentheses below it.
Table 2
Tests of Association for Matching Variable:
The Chi-Square Test of Independence
          CODIN1           CODEX1
PROV      χ2 = 151.78      χ2 = 137.83
          (<0.0001)        (<0.0001)
CODES1    χ2 = 613.859     χ2 = 687.342
          (<0.0001)        (<0.0001)
CODEP1    χ2 = 358.436     χ2 = 193.132
          (<0.0001)        (<0.0001)
The chi-square tests showed that PROV, CODES1 and CODEP1 are all associated with
CODIN1 and CODEX1. The p-values for all the candidates were less than 0.0001,
indicating that the associations are highly significant. Because all three candidates
pass, the chi-square test alone cannot identify the best MV; the succeeding tests of
association determine which of the three candidates is chosen as the MV of the study.
Table 3 shows the other measures of association, namely the phi coefficient, Cramer's
V and the contingency coefficient. These were computed to assess the degree of
association of the candidates with CODIN1 and CODEX1.
Table 3
Tests of Association:
Degree of association
          Phi-Coefficient      Cramer's V           Contingency Test
          CODIN1    CODEX1     CODIN1    CODEX1     CODIN1    CODEX1
PROV      0.192     0.183      0.111     0.105      0.188     0.18
CODES1    0.386     0.408      0.273     0.288      0.36      0.378
CODEP1    0.295     0.216      0.17      0.125      0.283     0.211
The table above displays the degree of association between the candidates and the
variables of interest. All the measures indicate weak association. In real, complex
data, associations between variables tend to be small or even absent. Only CODES1
reached at least 0.20 (twenty percent), the minimum required for dividing the data into
imputation classes, on every measure and for both VIs. This result is sufficient to
select CODES1 as the MV for this data.
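The three measures in Table 3 can be recovered from the chi-square statistics in Table 2. The following is a minimal sketch in Python, assuming the standard textbook definitions (phi = sqrt(χ2/n), Cramer's V = sqrt(χ2/(n·min(r−1, c−1))), contingency coefficient = sqrt(χ2/(χ2+n))) and n = 4130 households from Table 1. The function name is illustrative and is not taken from the study.

```python
import math

def association_measures(chi2, n, n_rows, n_cols):
    """Return (phi, Cramer's V, contingency coefficient) from a chi-square statistic.

    Assumes the standard definitions; n is the number of observations and
    n_rows, n_cols are the numbers of categories of the two variables.
    """
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    contingency = math.sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency

# CODES1 (3 categories) against CODIN1 (4 categories), chi-square from Table 2
phi, v, c = association_measures(613.859, 4130, 3, 4)
print(round(phi, 3), round(v, 3), round(c, 3))
```

Under these assumptions the CODES1 and CODIN1 pair gives values that round to the corresponding entries of Table 3 (0.386, 0.273 and 0.36).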
To describe the CODES1 imputation classes in detail, descriptive statistics were
computed for each imputation class; these are shown in Table 4. The descriptive
statistics tell whether the chosen MV decreases the variability of the observations.
To check the variability within each imputation class, its standard deviation is
compared with the overall standard deviation of the variable of interest.
Table 4
Descriptive Statistics of the data grouped into ICs
          Mean        Minimum     Maximum     Std. Dev    Valid N
TOTIN2
IC1       93588.32    9067.000    1340900     75619.52    2635
IC2       186940.9    14490.00    4215480     281852.3    1434
IC3       643191.2    54790.00    4357180     829409.3    61
TOTEX2
IC1       74866.68    9025.000    731937.0    47517.69    2635
IC2       135510.8    13575.00    3203978     151984.3    1434
IC3       413184.0    40505.00    2726603     532577.1    61
The table above shows that, for both VIs, IC1, which has the largest number of
observations, has a smaller spread than the other two ICs. IC2 and IC3 have large
standard deviations, but these are offset by the low value in IC1, which holds the
largest proportion of the data. The standard deviation and the mean of IC3 are probably
large because the majority of the extreme values fall in that class.
C. Mean of the simulated data by nonresponse rate for each VI
Table 5 shows the means of both VIs under the varying rates of nonresponse. This was
generated to describe briefly the effect of the nonresponse rate on the population mean
when the missing values are ignored. More importantly, the results below serve as input
for comparing the estimates from the imputed data under each imputation method (IM).
Table 5
Mean of the retained and deleted observations
                 Observations retained     Observations deleted
VI       NRR     No.      Mean             No.      Mean
TOTEX2   10%     3717     102748.610       413      99160.235
         20%     3304     102219.791       826      103069.697
         30%     2891     100709.947       1239     106309.365
TOTIN2   10%     3717     134821.662       413      127799.121
         20%     3304     133624.722       826      136098.155
         30%     2891     130685.596       1239     142131.636
The means of the observations set to nonresponse and of the observations retained
move in opposite directions. For both VIs, the mean of the observations set to
nonresponse increases as the nonresponse rate gets larger. Conversely, the mean of the
observations retained decreases as the nonresponse rate increases. It is possible that
large values were set to nonresponse, which raised the means of the deleted
observations at the higher rates of nonresponse.
Comparing the means across nonresponse rates for each VI, there is little difference
between the population mean that ignores the missing data and the population mean of
the actual data. However, as described above, the deviation between the means of the
actual and retained data slowly increases as the number of missing values increases.
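The relation between the retained mean, the deleted mean and the mean of the actual data can be checked with simple arithmetic: the size-weighted average of the two group means should return the overall mean in Table 1. A minimal sketch in Python using the TOTEX2 rows of Table 5; the function and variable names are illustrative only.

```python
def combined_mean(n_retained, mean_retained, n_deleted, mean_deleted):
    """Size-weighted mean of the retained and deleted observations."""
    total = n_retained * mean_retained + n_deleted * mean_deleted
    return total / (n_retained + n_deleted)

# TOTEX2 rows of Table 5: (NRR, retained n, retained mean, deleted n, deleted mean)
totex2_rows = [
    ("10%", 3717, 102748.610, 413, 99160.235),
    ("20%", 3304, 102219.791, 826, 103069.697),
    ("30%", 2891, 100709.947, 1239, 106309.365),
]
overall_mean = 102389.8  # TOTEX2 mean of the actual data, Table 1

for nrr, n_r, mean_r, n_d, mean_d in totex2_rows:
    full = combined_mean(n_r, mean_r, n_d, mean_d)
    # Deviation of the respondents-only mean from the actual mean
    deviation = mean_r - overall_mean
    print(nrr, round(full, 1), round(deviation, 1))
```

The deviation printed in the last column is the error made by ignoring the missing values, which is the quantity the imputation methods are meant to reduce.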
D. Regression model adequacy

Table 6 shows the different regression models, for all VIs and nonresponse rates
(NRRs), that were checked for adequacy. The columns are as follows: (a) the VI,
(b) the nonresponse rate (NRR), (c) the IC, (d) the prediction model, (e) the coefficient
of determination (R2) and (f) the F-statistic and its p-value.

Table 6
Model Adequacy Results
(a)      (b)    (c)    (d)                                    (e)      (f)
VI       NRR    IC     Model Fitted                           R2       F-Stat (p-value)
TOTEX2   10%    IC1    ŷ_i = 2.374080 + 0.789973 LNFVE1_i     0.728    6363.59 (<0.0001)
                IC2    ŷ_i = 2.365385 + 0.794573 LNFVE2_i     0.782    4574.95 (<0.0001)
                IC3    ŷ_i = 1.269474 + 0.890618 LNFVE3_i     0.902    516.90 (<0.0001)
         20%    IC1    ŷ_i = 2.374364 + 0.789949 LNFVE1_i     0.734    5786.27 (<0.0001)
                IC2    ŷ_i = 2.361562 + 0.794823 LNFVE2_i     0.787    4268.38 (<0.0001)
                IC3    ŷ_i = 1.232073 + 0.889574 LNFVE3_i     0.901    434.66 (<0.0001)
         30%    IC1    ŷ_i = 2.373163 + 0.789773 LNFVE1_i     0.705    4382.10 (<0.0001)
                IC2    ŷ_i = 2.356166 + 0.795439 LNFVE2_i     0.791    3841.35 (<0.0001)
                IC3    ŷ_i = 1.722736 + 0.853004 LNFVE3_i     0.888    333.71 (<0.0001)
TOTIN2   10%    IC1    ŷ_i = 2.236656 + 0.810047 LNFVI1_i     0.706    5703.61 (<0.0001)
                IC2    ŷ_i = 1.970727 + 0.840702 LNFVI2_i     0.805    5261.48 (<0.0001)
                IC3    ŷ_i = 2.286882 + 0.805997 LNFVI1_i     0.920    642.98 (<0.0001)
         20%    IC1    ŷ_i = 1.873206 + 0.849244 LNFVI2_i     0.703    4954.23 (<0.0001)
                IC2    ŷ_i = 0.938650 + 0.938196 LNFVI3_i     0.821    5275.06 (<0.0001)
                IC3    ŷ_i = 0.963818 + 0.936158 LNFVI3_i     0.915    517.04 (<0.0001)
         30%    IC1    ŷ_i = 2.066815 + 0.824476 LNFVI1_i     0.713    4557.33 (<0.0001)
                IC2    ŷ_i = 1.785109 + 0.856330 LNFVI2_i     0.826    4793.39 (<0.0001)
                IC3    ŷ_i = 0.877192 + 0.939765 LNFVI3_i     0.932    574.82 (<0.0001)
The results show that all of the models fit their respective data sets adequately. The
highest R2 in Table 6 is 93.2%, the coefficient of determination for the third
imputation class of the TOTIN2 variable under the highest NRR, while the lowest is
70.3%, the coefficient of determination for the first imputation class of the TOTIN2
variable under the 20% NRR. For all NRRs and VIs, the third IC generated the highest R2
among the ICs; the lowest R2 among the third-class models is 88.8%, from the 30% NRR of
the TOTEX2 variable. In contrast, the first IC generated the lowest R2 for all NRRs and
VIs. (For the other figures and graphs of the fitted models, see the appendix.)
E. Evaluation of the different imputation methods
In evaluating the different imputation methods (IMs), the results of each IM are
discussed independently. For each IM, the discussion proceeds as follows: (1)
nonresponse bias and variance of the estimates of the population mean from the imputed
data, (2) distribution of the imputed data using the Kolmogorov-Smirnov Goodness of Fit
Test, and (3) other measures of variability using the mean deviation (MD), mean absolute
deviation (MAD) and root mean square deviation (RMSD).
Each table of results contains the following columns: (a) the VI, (b) the NRR, (c) the
bias of the population mean of the imputed data, Bias(ȳ'), (d) the variance of the
population mean of the imputed data, Var(ȳ'), (e) the estimated percentage of imputed
data sets whose distribution matches that of the actual data set (PCD), (f) the Mean
Deviation (MD), (g) the Mean Absolute Deviation (MAD) and (h) the Root Mean Square
Deviation (RMSD).
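The criteria in columns (c), (f), (g) and (h) can be sketched as follows. This is a minimal Python illustration under assumed definitions, since the formulas themselves are given in the methodology rather than in this chapter: deviations are taken as imputed minus actual over the imputed cases, and the bias is the mean of the completed data set minus the mean of the actual data set. The function name and the toy numbers are hypothetical.

```python
import math

def evaluation_criteria(actual_deleted, imputed, n_total):
    """Compare imputed values with the true values they replaced.

    Assumed definitions (not confirmed by the source):
      MD   = mean of (imputed - actual) over the imputed cases
      MAD  = mean of |imputed - actual| over the imputed cases
      RMSD = square root of the mean squared deviation
      bias = mean of the completed data minus mean of the actual data
    """
    m = len(actual_deleted)
    diffs = [imp - act for imp, act in zip(imputed, actual_deleted)]
    md = sum(diffs) / m
    mad = sum(abs(d) for d in diffs) / m
    rmsd = math.sqrt(sum(d * d for d in diffs) / m)
    # Retained values are unchanged, so the two data-set means differ
    # only through the imputed cases.
    bias = sum(diffs) / n_total
    return {"bias": bias, "MD": md, "MAD": mad, "RMSD": rmsd}

# Toy example: 3 imputed cases in a data set of 10 observations
result = evaluation_criteria([10.0, 20.0, 30.0], [12.0, 18.0, 33.0], n_total=10)
print(result)
```

Under these assumed definitions the bias equals the MD multiplied by the nonresponse rate, a relation that most rows of the results tables approximately follow (for example, 4919.40 × 0.10 ≈ 491.9 for TOTEX2 at the 10% NRR in Table 8).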
Overall Mean Imputation
Table 7 shows the results of the different criteria in evaluating the newly created
data with imputations using the overall mean imputation (OMI) method.
Table 7
Criteria Results for the OMI method
(a)      (b)    (c)         (d)        (e)      (f)         (g)         (h)
VI       NRR    Bias(ȳ')    Var(ȳ')    PCD      MD          MAD         RMSD
TOTEX2   10%    640.66      0.00       0.00%    -6406.60    56929.61    108547.82
         20%    499.43      0.00       0.00%    -2497.14    59555.36    119193.32
         30%    -222.76     0.00       0.00%    20310.91    90396.26    271775.35
TOTIN2   10%    -597.84     0.00       0.00%    5978.39     77502.27    167206.24
         20%    -2855.49    0.00       0.00%    14277.43    87469.87    244758.00
         30%    -6093.27    0.00       0.00%    742.53      62388.11    151740.94
(1) Nonresponse bias and variance
In column (c) of Table 7, the bias decreases in value for both VIs as the nonresponse
rate increases. The decrease is faster and more dramatic for TOTIN2 than for TOTEX2:
the TOTIN2 bias at the 20% NRR is almost five times the bias at the 10% NRR, and the
bias at the 30% NRR is more than twice the bias at the 20% NRR. In contrast, the bias
for TOTEX2 changes much more slowly. In absolute value, the TOTIN2 biases at the 20%
and 30% NRR are more than five times the corresponding TOTEX2 biases.
The variance is zero for all NRRs and VIs because the population mean of the imputed
data set is constant. Unlike for hot deck imputation (HDI3) and stochastic regression
imputation (SRI3), the data were not simulated one thousand times. With only a single
simulation, the OMI method does not produce a sampling distribution for the mean of the
imputed data.
(2) Distribution of the imputed data
Column (e) of Table 7 shows that, for all nonresponse rates and variables, the OMI
method failed to maintain the distribution of the actual data. This was expected,
primarily because every missing observation in every data set was replaced by a single
value, the overall mean of the first visit VI.
Other studies have found the OMI to be one of the worst imputation methods. Although
the process is simple, it gives inaccurate results, mainly for cases that differ
substantially from the imputed value. Also, imputing a single value for all the missing
data distorts the distribution of the data: the distribution becomes too peaked, which
makes this method unsuitable for many post-analyses (Cheng, 1999).
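The peaking described above can be illustrated with a short sketch: when every missing case receives the same value, all imputed values land in a single frequency class. The Python below is only an illustration; the class edges follow the TOTEX2 class labels used in the frequency tables later in this chapter, and the fill value of 110000.0 is a hypothetical stand-in for the first visit overall mean, which is not reported here.

```python
from bisect import bisect_right

# Boundaries between the ten TOTEX2 frequency classes (taken from the class labels)
EDGES = [37859.5, 47056.5, 54922, 63265, 73868, 86103, 101947, 126254.5, 169964]

def overall_mean_impute(n_missing, fill_value):
    """Overall mean imputation: every missing case gets the same value."""
    return [fill_value] * n_missing

def relative_frequencies(values, edges):
    """Share of values in each class; class k holds edges[k-1] <= v < edges[k]."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

# 413 missing cases (10% NRR) filled with a hypothetical overall mean
imputed = overall_mean_impute(413, 110000.0)
print(relative_frequencies(imputed, EDGES))
```

Whatever the actual fill value is, one class receives a relative frequency of 1 and the other nine receive 0, which is why the distribution of the imputed values cannot match that of the deleted true values.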
(3) Other measures of variability
Columns (f), (g) and (h) of Table 7 show the other measures of variability of the
imputed data. For TOTEX2, the values of all three criteria increase as the nonresponse
rate increases. However, this is not the case for TOTIN2. Surprisingly, the data with
twenty percent of the observations imputed have the highest values for the three
criteria.

It is worth noting that the mean deviation, which focuses on each observation,
contrasts with the bias, which focuses on the population mean of the imputed data. For
every NRR under both variables, the MD and the bias have opposite signs, so where the
MD points to an overestimate of the actual data, the bias points to an underestimate,
and vice versa.
Hot Deck Imputation
Table 8 shows the results of the different criteria in evaluating imputed data with
imputations using the hot deck imputation (HDI3) method with three imputation classes.
Table 8
Criteria Results for the HDI3 Method
(a)      (b)    (c)         (d)        (e)        (f)          (g)          (h)
VI       NRR    Bias(ȳ')    Var(ȳ')    PCD        MD           MAD          RMSD
TOTEX2   10%    491.91      408.44     100.00%    4919.40      78071.61     79251.22
         20%    179.42      913.04     96.90%     897.18       78292.63     67149.16
         30%    -606.37     1344.20    0.00%      -2021.19     81395.79     71390.65
TOTIN2   10%    -717.52     804.33     100.00%    -7175.25     105369.15    242022.99
         20%    -3095.41    1778.01    100.00%    -15477.09    111748.04    297151.50
         30%    -6508.65    2547.34    1.00%      -21695.52    115087.13    313814.92
(1) Nonresponse Bias and Variance

As in the OMI method, the bias of the population mean of the imputed data decreases in
value for both variables as the NRR increases; for TOTIN2, this means that the bias
grows in magnitude. As seen in OMI, the TOTIN2 bias of the data with twenty percent
imputations is more than four times the bias of the data with ten percent imputations
and almost half the bias of the data with thirty percent imputations. The TOTIN2 bias
under this method is a little worse than under the OMI method.
For the other variable, TOTEX2, the results are also similar to OMI: the bias decreases
in value as the NRR increases and becomes negative in the data with thirty percent
imputations. For the first and second NRRs, HDI3 produced smaller TOTEX2 biases than
OMI.
The variance of the population mean of the imputed data increases with the nonresponse
rate, more than doubling from the 10% to the 20% NRR. The data with the fewest
imputations gave the smallest spread of the population means, and the data with the
most imputations gave the largest spread.
(2) Distribution of the imputed data
Column (e) shows that, for TOTIN2, the imputed data maintained the distribution of the
actual data for the data with ten and twenty percent imputations. For TOTEX2, only the
data with ten percent imputations maintained the distribution of the actual data in all
one thousand data sets. With twenty percent imputations, only 969 out of the 1000 data
sets maintained the distribution of the actual data.
In the data sets with the largest number of imputations, both variables failed to
maintain the distribution of the actual data. None of the simulated data sets for
TOTEX2 registered the same distribution as the actual data, while for TOTIN2 only a
lone data set did. The researchers consider it possible that more than one recipient
shared the same donor, or that the majority of the imputations came from one particular
area in the record.
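The donor-sharing explanation is easier to see in code. The following is a minimal Python sketch of hot deck imputation within imputation classes, assuming donors are drawn at random with replacement from the respondents in the same class; the study's exact donor-selection rule is not described in this chapter, and the function name and toy values are hypothetical.

```python
import random

def hot_deck_impute(values, classes, rng):
    """Fill each missing value (None) with a random donor from the same class."""
    donors = {}
    for value, ic in zip(values, classes):
        if value is not None:
            donors.setdefault(ic, []).append(value)
    completed = []
    for value, ic in zip(values, classes):
        if value is None:
            # Drawn with replacement, so several recipients may share one donor
            value = rng.choice(donors[ic])
        completed.append(value)
    return completed

values = [74000.0, None, 81000.0, 190000.0, None, 150000.0]
classes = ["IC1", "IC1", "IC1", "IC2", "IC2", "IC2"]
completed = hot_deck_impute(values, classes, random.Random(0))
print(completed)
```

Because the draw is with replacement, the number of recipients per donor grows as the nonresponse rate rises and the donor pool shrinks, which is one way the imputed distribution could drift from the actual one.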
(3) Other measures of variability
For the three remaining criteria, HDI3 generally gave better values than the OMI method
for TOTEX2 but not for TOTIN2. Under the MD criterion, HDI3 underestimates the deleted
actual values for TOTIN2, whereas the OMI method overestimates them. The
underestimation increases rapidly as the nonresponse rate increases, and the magnitude
of the MD for TOTIN2 is larger for HDI3 than for OMI at all nonresponse rates.
As with the MD, the MAD and RMSD for TOTIN2 were unusually large compared to OMI. It
seems that the imputation classes were not as effective for TOTIN2 as for TOTEX2, for
which the majority of values across the nonresponse rates and criteria showed HDI3 to
be better than OMI.
Deterministic Regression Imputation
Table 9 shows the results of the different criteria in evaluating the imputed data
using the deterministic regression imputation method with three imputation classes
(DRI3).
Table 9
Criteria Results for the DRI3 method
(a)      (b)    (c)         (d)        (e)        (f)          (g)         (h)
VI       NRR    Bias(ȳ')    Var(ȳ')    PCD        MD           MAD         RMSD
TOTEX2   10%    -720.46     0.00       100.00%    -7204.56     23839.82    57726.62
         20%    -1469.57    0.00       100.00%    -7347.86     23231.65    53180.02
         30%    -2266.38    0.00       100.00%    -7554.61     24082.88    59795.67
TOTIN2   10%    -1128.45    0.00       100.00%    -11284.46    32115.80    77228.48
         20%    -2211.82    0.00       100.00%    -11059.09    35274.03    114957.43
         30%    -4137.78    0.00       100.00%    -13792.60    34537.36    103253.12
In Table 9, the bias is negative for all NRRs and VIs, which indicates that the
population mean of the imputed data is underestimated. This agrees with the previous
two methods in that TOTIN2 is underestimated. However, unlike in OMI and HDI3, where
the bias grows tremendously as the nonresponse rate increases, the bias under this
method grows much more slowly: the bias of the data with twenty percent imputations is
only about twice the bias of the data with ten percent imputations. For the TOTEX2
variable, this method produces more biased estimates at all NRRs than the two previous
methods.
As in the OMI method, the variance for this method is zero, since the population mean
is constant under a single simulation of the missing observations.

In contrast to the OMI method, DRI3 maintained the distribution of the actual data for
all NRRs and VIs. It also did better than HDI3, since all of the imputed data sets
under all NRRs and VIs preserved the same distribution as the actual data. It is
interesting to note that the regression models used in this study did not follow the
same form as those in the related literature and gave a different result. Earlier
studies that used categorical auxiliary variables, which correspond to the matching
variables in this study, concluded that deterministic regression, like mean imputation,
generates distorted and peaked distributions. In this study, however, the independent
variable was the first visit VI, and a separate model with a high R2 was fitted for
each imputation class, which made the difference.
Similar to the nonresponse bias, the MD underestimates the actual observations for all
VIs and NRRs. The underestimation is almost stable across NRRs because the rate of
change is very small compared to the two previous IMs. The MAD and RMSD show better
results than OMI and HDI3, with imputed values closer to the actual observations. As in
OMI and HDI3, TOTIN2 has larger values than TOTEX2 for the MAD and RMSD criteria.
Fitting models with high R2 was the key factor that made this method better than the
two IMs previously evaluated.
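A deterministic regression imputation for one household can be sketched as below. This is a Python illustration under stated assumptions, not the study's code: it assumes the fitted models are linear on the natural-log scale (the predictor names beginning with LN suggest logged first visit values), that the response is the log of the second visit value, and that the prediction is back-transformed with the exponential. The coefficients are those reported for TOTEX2, 10% NRR, IC1 in the model adequacy table; the first visit value of 80000 is made up.

```python
import math

def dri_predict(first_visit_value, intercept, slope):
    """Deterministic regression imputation for one missing case.

    Assumes a log-log model: ln(second visit) = intercept + slope * ln(first visit).
    """
    log_prediction = intercept + slope * math.log(first_visit_value)
    return math.exp(log_prediction)

# Coefficients reported for TOTEX2, 10% NRR, IC1; the household value is hypothetical
imputed_value = dri_predict(80000.0, intercept=2.374080, slope=0.789973)
print(round(imputed_value, 2))
```

Every household with the same first visit value and class receives the same imputed value, which is why a single run produces a constant population mean and a variance of zero.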

Stochastic Regression Imputation
Table 10 shows the results of the different criteria in evaluating the imputed data
using the stochastic regression imputation method with three imputation classes (SRI3).
Table 10
Criteria Results for SRI3
(a)      (b)    (c)         (d)        (e)        (f)         (g)         (h)
VI       NRR    Bias(ȳ')    Var(ȳ')    PCD        MD          MAD         RMSD
TOTEX2   10%    536.32      48.10      100.00%    5363.47     33683.48    70553.64
         20%    1080.12     123.45     98.40%     5400.71     33782.60    72487.39
         30%    398.39      154.74     100.00%    1328.06     32449.49    72803.60
TOTIN2   10%    897.11      167.90     100.00%    9043.98     51363.17    106374.39
         20%    -1815.39    470.10     100.00%    -9076.98    57429.24    148278.49
         30%    356.50      726.50     100.00%    1188.31     51886.73    131429.61
The only method that produced reasonable estimates is the SRI3 method. The random
residual added to the deterministic predicted value made the difference. There is no
clear relationship between the nonresponse bias of the population mean and the
nonresponse rate: the biases fluctuate from one nonresponse rate to the next. At the
highest nonresponse rate, this method gave the smallest bias for TOTIN2 and the second
smallest for TOTEX2. While its deterministic counterpart reached four-digit biases at
that rate, SRI3 produced less than twenty percent of the DRI3 bias.
The variances under SRI3 were much better than those of its model-free counterpart,
HDI3. For all VIs and nonresponse rates there is a large disparity between the two: the
HDI3 variances are roughly three to nine times larger than the SRI3 variances.
SRI3 also did better than HDI3, which likewise simulated the data 1000 times, in
preserving the distribution. Unlike hot deck imputation, stochastic regression
imputation maintained the same distribution in all imputed data sets for the first and
third nonresponse rates. It also outperformed HDI3 at the second nonresponse rate for
the TOTEX2 variable. One possible reason why 16 of the 1000 data sets with twenty
percent, or 826, imputations failed to maintain the distribution of the actual data set
is the infeasibility of some predicted values.
In earlier studies, stochastic regression imputation performed better than the other
methods used here. The random residual is added to the deterministic predicted value to
preserve the distribution of the data. However, even if the original deterministic
imputed values are feasible, their stochastic counterparts need not be: adding the
residual to the deterministic imputation can produce infeasible values (Nordholt,
1998).
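The step of adding a random residual can be sketched as follows. This Python illustration makes several assumptions that the chapter does not confirm: a log-log model as in the deterministic case, residuals drawn from a normal distribution with mean zero, and a residual standard deviation of 0.3 on the log scale, which is a hypothetical value. The coefficients are those reported for TOTEX2, 10% NRR, IC1.

```python
import math
import random

def sri_predict(first_visit_value, intercept, slope, residual_sd, rng):
    """Stochastic regression imputation: deterministic prediction plus a random residual.

    Assumes a log-log model and normally distributed residuals on the log scale.
    """
    log_prediction = intercept + slope * math.log(first_visit_value)
    log_prediction += rng.gauss(0.0, residual_sd)  # the stochastic part
    return math.exp(log_prediction)

rng = random.Random(1997)  # fixed seed so the run is reproducible
draws = [
    sri_predict(80000.0, 2.374080, 0.789973, residual_sd=0.3, rng=rng)
    for _ in range(5)
]
print([round(d) for d in draws])
```

Repeating the draw gives a different imputed value each time, which is what produces a sampling distribution over the 1000 simulated data sets. The sketch does not check feasibility, so an extreme residual can yield an implausible value, which is the concern raised above.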
Similar to the nonresponse bias, the MD has no clear relationship with the NRR, since
it fluctuates from one NRR to another. Under this criterion, SRI3 outperformed its
regression counterpart but was in some cases outperformed by the two other methods. In
contrast, under the MAD and RMSD criteria, SRI3 follows closely behind DRI3 in second
place and gives better values than the two other methods.
In the review of related literature, stochastic regression performs far better than
deterministic regression. The researchers attribute the difference to the same reason
given for the previous criterion: it is likely that some of the stochastic predicted
values are unrealistic compared to the deterministic predicted values.
After comparing the different methods using the criteria proposed in the methodology,
the distributions of the true values (TVs) that were deleted and of the imputed values
(IVs) from each imputation procedure were computed for all VIs and nonresponse rates.
Tables 11, 12 and 13 show the relative frequencies (RFs) for the first, second and
third nonresponse rates, respectively. The RFs for the 1000 simulated data sets from
HDI3 and SRI3 were averaged. The first column gives the frequency classes of the VI.
These are the same classes used in the Kolmogorov-Smirnov Goodness of Fit Test to
determine the estimated percentage of imputed data sets with the correct distribution.
The second column gives the relative frequencies of the actual data, and the succeeding
columns give those of the imputation methods.

Table 11
Distribution of the TVs and IVs from the imputation procedures: 10% NRR
                          Imputation Procedures
TOTEX2       TV        OMI        HDI3*     DRI3      SRI3*
<37859.5     10.90%    0.00%      13.90%    7.70%     9.50%
37869.5-     9.70%     0.00%      10.20%    8.70%     8.70%
47056.5-     9.70%     0.00%      9.70%     11.40%    6.10%
54922-       11.40%    0.00%      8.90%     12.30%    9.50%
63265-       8.70%     0.00%      9.10%     11.10%    11.40%
73868-       9.70%     0.00%      9.40%     12.60%    11.10%
86103-       10.90%    0.00%      9.40%     8.00%     11.10%
101947-      11.10%    100.00%    8.90%     11.40%    8.50%
126254.5-    9.00%     0.00%      8.90%     9.00%     12.20%
169964-      8.90%     0.00%      11.60%    7.70%     12.10%

TOTIN2       TV        OMI        HDI3*     DRI3      SRI3*
<40570       9.70%     0.00%      15.10%    6.10%     9.10%
40570-       10.20%    0.00%      11.90%    8.70%     7.90%
51564-       9.40%     0.00%      10.10%    14.50%    8.30%
62006.5-     10.20%    0.00%      9.50%     10.70%    10.00%
73900.5-     9.00%     0.00%      9.60%     12.80%    12.40%
88127-       10.90%    0.00%      9.30%     9.20%     9.00%
104801-      11.90%    100.00%    9.80%     9.90%     10.50%
128000-      11.40%    0.00%      7.80%     11.10%    9.30%
161669-      7.70%     0.00%      8.00%     10.70%    11.20%
233907-      9.90%     0.00%      8.90%     6.30%     12.30%
* RFs for each class were obtained by taking the average of
the 1000 simulated data sets.
Table 12
Distribution of the TVs and IVs from the imputation procedures: 20% NRR
                          Imputation Procedures
TOTEX2       TV        OMI        HDI3*     DRI3      SRI3*
<37859.5     9.40%     0.00%      14.30%    7.40%     8.20%
37869.5-     9.70%     0.00%      10.40%    9.60%     7.60%
47056.5-     11.60%    0.00%      9.70%     9.00%     8.20%
54922-       10.00%    0.00%      9.00%     11.00%    7.90%
63265-       9.60%     0.00%      9.20%     12.30%    10.30%
73868-       8.40%     0.00%      9.40%     12.50%    11.90%
86103-       9.60%     0.00%      9.30%     9.90%     10.30%
101947-      11.30%    100.00%    8.70%     10.80%    11.80%
126254.5-    9.70%     0.00%      8.70%     8.80%     11.70%
169964-      10.70%    0.00%      11.30%    8.70%     12.10%

TOTIN2       TV        OMI        HDI3*     DRI3      SRI3*
<40570       10.00%    0.00%      15.70%    4.80%     11.80%
40570-       10.30%    0.00%      12.10%    11.90%    12.20%
51564-       11.70%    0.00%      10.10%    10.20%    11.30%
62006.5-     10.20%    0.00%      9.60%     11.70%    9.90%
73900.5-     8.60%     0.00%      9.50%     11.90%    8.50%
88127-       9.40%     0.00%      9.30%     9.60%     10.10%
104801-      9.10%     100.00%    9.70%     11.70%    9.00%
128000-      9.20%     0.00%      7.60%     9.80%     8.30%
161669-      11.30%    0.00%      7.80%     9.70%     8.90%
233907-      10.20%    0.00%      8.70%     8.70%     10.10%
Table 13
Distribution of the TVs and IVs from the imputation procedures: 30% NRR
                          Imputation Procedures
TOTEX2       TV        OMI        HDI3*     DRI3      SRI3*
<37859.5     9.80%     0.00%      14.30%    7.80%     10.30%
37869.5-     8.80%     0.00%      10.40%    9.00%     9.60%
47056.5-     9.60%     0.00%      9.70%     9.40%     8.30%
54922-       9.50%     0.00%      8.90%     10.80%    9.30%
63265-       11.00%    0.00%      9.20%     12.70%    10.10%
73868-       10.70%    0.00%      9.40%     11.50%    10.60%
86103-       10.70%    0.00%      9.40%     12.10%    9.80%
101947-      9.40%     100.00%    8.70%     8.80%     10.10%
126254.5-    11.00%    0.00%      8.70%     9.00%     8.10%
169964-      9.50%     0.00%      11.30%    9.00%     13.70%

TOTIN2       TV        OMI        HDI3*     DRI3      SRI3*
<40570       9.40%     0.00%      15.60%    6.50%     8.90%
40570-       9.00%     0.00%      12.10%    10.40%    8.20%
51564-       9.90%     0.00%      10.10%    10.80%    8.80%
62006.5-     10.70%    0.00%      9.60%     11.50%    10.10%
73900.5-     10.20%    0.00%      9.50%     12.20%    11.00%
88127-       10.30%    0.00%      9.30%     10.70%    10.20%
104801-      10.30%    100.00%    9.70%     10.50%    10.40%
128000-      9.80%     0.00%      7.60%     11.20%    10.80%
161669-      10.70%    0.00%      7.70%     8.20%     10.30%
233907-      9.90%     0.00%      8.70%     8.00%     11.30%

Starting with the data with the lowest number of observations set to missing, the
distortion of the distribution created by the OMI method is clearly illustrated. The
OMI method assigns the mean of the first visit VI to all the missing cases; as a
result, the entire distribution of the imputed values is concentrated in one frequency
class. The three methods that used imputation classes gave a better outcome than OMI by
spreading out the distribution of the imputed data.
For the HDI3 method, at all nonresponse rates, the imputed observations clustered in
the first frequency class, that is, below 37859.5 for TOTEX2 and below 40570 for
TOTIN2. Clustering also occurred in the last frequency class for TOTEX2 at the first
and third nonresponse rates, and in the second frequency class for TOTIN2 at all
nonresponse rates. The lowest class holds 14-16% of the imputed data for TOTEX2 and
TOTIN2 across the nonresponse rates, compared to only 9-11% of the actual data.
While some classes are overrepresented, underrepresentation was observed in the
interval 86103-126254.5 for the 10% and 20% nonresponse imputed data sets and in the
interval 63265-101947 for the 30% nonresponse imputed data sets. For the 10% and 20%
data sets, the indicated interval holds about 30% of the actual data but less than 30%
of the imputed data.
The two regression imputation methods, unlike hot deck and OMI, which had a major
cluster, produced more spread-out distributions, although some areas are
underrepresented. The failure to include a random residual term in deterministic
regression resulted in severe underrepresentation, in particular in the first frequency
class. SRI3, which includes a random residual, gave better results than DRI3. However,
in some areas the added random residual produced a significant excess, mostly in the
last frequency class.
F. Choosing the best imputation

In this section, the rankings on all the criteria are the basis for determining which
of the IMs is the “best” for this particular study and data. The best method is
selected independently for each VI and NRR. The rankings are based on a four-point
system in which a rank of 4 denotes the worst IM for a specific criterion and 1 denotes
the best. In case of ties, the average rank is assigned. The IM with the smallest rank
total is declared the “best” IM for that VI and NRR. The rankings cover the following
criteria: (a) nonresponse bias, (b) percentage of correct distributions, and (c) the
other measures of variability. All in all, each IM is ranked on five criteria.
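The ranking rule with averaged ties can be sketched in a few lines. This Python illustration assumes that bias is ranked by absolute value (smaller is better) and PCD by percentage (larger is better); the function name is illustrative. The inputs are the TOTEX2 values at the 30% NRR, in the order OMI, HDI3, DRI3, SRI3.

```python
def average_ranks(scores):
    """Rank scores in ascending order (1 = smallest); ties share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the tied positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# TOTEX2, 30% NRR, in the order OMI, HDI3, DRI3, SRI3
abs_bias = [222.76, 606.37, 2266.38, 398.39]
pcd = [0.0, 0.0, 100.0, 100.0]

print(average_ranks(abs_bias))             # smaller absolute bias ranks better
print(average_ranks([-p for p in pcd]))    # larger PCD ranks better, so negate
```

With these inputs the sketch gives ranks of 1, 3, 4, 2 for the bias and 3.5, 3.5, 1.5, 1.5 for the PCD, which agree with the corresponding TOTEX2 rows for the 30% NRR in the ranking tables.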
Tables 14, 15 and 16 show the rankings of the different imputation methods for the
10%, 20% and 30% NRR, respectively. Each table is divided into six columns: the first
column is the VI, the second is the criterion (N.B. denotes nonresponse bias), and the
third through sixth columns are the imputation methods.
Table 14
Ranking of the different imputation methods: 10% NRR
                            IMPUTATION METHODS
VI       CRITERIA           OMI3    HDI3    DRI3    SRI3
TOTEX2   N.B.               3       1       4       2
         PCD                4       1.3     1.3     1.3
         MD                 3       1       4       2
         MAD                3       4       1       2
         RMSD               4       3       1       2
         TOTAL              17      10.3    11.3    9.3
         Category Rank      4th     2nd     3rd     1st
TOTIN2   N.B.               1       2       4       3
         PCD                4       1.3     1.3     1.3
         MD                 1       2       4       3
         MAD                3       4       1       2
         RMSD               3       4       1       2
         TOTAL              12      13.3    11.3    11.3
         Category Rank      3rd     4th     1st     1st
Table 15
Ranking of the different imputation methods: 20% NRR
                            IMPUTATION METHODS
VI       CRITERIA           OMI3    HDI3    DRI3    SRI3
TOTEX2   N.B.               2       1       4       3
         PCD                4       3       1       2
         MD                 2       1       4       3
         MAD                3       4       1       2
         RMSD               4       2       1       3
         TOTAL              15      11      11      13
         Category Rank      4th     1st     1st     3rd
TOTIN2   N.B.               3       4       2       1
         PCD                4       1.3     1.3     1.3
         MD                 3       4       2       1
         MAD                3       4       1       2
         RMSD               3       4       1       2
         TOTAL              16      17.3    7.3     7.3
Table 16
Ranking of the different imputation methods: 30% NRR

                          IMPUTATION METHODS
VI       CRITERIA         OMI3    HDI3    DRI3    SRI3
TOTEX2   N.B.             1       3       4       2
         PCD              3.5     3.5     1.5     1.5
         MD               1       3       4       2
         MAD              3       4       1       2
         RMSD             4       2       1       3
         TOTAL            12.5    15.5    11.5    10.5
         Category Rank    3rd     4th     2nd     1st
TOTIN2   N.B.             3       4       2       1
         PCD              4       3       1.5     1.5
         MD               3       4       2       1
         MAD              3       4       1       2
         RMSD             3       4       1       2
         TOTAL            16      19      7.5     7.5

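As a check on how the TOTAL and Category Rank rows follow from the criterion ranks, the short sketch below sums the TOTEX2 ranks reported for the 10% NRR in Table 14 and orders the methods by rank total. The helper names `rank_totals` and `category_ranks` are hypothetical. The placement rule (tied totals share the same place, and the next place is skipped) is an assumption inferred from the “1st 1st” entries in the tables.

```python
METHODS = ["OMI3", "HDI3", "DRI3", "SRI3"]

# Criterion ranks for TOTEX2 at the 10% NRR, copied from Table 14.
table14_totex2 = {
    "N.B.": [3, 1, 4, 2],
    "PCD": [4, 1.3, 1.3, 1.3],
    "MD": [3, 1, 4, 2],
    "MAD": [3, 4, 1, 2],
    "RMSD": [4, 3, 1, 2],
}


def rank_totals(ranks_by_criterion):
    """Sum the criterion ranks for each method (rounded to one decimal place)."""
    totals = [0.0] * len(METHODS)
    for ranks in ranks_by_criterion.values():
        for i, rank in enumerate(ranks):
            totals[i] += rank
    return [round(total, 1) for total in totals]


def category_ranks(totals):
    """Place = 1 + number of methods with a strictly smaller total (ties share a place)."""
    return [1 + sum(other < total for other in totals) for total in totals]


totals = rank_totals(table14_totex2)
print(dict(zip(METHODS, totals)))
# {'OMI3': 17.0, 'HDI3': 10.3, 'DRI3': 11.3, 'SRI3': 9.3}
print(dict(zip(METHODS, category_ranks(totals))))
# {'OMI3': 4, 'HDI3': 2, 'DRI3': 3, 'SRI3': 1}
```

Applied to the TOTIN2 totals of the same table (12, 13.3, 11.3, 11.3), `category_ranks` gives places 3, 4, 1 and 1, which agrees with the reported Category Rank row.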
The rankings show that the two regression imputation methods (RIMs) generally
provided better results than their model-free counterparts. For every nonresponse rate
under the TOTIN2 variable, the two RIMs tied as the best imputation method and,
surprisingly, HDI3 finished as the worst imputation method, behind OMI. Under the
TOTEX2 variable, the rankings were mixed across the nonresponse rates, but the RIMs
still performed well. The SRI3 method finished first at the 10% and 30% NRR and
third at the 20% NRR, while the DRI3 method finished third, first (tied with HDI3) and
second at the 10%, 20% and 30% NRR, respectively. While HDI3 was the worst IM for
TOTIN2, OMI was the worst IM for TOTEX2, ranking last at both the 10% and 20% NRR
and third at the 30% NRR.
To conclude, the best imputation method for this study, which used the 1997 FIES data,
is stochastic regression imputation with three imputation classes (SRI3). It is very
closely followed by deterministic regression imputation with three imputation classes
(DRI3). The SRI3 method never ranked last in any criterion, NRR or VI, unlike DRI3,
which ranked last in the nonresponse bias and mean deviation criteria. The
researchers selected HDI3 as the worst IM in this study. The HDI3 method fared the
worst for TOTIN2 and in the majority of the results across the criteria, particularly
under the 30% NRR and for the TOTIN2 variable.
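The statements about last-place finishes can be checked directly against the ranks in Tables 14 to 16. The sketch below is only an illustration: the names `RANKS` and `last_place_criteria` are hypothetical, and “last” is taken to mean an outright rank of 4, so a tied rank such as 3.5 is not counted.

```python
METHODS = ("OMI3", "HDI3", "DRI3", "SRI3")
CRITERIA = ("N.B.", "PCD", "MD", "MAD", "RMSD")

# Ranks copied from Tables 14-16: one tuple per criterion, columns in METHODS order.
RANKS = {
    ("10%", "TOTEX2"): [(3, 1, 4, 2), (4, 1.3, 1.3, 1.3), (3, 1, 4, 2), (3, 4, 1, 2), (4, 3, 1, 2)],
    ("10%", "TOTIN2"): [(1, 2, 4, 3), (4, 1.3, 1.3, 1.3), (1, 2, 4, 3), (3, 4, 1, 2), (3, 4, 1, 2)],
    ("20%", "TOTEX2"): [(2, 1, 4, 3), (4, 3, 1, 2), (2, 1, 4, 3), (3, 4, 1, 2), (4, 2, 1, 3)],
    ("20%", "TOTIN2"): [(3, 4, 2, 1), (4, 1.3, 1.3, 1.3), (3, 4, 2, 1), (3, 4, 1, 2), (3, 4, 1, 2)],
    ("30%", "TOTEX2"): [(1, 3, 4, 2), (3.5, 3.5, 1.5, 1.5), (1, 3, 4, 2), (3, 4, 1, 2), (4, 2, 1, 3)],
    ("30%", "TOTIN2"): [(3, 4, 2, 1), (4, 3, 1.5, 1.5), (3, 4, 2, 1), (3, 4, 1, 2), (3, 4, 1, 2)],
}


def last_place_criteria(method):
    """Return the criteria in which `method` received an outright rank of 4 at least once."""
    col = METHODS.index(method)
    return {
        CRITERIA[row_idx]
        for rows in RANKS.values()
        for row_idx, row in enumerate(rows)
        if row[col] == 4
    }


print(last_place_criteria("SRI3"))          # set()
print(sorted(last_place_criteria("DRI3")))  # ['MD', 'N.B.']
```

The empty set for SRI3 and the two criteria returned for DRI3 match the statements in the paragraph above.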
