

Far East Journal of Mathematical Sciences (FJMS)
© 2018 Pushpa Publishing House, Allahabad, India
http://www.pphmj.com
http://dx.doi.org/10.17654/MS103111839
Volume 103, Number 11, 2018, Pages 1839-1849 ISSN: 0972-0871

GOODNESS-OF-FIT OF THE IMPUTATION DATA IN BIPLOT ANALYSIS

Ridho Ananda, Siswadi and Toni Bakhtiar


Department of Mathematics
Bogor Agricultural University
Jl. Raya Dramaga, Bogor 16880
Indonesia

Abstract

A missing value is lacking information on an object, and it inhibits statistical analyses such as biplot analysis. To overcome this, statisticians have developed several methods, among them imputation methods. Several studies have shown that imputation methods outperform other approaches in simulation studies. This paper discusses a method to obtain the goodness-of-fit of the imputation data produced by imputation methods. The method compares the covariance and proximity matrices of the imputation data with those of the initial data using the goodness-of-fit of Procrustes. Four imputation methods are discussed, namely distribution free multiple imputation (DFMI), Gabriel eigen, expectation maximization-singular value decomposition (EM-SVD), and biplot imputation. These methods are used to complete the missing values of the 2016 EPI data. The results show that the goodness-of-fit of Procrustes can be used to determine the goodness-of-fit of the imputation data, and that the values obtained by the four methods are quite similar. Based on its simplicity, biplot imputation is suggested for imputing the missing values of the 2016 EPI data.

Received: November 25, 2017; Accepted: January 29, 2018


Keywords and phrases: goodness-of-fit, proximity matrices, biplot imputation.

1. Introduction

A missing value is lacking information on an object and often arises in research fields such as the social sciences, computation, biology, health, and physics [10]. Missing values may be caused by human error or other factors, and they inhibit statistical analyses such as biplot analysis. To overcome this, statisticians have developed several methods, among them imputation methods.

Imputation methods are processes that complete missing values. They are categorized into deterministic and stochastic imputation methods. Unlike stochastic methods, deterministic imputation methods produce a unique imputation data set. This paper is restricted to deterministic imputation methods. Four deterministic imputation methods are discussed, namely distribution free multiple imputation (DFMI) [9], Gabriel eigen [7], expectation maximization-singular value decomposition (EM-SVD) [11], and biplot imputation [12], all of which have been studied in simulation settings. A recent study compared these imputation methods in a simulation study [1].

The problem that has not been addressed in previous works is measuring the quality of the imputation data, that is, the goodness-of-fit of the imputation data obtained. This paper aims to find a method for obtaining the goodness-of-fit of the imputation data and then uses this method to identify the best imputation method.

2. Material and Methods

2.1. The data


The 2016 Environmental Performance Index (EPI) is a project led by Yale University, Columbia University, the Samuel Family Foundation, the McCall MacBain Foundation, and the World Economic Forum. The project ranks the performance of countries on high-priority environmental issues in two areas: protection of human health and protection of ecosystems [5]. The 2016 EPI data contain 405 missing values, spread over 113 objects or 11 variables, out of 180 objects and 35 variables, and are represented in data matrix form [4].

In this paper, the missing values of the 2016 EPI data are imputed by DFMI, Gabriel eigen, EM-SVD, and biplot imputation, and we then determine how to obtain the goodness-of-fit of the imputation data. Finally, we conclude which imputation method is best.

2.2. Distribution free multiple imputation method


The DFMI method provided by [9] has as its central idea that any $n \times p$ matrix $\mathbf{X}$ can be decomposed by the singular value decomposition (SVD) into the form $\mathbf{X} = \mathbf{U}\mathbf{L}\mathbf{W}'$, where $\mathbf{U} = (u_{ij})$, $\mathbf{W} = (w_{ij})$ and $\mathbf{L} = \operatorname{diag}(l_1, l_2, \ldots, l_r)$. Conversely, from the elements of $\mathbf{U}$, $\mathbf{L}$ and $\mathbf{W}$ we can recover every element of $\mathbf{X}$ as $x_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$ for all $i, j$. If $x_{ij}$ is a missing value of the $n \times p$ data matrix, then it can be estimated by $\hat{x}_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$, in which $l_k$, $u_{ik}$ and $w_{jk}$ must be estimated from the remaining data. In the first step, we form the matrices $\mathbf{X}_{(-i)}$ and $\mathbf{X}_{(-j)}$, where $\mathbf{X}_{(-i)}$ is obtained by deleting the $i$th row of $\mathbf{X}$ and $\mathbf{X}_{(-j)}$ by deleting the $j$th column of $\mathbf{X}$. In the next step, we compute the SVDs $\mathbf{X}_{(-i)} = \mathbf{A}\mathbf{B}\mathbf{C}'$ and $\mathbf{X}_{(-j)} = \mathbf{D}\mathbf{E}\mathbf{F}'$, where $\mathbf{A} = (a_{ij})$, $\mathbf{C} = (c_{ij})$, $\mathbf{B} = \operatorname{diag}(b_1, b_2, \ldots, b_{r_1})$, $\mathbf{D} = (d_{ij})$, $\mathbf{F} = (f_{ij})$ and $\mathbf{E} = \operatorname{diag}(e_1, e_2, \ldots, e_{r_2})$. By choosing $u_{ik} = d_{ik}$, $w_{jk} = c_{jk}$, $l_k = \sqrt{b_k e_k}$ and $r = \min(r_1, r_2)$, we obtain the imputation value $\hat{x}_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$. If there are several missing values, then at the beginning they are all imputed by their respective column means, providing a complete matrix; we then replace each imputation value separately using the DFMI method.
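The single-cell step above can be sketched in Python with NumPy. This is an illustration only, not the reference implementation of [9]: the function name `dfmi_single` is ours, we take $l_k = \sqrt{b_k e_k}$, and the sign-fixing heuristic is an assumption (the SVD determines each singular vector only up to sign).

```python
import numpy as np

def dfmi_single(X, i, j):
    """Estimate cell (i, j) of X from the rest of the matrix (DFMI sketch).

    Combines the SVDs X(-i) = A B C' (ith row deleted) and X(-j) = D E F'
    (jth column deleted), taking u_ik = d_ik, w_jk = c_jk and
    l_k = sqrt(b_k * e_k)."""
    Xi = np.delete(X, i, axis=0)                 # X(-i): delete the ith row
    Xj = np.delete(X, j, axis=1)                 # X(-j): delete the jth column
    A, b, Ct = np.linalg.svd(Xi, full_matrices=False)
    D, e, Ft = np.linalg.svd(Xj, full_matrices=False)
    r = min(len(b), len(e))
    # Crude sign fix (our heuristic): make the largest-magnitude entry of
    # each used singular vector positive, since the two SVDs are computed
    # independently and signs would otherwise be uncoupled.
    for k in range(r):
        if D[np.argmax(np.abs(D[:, k])), k] < 0:
            D[:, k] = -D[:, k]
        if Ct[k, np.argmax(np.abs(Ct[k, :]))] < 0:
            Ct[k, :] = -Ct[k, :]
    l = np.sqrt(b[:r] * e[:r])                   # l_k = sqrt(b_k e_k)
    return float(np.sum(l * D[i, :r] * Ct[:r, j]))
```

On a rank-one matrix the estimate lands close to (though not exactly at) the deleted cell, which makes a convenient sanity check.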

2.3. Gabriel eigen method


The Gabriel eigen method provided by [7] combines regression and lower-rank approximation to find the imputation value in any data set that can be arranged in matrix form. If $x_{ij}$ is a missing value of the $n \times p$ data matrix, then we partition the matrix as in (1),

$$\begin{pmatrix} x_{ij} & \mathbf{x}_{i.}' \\ \mathbf{x}_{.j} & \mathbf{X}_{(-i,-j)} \end{pmatrix}, \qquad (1)$$

where $x_{ij}$ is the missing value, $\mathbf{x}_{i.}$ is the $i$th row of $\mathbf{X}$ with $x_{ij}$ deleted, $\mathbf{x}_{.j}$ is the $j$th column of $\mathbf{X}$ with $x_{ij}$ deleted, and $\mathbf{X}_{(-i,-j)}$ is obtained from $\mathbf{X}$ by deleting the $i$th row and the $j$th column. From (1) we form the multiple regression model $\mathbf{x}_{.j} = \mathbf{X}_{(-i,-j)}\boldsymbol{\beta} + \boldsymbol{\varepsilon}_{.j}$ and minimize $\lVert \mathbf{x}_{.j} - \mathbf{X}_{(-i,-j)}\boldsymbol{\beta} \rVert$. Assuming that $\mathbf{X}_{(-i,-j)}$ has full column rank, we obtain $\hat{\boldsymbol{\beta}} = (\mathbf{X}_{(-i,-j)}'\mathbf{X}_{(-i,-j)})^{-1}\mathbf{X}_{(-i,-j)}'\mathbf{x}_{.j}$. Substituting the SVD $\mathbf{X}_{(-i,-j)} = \mathbf{U}\mathbf{L}\mathbf{W}'$ into $\hat{\boldsymbol{\beta}}$ gives $\hat{\boldsymbol{\beta}} = \mathbf{W}\mathbf{L}^{-1}\mathbf{U}'\mathbf{x}_{.j}$. Finally, we estimate $x_{ij}$ by the regression model $\hat{x}_{ij} = \mathbf{x}_{i.}'\hat{\boldsymbol{\beta}} = \mathbf{x}_{i.}'\mathbf{W}\mathbf{L}^{-1}\mathbf{U}'\mathbf{x}_{.j}$, where $\hat{x}_{ij}$ is the imputation value. If there are several missing values, then at the beginning they are all imputed by their respective column means, providing a complete matrix; we then replace each imputation value separately using the Gabriel eigen method.
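A hedged sketch of the single-cell Gabriel eigen estimate in Python/NumPy. The name `gabriel_eigen_single` is ours, and numerically zero singular values are dropped, which relaxes the full-column-rank assumption to a pseudo-inverse:

```python
import numpy as np

def gabriel_eigen_single(X, i, j):
    """Estimate cell (i, j) of X via the regression/SVD formula
    xhat_ij = x_i.' W L^{-1} U' x_.j, where X(-i,-j) = U L W'."""
    xi_row = np.delete(X[i, :], j)    # x_i. : row i without column j
    x_col = np.delete(X[:, j], i)     # x_.j : column j without row i
    Xsub = np.delete(np.delete(X, i, axis=0), j, axis=1)
    U, l, Wt = np.linalg.svd(Xsub, full_matrices=False)
    keep = l > 1e-10 * l[0]           # drop numerically zero singular values
    return float(xi_row @ Wt[keep].T @ np.diag(1.0 / l[keep]) @ U[:, keep].T @ x_col)
```

For an exactly rank-one matrix the estimate reproduces the deleted cell exactly, which is a convenient sanity check.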

2.4. Expectation maximization-singular value decomposition method


The EM-SVD method provided by [11] combines the EM algorithm with the SVD. Suppose there are missing values in $\mathbf{X}$. In the first step, they are imputed by their respective column means, providing a complete matrix $\mathbf{X}^{(0)}$. In the maximization step, we compute the SVD of $\mathbf{X}^{(0)}$, that is, $\mathbf{X}^{(0)} = \mathbf{U}\mathbf{L}\mathbf{W}' = \sum_{k=1}^{r} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$, and then approximate $\mathbf{X}^{(0)}$ by $\hat{\mathbf{X}}^{(0)} = \sum_{k=1}^{s} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$ with $s \le r$ chosen such that $\sum_{i=1}^{s} l_i \big/ \sum_{i=1}^{r} l_i \ge 0.75$. Afterwards, in the expectation step, we replace the imputation values in $\mathbf{X}^{(0)}$ by the corresponding elements of $\hat{\mathbf{X}}^{(0)}$, obtaining the second complete matrix $\mathbf{X}^{(1)}$. The process is iterated until the relative difference of the residual sum of squares (RSS) between the non-missing values of $\mathbf{X}$ and the rank-$s$ SVD approximation is small (usually $1 \times 10^{-4}$ or less).
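The iteration above can be sketched as follows. This is an illustration under our reading of [11], with the function name `em_svd_impute` and the stopping details as assumptions:

```python
import numpy as np

def em_svd_impute(X, ratio=0.75, tol=1e-4, max_iter=500):
    """EM-SVD sketch: start from column-mean imputation, then alternate a
    rank-s SVD approximation (s chosen so that the leading singular values
    carry at least `ratio` of their total sum) with re-imputation of the
    missing cells, until the RSS over the observed cells stabilizes."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0)[np.newaxis, :], X)
    prev_rss = None
    for _ in range(max_iter):
        U, l, Wt = np.linalg.svd(Xc, full_matrices=False)   # maximization step
        s = int(np.searchsorted(np.cumsum(l) / l.sum(), ratio)) + 1
        Xhat = (U[:, :s] * l[:s]) @ Wt[:s, :]               # rank-s approximation
        Xc[miss] = Xhat[miss]                               # expectation step
        rss = np.sum((X[~miss] - Xhat[~miss]) ** 2)         # RSS on observed cells
        if prev_rss is not None and abs(prev_rss - rss) <= tol * max(prev_rss, 1e-12):
            break
        prev_rss = rss
    return Xc
```

Observed cells are never overwritten; only the missing cells are refined from iteration to iteration.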

2.5. Biplot imputation method


The biplot imputation method provided by [12] is based on the biplot analysis of [6]. If there are missing values in $\mathbf{X}$, then in the first step they are imputed by their respective column means, providing a complete matrix $\mathbf{X}^{(0)}$. In the next step, we compute the SVD of $\mathbf{X}^{(0)}$, that is, $\mathbf{X}^{(0)} = \mathbf{U}\mathbf{L}\mathbf{W}' = \sum_{k=1}^{r} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$, and approximate $\mathbf{X}^{(0)}$ by $\hat{\mathbf{X}}^{(0)} = \sum_{k=1}^{s} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$ with $s = 2$ or $s = 3$. We replace the imputation values in $\mathbf{X}^{(0)}$ by the corresponding elements of $\hat{\mathbf{X}}^{(0)}$, obtaining the second complete matrix $\mathbf{X}^{(1)}$. The process is iterated until the convergence criterion $d/\bar{x} < 0.01$ is reached, with

$$d = \left[ \frac{1}{n_a} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij}^{(n)} - x_{ij}^{(n-1)} \right)^2 \right]^{0.5} \quad \text{and} \quad \bar{x} = \left[ \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2 \right]^{0.5}. \qquad (2)$$

In (2), $n_a$ is the total number of missing values in the matrix $\mathbf{X}$, $x_{ij}^{(n)}$ is an element of $\mathbf{X}^{(n)}$ in the current iteration, $x_{ij}^{(n-1)}$ is an element of $\mathbf{X}^{(n-1)}$ in the previous iteration (the sum in $d$ running over the imputed cells), $x_{ij}$ in $\bar{x}$ is an observed (non-missing) value in the $i$th row and $j$th column, and $N$ is the total number of observed values.

2.6. The goodness-of-fit of the imputation data


The main problem that may arise in imputation data is distortion of the correlations among variables, because imputation values are only approximations of the unknown missing values [3]. We will likewise find distortion in the dissimilarity measures among objects. Small distortions mean that the imputation data give a good approximation of the correlations among variables and of the dissimilarities among objects in the initial data.

To quantify this approximation, we need the covariance and the proximity matrices, which represent the correlations among variables and the dissimilarity measures among objects, respectively. We use the formulas provided by [8] to obtain the covariance and proximity matrices of the initial, incomplete data. Suppose $\mathbf{S} = (s_{ij})$ is the covariance matrix of the initial data; its elements are computed by (3),

$$s_{ij} = s_{ji} = \frac{1}{\sum_{k=1}^{n} w_{ijk} - 1} \sum_{k=1}^{n} (y_{kj} - \bar{y}_j)(y_{ki} - \bar{y}_i)\, w_{ijk}, \quad \forall i, j, \qquad (3)$$

where $s_{ij}$ is the covariance between the $i$th and $j$th variables, $n$ is the total number of objects, $y_{kj}$ is the value of the $j$th variable on the $k$th object, $\bar{y}_j$ is the mean of the non-missing elements of the $j$th variable, and $w_{ijk}$ is a weight equal to 0 if $y_{kj}$ or $y_{ki}$ is missing and 1 otherwise. Suppose $\mathbf{D} = (d_{ij})$ is the Euclidean distance matrix used as the proximity matrix of the initial data; its elements are computed by (4),

$$d_{ij} = d_{ji} = \sqrt{\frac{p \sum_{s=1}^{p} (x_{is} - x_{js})^2 m_{ijs}}{\sum_{s=1}^{p} m_{ijs}}}, \quad \forall i, j, \qquad (4)$$

where $d_{ij}$ is the Euclidean distance between the $i$th and $j$th objects, $p$ is the total number of variables, $x_{is}$ is the value of the $i$th object on the $s$th variable, and $m_{ijs}$ is a weight equal to 0 if $x_{is}$ or $x_{js}$ is missing and 1 otherwise.
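Formulas (3) and (4) can be sketched as follows (function names are ours; for complete data they reduce to the usual covariance matrix and Euclidean distances):

```python
import numpy as np

def pairwise_cov(Y):
    """Covariance matrix of incomplete data in the spirit of Eq. (3): each
    s_ij is computed over the objects where both variables are observed."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    means = np.nanmean(Y, axis=0)          # per-variable mean over observed values
    S = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            w = ~np.isnan(Y[:, i]) & ~np.isnan(Y[:, j])   # weights w_ijk
            dev = (Y[w, i] - means[i]) * (Y[w, j] - means[j])
            S[i, j] = S[j, i] = dev.sum() / (w.sum() - 1)
    return S

def pairwise_euclid(X):
    """Euclidean distances of incomplete data in the spirit of Eq. (4):
    squared differences over jointly observed variables, rescaled by
    p / (number of jointly observed variables)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m = ~np.isnan(X[i]) & ~np.isnan(X[j])          # weights m_ijs
            ss = np.sum((X[i, m] - X[j, m]) ** 2)
            D[i, j] = D[j, i] = np.sqrt(p * ss / m.sum())
    return D
```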

For the imputation data, the covariance and proximity matrices are obtained by means of the biplot analysis of [6]. Suppose $\mathbf{S}^{(i)}$ is the covariance matrix and $\mathbf{D}^{(i)}$ the proximity matrix of the $i$th imputation data set $\mathbf{X}^{(i)}$. In the first step, we decompose $\mathbf{X}^{(i)} = \mathbf{U}\mathbf{L}\mathbf{W}'$ by the SVD and let $\mathbf{G} = \mathbf{U}\mathbf{L}^{\alpha}$ and $\mathbf{H} = \mathbf{W}\mathbf{L}^{1-\alpha}$, so that $\mathbf{X}^{(i)} = \mathbf{G}\mathbf{H}'$. Then $\mathbf{S}^{(i)}$ is obtained from $\mathbf{H}\mathbf{H}'$ (by choosing $\alpha = 0$), because $\mathbf{H}\mathbf{H}'$ is proportional to the covariance matrix of the data, and $\mathbf{D}^{(i)}$ is obtained from the Euclidean distances between the rows of $\mathbf{G}$ (by choosing $\alpha = 1$), because $(\mathbf{g}_h - \mathbf{g}_i)'(\mathbf{g}_h - \mathbf{g}_i)$ equals $(\mathbf{x}_h - \mathbf{x}_i)'(\mathbf{x}_h - \mathbf{x}_i)$ for all $h, i$, the squared Euclidean distance in the data.
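The factorization $\mathbf{X}^{(i)} = \mathbf{G}\mathbf{H}'$ can be sketched as below (a minimal illustration; the name `biplot_factors` is ours):

```python
import numpy as np

def biplot_factors(X, alpha):
    """Biplot factorization X = G H' with G = U L^alpha, H = W L^(1-alpha).

    alpha = 0: H H' equals X'X (proportional to the covariance structure);
    alpha = 1: row distances of G equal row distances of X."""
    U, l, Wt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    G = U * l**alpha          # scale left singular vectors by L^alpha
    H = Wt.T * l**(1 - alpha) # scale right singular vectors by L^(1-alpha)
    return G, H
```

With `alpha=1` the factor `G` preserves inter-object Euclidean distances exactly; with `alpha=0` the product `H @ H.T` reproduces `X.T @ X`.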


Because the covariance and proximity matrices are both in matrix form, we can use the goodness-of-fit of Procrustes provided by [2]. The approximation of the covariance matrix of the imputation data to that of the initial data is measured by (5):

$$\gamma\!\left(\mathbf{S}^{(i)}, \mathbf{S}\right) = \left( \sum_{k=1}^{r} \sigma_{k} \right)^{2}, \qquad (5)$$

where $\mathbf{S}^{(i)}$ and $\mathbf{S}$ are the covariance matrices of the imputation data and the initial data, respectively, and $r$ and $\sigma_k$ $(k = 1, 2, \ldots, r)$ are the rank and the singular values of $\mathbf{S}^{(i)\prime}_T \mathbf{S}_T$ (or $\mathbf{S}_T' \mathbf{S}^{(i)}_T$), with $\mathbf{S}_T$ and $\mathbf{S}^{(i)}_T$ denoting $\mathbf{S}$ and $\mathbf{S}^{(i)}$ after the translation-normalization procedure. The measure lies in the interval $[0, 1]$: if $\gamma(\mathbf{S}^{(i)}, \mathbf{S}) \approx 1$, then $\mathbf{S}^{(i)}$ gives a good approximation of the correlations among variables in the initial data; conversely, if $\gamma(\mathbf{S}^{(i)}, \mathbf{S}) \approx 0$, then the approximation is poor. Hence $\gamma(\mathbf{S}^{(i)}, \mathbf{S})$ can be used as the goodness-of-fit of the covariance matrix. We likewise compute $\gamma(\mathbf{D}^{(i)}, \mathbf{D})$, the goodness-of-fit of the proximity matrix of the imputation data.
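A sketch of measure (5), under the assumption that the translation-normalization procedure of [2] amounts to column-mean centering followed by scaling to unit Frobenius norm (our reading; the details are in [2]):

```python
import numpy as np

def procrustes_gof(A, B):
    """Goodness-of-fit sketch in the spirit of Eq. (5): center each matrix
    (translation), scale to unit Frobenius norm (normalization), then return
    the squared sum of singular values of B_T' A_T, a value in [0, 1]."""
    def tn(M):
        M = np.asarray(M, dtype=float)
        M = M - M.mean(axis=0)          # translation: remove column means
        return M / np.linalg.norm(M)    # normalization: unit Frobenius norm
    At, Bt = tn(A), tn(B)
    sigma = np.linalg.svd(Bt.T @ At, compute_uv=False)
    return float(sigma.sum() ** 2)
```

The measure equals 1 when the two configurations coincide, and it is invariant under a common rotation, as a Procrustes-type measure should be.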

3. Results and Discussion

Table 1 shows that the goodness-of-fit values of the covariance matrices obtained by (5) over the first $n$ principal components are all above 0.83, meaning that the covariance matrices give a good approximation of the correlations among variables in the 2016 EPI data. Table 2 shows that the goodness-of-fit values of the proximity matrices are all above 0.81, meaning that the proximity matrices give a good approximation of the dissimilarity measures among objects in the 2016 EPI data. Figure 1 visualizes the increasing goodness-of-fit values of Tables 1 and 2 as functions of the number of principal components.

Table 1. The goodness-of-fit of the covariance matrices from the first n principal components

  n    DFMI    Gabriel eigen   EM-SVD   Biplot imputation (s = 3)   Biplot imputation (s = 2)
  2    0.835   0.843           0.834    0.838                       0.837
  3    0.893   0.900           0.888    0.895                       0.893
  4    0.936   0.945           0.933    0.941                       0.937
  ...  ...     ...             ...      ...                         ...
  32   0.990   0.990           0.987    0.983                       0.983

Table 2. The goodness-of-fit of the proximity matrices from the first n principal components

  n    DFMI    Gabriel eigen   EM-SVD   Biplot imputation (s = 3)   Biplot imputation (s = 2)
  2    0.832   0.828           0.835    0.822                       0.818
  3    0.891   0.893           0.894    0.885                       0.884
  4    0.943   0.943           0.938    0.937                       0.935
  ...  ...     ...             ...      ...                         ...
  32   0.987   0.994           0.986    0.985                       0.982

The results in Tables 1 and 2 show that the goodness-of-fit values of the imputation data from the four imputation methods are quite similar in the first two principal components. Based on its simplicity, suppose we choose the result of biplot imputation with $s = 2$; we then obtain the two-dimensional representation given in Figure 2.

Figure 1. Visualization of (a) the goodness-of-fit of the covariance matrix and (b) the goodness-of-fit of the proximity matrix.

Figure 2 shows the objects plotted as points and the variables plotted as lines. An interesting property of the biplot with $\alpha = 0$ is that the lengths of the lines are proportional to the standard deviations of the variables, and the cosines of the angles between two lines represent the correlations between the corresponding variables in the 2016 EPI data. The visualization is satisfactory because the goodness-of-fit of the covariance matrix in the first two principal components is 0.837; that is, the first two principal components account for 83.7% of the total information on the correlations among variables in the 2016 EPI data, so the two-dimensional representation is a reasonably faithful representation of those correlations. In this biplot, the Euclidean distance between two points is proportional to the Mahalanobis distance between the corresponding objects in the 2016 EPI data. We cannot use the Mahalanobis distance as the proximity matrix here, however, because the covariance matrix of the 2016 EPI data is only positive semidefinite (singular), so the inverse that the Mahalanobis distance requires does not exist.

In the biplot with $\alpha = 1$, the properties relating to lines and points are different from those for $\alpha = 0$. With $\alpha = 1$, the Euclidean distance between two points in the biplot equals the Euclidean distance between the corresponding objects in the 2016 EPI data. The visualization of the objects is satisfactory because the goodness-of-fit of the proximity matrix in the first two principal components is 0.818; that is, the first two principal components account for 81.8% of the total information on the dissimilarity measures among objects in the 2016 EPI data, so the two-dimensional representation is reasonably faithful.

Figure 2. Biplot from the biplot imputation result with (a)   0 and
(b)   1.

4. Conclusions

In this paper, we have discussed a method to obtain the goodness-of-fit of the imputation data. The results show that the goodness-of-fit of the imputation data can be obtained from the goodness-of-fit of the covariance and proximity matrices. Based on its simplicity, biplot imputation is suggested for imputing the missing values of the 2016 EPI data.

References

[1] S. Arciniegas-Alarcon, M. Garcia-Pena, C. T. S. Dias and W. J. Krzanowski, Imputing missing values in multi-environment trials using the singular value decomposition: an empirical comparison, Commun. Biometry Crop Sci. 9(2) (2014), 54-70.
[2] T. Bakhtiar and Siswadi, On the symmetrical property of Procrustes measure of
distance, Int. J. Pure Appl. Math. 99(3) (2015), 315-324.
[3] A. L. Bello, Choosing among imputation techniques for incomplete multivariate
data: a simulation study, Comm. Statist. Theory Methods 22(3) (1993), 853-877.
[4] Environmental Performance Index, Framework and Indicator Scores, 2016. http://epi.yale.edu/sites/default/files/2016_epi_framework_indicator_scores_friendly.xls.
[5] Environmental Performance Index, Global Metrics for the Environment, 2016. http://epi.yale.edu/sites/default/files/2016EPI_Full_Report_opt.pdf.
[6] K. R. Gabriel, The biplot graphic display of matrices with application to principal
component analysis, Biometrika 58(3) (1971), 453-468.
[7] K. R. Gabriel, Le biplot-outil d’exploration de donnees multidimensionnelles,
Journal de la Societe Francaise de Statistique 143(4) (2002), 5-55.
[8] J. C. Gower, A general coefficient of similarity and some of its properties,
Biometrics 27(4) (1971), 857-871.
[9] W. J. Krzanowski, Cross-validation in principal component analysis, Biometrics
43(3) (1987), 575-584.
[10] X. L. Meng, Missing data: Dial M for ???, J. Amer. Statist. Assoc.
95(452) (2000), 1325-1330.
[11] P. O. Perry, Cross-validation for unsupervised learning, Department of Statistics, Stanford University, 2009.
[12] W. Yan, Biplot analysis of incomplete two-way data, Crop Sci. 53(1) (2013),
48-57.

Ridho Ananda: ananda.ridmate@gmail.com

Siswadi: siswadimathipb@gmail.com

Toni Bakhtiar: tbakhtiar@ipb.ac.id
