

Far East Journal of Mathematical Sciences (FJMS)
© 2018 Pushpa Publishing House, Allahabad, India
http://www.pphmj.com
http://dx.doi.org/10.17654/MS103111839
Volume 103, Number 11, 2018, Pages 1839-1849 ISSN: 0972-0871

GOODNESS-OF-FIT OF THE IMPUTATION DATA IN BIPLOT ANALYSIS

Ridho Ananda, Siswadi and Toni Bakhtiar


Department of Mathematics
Bogor Agricultural University
Jl. Raya Dramaga, Bogor 16880
Indonesia

Abstract

A missing value is lacking information on an object, and it inhibits statistical analyses such as biplot analysis. To overcome this, statisticians have developed several methods, among them imputation methods. Several studies have shown that imputation methods outperform other approaches in simulation studies. This paper discusses a method to obtain the goodness-of-fit of the imputation data produced by imputation methods. The method compares the covariance and proximity matrices of the imputation data with those of the initial data using the goodness-of-fit of Procrustes. Four imputation methods are discussed, namely distribution free multiple imputation (DFMI), Gabriel eigen, expectation maximization-singular value decomposition (EM-SVD), and biplot imputation. These methods are used to complete the missing values of the 2016 EPI data. The results show that the goodness-of-fit of Procrustes can be used to determine the goodness-of-fit of the imputation data, and that the values obtained by the four methods are quite similar. Based on its simplicity, biplot imputation is suggested for imputing the missing values of the 2016 EPI data.

Received: November 25, 2017; Accepted: January 29, 2018


Keywords and phrases: goodness-of-fit, proximity matrices, biplot imputation.

1. Introduction

A missing value is lacking information on an object and often arises in research fields such as the social sciences, computation, biology, health, and physics [10]. Missing values may be caused by human error or other factors, and they inhibit statistical analyses such as biplot analysis. To overcome this, statisticians have developed several methods, among them imputation methods.

Imputation methods are processes that complete missing values. They are categorized into deterministic and stochastic imputation methods. Unlike stochastic methods, deterministic imputation methods produce a unique imputation data set. This paper is restricted to deterministic imputation methods. Four deterministic imputation methods are discussed, namely distribution free multiple imputation (DFMI) [9], Gabriel eigen [7], expectation maximization-singular value decomposition (EM-SVD) [11], and biplot imputation [12], all of which have been studied in simulation settings. A recent study compared these imputation methods in a simulation study [1].

The problem that has not been addressed in previous works is measuring the quality of the imputation data, that is, the goodness-of-fit of the imputation data obtained. This paper aims to find a method for obtaining the goodness-of-fit of the imputation data and then uses this method to identify the best imputation method.

2. Material and Methods

2.1. The data


The 2016 Environmental Performance Index (EPI) is a project led by Yale University, Columbia University, the Samuel Family Foundation, the McCall MacBain Foundation, and the World Economic Forum. The project ranks the performance of countries on high-priority environmental issues in two areas: protection of human health and protection of ecosystems [5]. The 2016 EPI data contain 405 missing values, spread over 113 objects or 11 variables, out of 180 objects and 35 variables, and are represented in data matrix form [4].

In this paper, the missing values of the 2016 EPI data are imputed by DFMI, Gabriel eigen, EM-SVD, and biplot imputation, and we then determine how to obtain the goodness-of-fit of the imputation data. Finally, we conclude which imputation method is best.

2.2. Distribution free multiple imputation method


The DFMI method provided by [9] has as its central idea that any $n \times p$ matrix $\mathbf{X}$ can be decomposed by the singular value decomposition (SVD) into the form $\mathbf{X} = \mathbf{U}\mathbf{L}\mathbf{W}'$, where $\mathbf{U} = (u_{ij})$, $\mathbf{W} = (w_{ij})$ and $\mathbf{L} = \operatorname{diag}(l_1, l_2, \ldots, l_r)$. Conversely, from the elements of $\mathbf{U}$, $\mathbf{L}$ and $\mathbf{W}$ we can recover every element of $\mathbf{X}$ as $x_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$ for all $i, j$. If $x_{ij}$ is a missing value of the $n \times p$ data matrix, then it can be estimated by $\hat{x}_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$, in which $l_k$, $u_{ik}$ and $w_{jk}$ must be estimated from the remaining data. In the first step, we form the matrices $\mathbf{X}_{(-i)}$ and $\mathbf{X}_{(-j)}$, where $\mathbf{X}_{(-i)}$ is obtained by deleting the $i$th row of $\mathbf{X}$ and $\mathbf{X}_{(-j)}$ by deleting the $j$th column of $\mathbf{X}$. In the next step, we compute the SVDs $\mathbf{X}_{(-i)} = \mathbf{A}\mathbf{B}\mathbf{C}'$ and $\mathbf{X}_{(-j)} = \mathbf{D}\mathbf{E}\mathbf{F}'$, where $\mathbf{A} = (a_{ij})$, $\mathbf{C} = (c_{ij})$, $\mathbf{B} = \operatorname{diag}(b_1, b_2, \ldots, b_{r_1})$, $\mathbf{D} = (d_{ij})$, $\mathbf{F} = (f_{ij})$ and $\mathbf{E} = \operatorname{diag}(e_1, e_2, \ldots, e_{r_2})$. By choosing $u_{ik} = d_{ik}$, $w_{jk} = c_{jk}$, $l_k = \sqrt{b_k e_k}$ and $r = \min(r_1, r_2)$, we obtain the imputation value $\hat{x}_{ij} = \sum_{k=1}^{r} l_k u_{ik} w_{jk}$. If there are several missing values, then at the beginning they are all imputed by their respective column means, providing a complete matrix; we then replace each imputation value separately using the DFMI method.
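The single-cell step above can be sketched in Python with NumPy. This is an illustration only, not the reference implementation of [9]: the function name `dfmi_single` is ours, we take $l_k = \sqrt{b_k e_k}$, and the sign-fixing heuristic is an assumption (the SVD determines each singular vector only up to sign).

```python
import numpy as np

def dfmi_single(X, i, j):
    """Estimate cell (i, j) of X from the rest of the matrix (DFMI sketch).

    Combines the SVDs X(-i) = A B C' (ith row deleted) and X(-j) = D E F'
    (jth column deleted), taking u_ik = d_ik, w_jk = c_jk and
    l_k = sqrt(b_k * e_k)."""
    Xi = np.delete(X, i, axis=0)                 # X(-i): delete the ith row
    Xj = np.delete(X, j, axis=1)                 # X(-j): delete the jth column
    A, b, Ct = np.linalg.svd(Xi, full_matrices=False)
    D, e, Ft = np.linalg.svd(Xj, full_matrices=False)
    r = min(len(b), len(e))
    # Crude sign fix (our heuristic): make the largest-magnitude entry of
    # each used singular vector positive, since the two SVDs are computed
    # independently and signs would otherwise be uncoupled.
    for k in range(r):
        if D[np.argmax(np.abs(D[:, k])), k] < 0:
            D[:, k] = -D[:, k]
        if Ct[k, np.argmax(np.abs(Ct[k, :]))] < 0:
            Ct[k, :] = -Ct[k, :]
    l = np.sqrt(b[:r] * e[:r])                   # l_k = sqrt(b_k e_k)
    return float(np.sum(l * D[i, :r] * Ct[:r, j]))
```

On a rank-one matrix the estimate lands close to (though not exactly at) the deleted cell, which makes a convenient sanity check.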

2.3. Gabriel eigen method


The Gabriel eigen method provided by [7] combines regression and lower-rank approximation to find the imputation value in any data set that can be arranged in matrix form. If $x_{ij}$ is a missing value of the $n \times p$ data matrix, then we partition the matrix as in (1),

$$\begin{pmatrix} x_{ij} & \mathbf{x}_{i.}' \\ \mathbf{x}_{.j} & \mathbf{X}_{(-i,-j)} \end{pmatrix}, \qquad (1)$$

where $x_{ij}$ is the missing value, $\mathbf{x}_{i.}$ is the $i$th row of $\mathbf{X}$ with $x_{ij}$ deleted, $\mathbf{x}_{.j}$ is the $j$th column of $\mathbf{X}$ with $x_{ij}$ deleted, and $\mathbf{X}_{(-i,-j)}$ is obtained from $\mathbf{X}$ by deleting the $i$th row and the $j$th column. From (1) we form the multiple regression model $\mathbf{x}_{.j} = \mathbf{X}_{(-i,-j)}\boldsymbol{\beta} + \boldsymbol{\varepsilon}_{.j}$ and minimize $\lVert \mathbf{x}_{.j} - \mathbf{X}_{(-i,-j)}\boldsymbol{\beta} \rVert$. Assuming that $\mathbf{X}_{(-i,-j)}$ has full column rank, we obtain $\hat{\boldsymbol{\beta}} = (\mathbf{X}_{(-i,-j)}'\mathbf{X}_{(-i,-j)})^{-1}\mathbf{X}_{(-i,-j)}'\mathbf{x}_{.j}$. Substituting the SVD $\mathbf{X}_{(-i,-j)} = \mathbf{U}\mathbf{L}\mathbf{W}'$ into $\hat{\boldsymbol{\beta}}$ gives $\hat{\boldsymbol{\beta}} = \mathbf{W}\mathbf{L}^{-1}\mathbf{U}'\mathbf{x}_{.j}$. Finally, we estimate $x_{ij}$ by the regression model $\hat{x}_{ij} = \mathbf{x}_{i.}'\hat{\boldsymbol{\beta}} = \mathbf{x}_{i.}'\mathbf{W}\mathbf{L}^{-1}\mathbf{U}'\mathbf{x}_{.j}$, where $\hat{x}_{ij}$ is the imputation value. If there are several missing values, then at the beginning they are all imputed by their respective column means, providing a complete matrix; we then replace each imputation value separately using the Gabriel eigen method.
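A hedged sketch of the single-cell Gabriel eigen estimate in Python/NumPy. The name `gabriel_eigen_single` is ours, and numerically zero singular values are dropped, which relaxes the full-column-rank assumption to a pseudo-inverse:

```python
import numpy as np

def gabriel_eigen_single(X, i, j):
    """Estimate cell (i, j) of X via the regression/SVD formula
    xhat_ij = x_i.' W L^{-1} U' x_.j, where X(-i,-j) = U L W'."""
    xi_row = np.delete(X[i, :], j)    # x_i. : row i without column j
    x_col = np.delete(X[:, j], i)     # x_.j : column j without row i
    Xsub = np.delete(np.delete(X, i, axis=0), j, axis=1)
    U, l, Wt = np.linalg.svd(Xsub, full_matrices=False)
    keep = l > 1e-10 * l[0]           # drop numerically zero singular values
    return float(xi_row @ Wt[keep].T @ np.diag(1.0 / l[keep]) @ U[:, keep].T @ x_col)
```

For an exactly rank-one matrix the estimate reproduces the deleted cell exactly, which is a convenient sanity check.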

2.4. Expectation maximization-singular value decomposition method


The EM-SVD method provided by [11] combines the EM algorithm with the SVD. Suppose there are missing values in $\mathbf{X}$. In the first step, they are imputed by their respective column means, providing a complete matrix $\mathbf{X}^{(0)}$. In the maximization step, we compute the SVD of $\mathbf{X}^{(0)}$, that is, $\mathbf{X}^{(0)} = \mathbf{U}\mathbf{L}\mathbf{W}' = \sum_{k=1}^{r} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$, and then approximate $\mathbf{X}^{(0)}$ by $\hat{\mathbf{X}}^{(0)} = \sum_{k=1}^{s} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$ with $s \le r$ chosen such that $\sum_{i=1}^{s} l_i \big/ \sum_{i=1}^{r} l_i \ge 0.75$. Afterwards, in the expectation step, we replace the imputation values in $\mathbf{X}^{(0)}$ by the corresponding elements of $\hat{\mathbf{X}}^{(0)}$, obtaining the second complete matrix $\mathbf{X}^{(1)}$. The process is iterated until the relative difference of the residual sum of squares (RSS) between the non-missing values of $\mathbf{X}$ and the rank-$s$ SVD approximation is small (usually $1 \times 10^{-4}$ or less).
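The iteration above can be sketched as follows. This is an illustration under our reading of [11], with the function name `em_svd_impute` and the stopping details as assumptions:

```python
import numpy as np

def em_svd_impute(X, ratio=0.75, tol=1e-4, max_iter=500):
    """EM-SVD sketch: start from column-mean imputation, then alternate a
    rank-s SVD approximation (s chosen so that the leading singular values
    carry at least `ratio` of their total sum) with re-imputation of the
    missing cells, until the RSS over the observed cells stabilizes."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0)[np.newaxis, :], X)
    prev_rss = None
    for _ in range(max_iter):
        U, l, Wt = np.linalg.svd(Xc, full_matrices=False)   # maximization step
        s = int(np.searchsorted(np.cumsum(l) / l.sum(), ratio)) + 1
        Xhat = (U[:, :s] * l[:s]) @ Wt[:s, :]               # rank-s approximation
        Xc[miss] = Xhat[miss]                               # expectation step
        rss = np.sum((X[~miss] - Xhat[~miss]) ** 2)         # RSS on observed cells
        if prev_rss is not None and abs(prev_rss - rss) <= tol * max(prev_rss, 1e-12):
            break
        prev_rss = rss
    return Xc
```

Observed cells are never overwritten; only the missing cells are refined from iteration to iteration.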

2.5. Biplot imputation method


The biplot imputation method provided by [12] is based on the biplot analysis of [6]. If there are missing values in $\mathbf{X}$, then in the first step they are imputed by their respective column means, providing a complete matrix $\mathbf{X}^{(0)}$. In the next step, we compute the SVD of $\mathbf{X}^{(0)}$, that is, $\mathbf{X}^{(0)} = \mathbf{U}\mathbf{L}\mathbf{W}' = \sum_{k=1}^{r} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$, and approximate $\mathbf{X}^{(0)}$ by $\hat{\mathbf{X}}^{(0)} = \sum_{k=1}^{s} l_k^{(0)} \mathbf{u}_k^{(0)} \mathbf{w}_k^{(0)\prime}$ with $s = 2$ or $s = 3$. We replace the imputation values in $\mathbf{X}^{(0)}$ by the corresponding elements of $\hat{\mathbf{X}}^{(0)}$, obtaining the second complete matrix $\mathbf{X}^{(1)}$. The process is iterated until the convergence criterion $d/\bar{x} < 0.01$ is reached, with

$$d = \left[ \frac{1}{n_a} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij}^{(n)} - x_{ij}^{(n-1)} \right)^2 \right]^{0.5} \quad \text{and} \quad \bar{x} = \left[ \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2 \right]^{0.5}. \qquad (2)$$

In (2), $n_a$ is the total number of missing values in the matrix $\mathbf{X}$, $x_{ij}^{(n)}$ is an element of $\mathbf{X}^{(n)}$ in the current iteration, $x_{ij}^{(n-1)}$ is an element of $\mathbf{X}^{(n-1)}$ in the previous iteration (the sum in $d$ running over the imputed cells), $x_{ij}$ in $\bar{x}$ is an observed (non-missing) value in the $i$th row and $j$th column, and $N$ is the total number of observed values.

2.6. The goodness-of-fit of the imputation data


The main problem that may arise in imputation data is distortion of the correlations among variables, because imputation values are only approximations of the unknown missing values [3]. We will likewise find distortion in the dissimilarity measures among objects. Small distortions mean that the imputation data give a good approximation of the correlations among variables and of the dissimilarities among objects in the initial data.

To quantify this approximation, we need the covariance and the proximity matrices, which represent the correlations among variables and the dissimilarity measures among objects, respectively. We use the formulas provided by [8] to obtain the covariance and proximity matrices of the initial, incomplete data. Suppose $\mathbf{S} = (s_{ij})$ is the covariance matrix of the initial data; its elements are computed by (3),

$$s_{ij} = s_{ji} = \frac{1}{\sum_{k=1}^{n} w_{ijk} - 1} \sum_{k=1}^{n} (y_{kj} - \bar{y}_j)(y_{ki} - \bar{y}_i)\, w_{ijk}, \quad \forall i, j, \qquad (3)$$

where $s_{ij}$ is the covariance between the $i$th and $j$th variables, $n$ is the total number of objects, $y_{kj}$ is the value of the $j$th variable on the $k$th object, $\bar{y}_j$ is the mean of the non-missing elements of the $j$th variable, and $w_{ijk}$ is a weight equal to 0 if $y_{kj}$ or $y_{ki}$ is missing and 1 otherwise. Suppose $\mathbf{D} = (d_{ij})$ is the Euclidean distance matrix used as the proximity matrix of the initial data; its elements are computed by (4),

$$d_{ij} = d_{ji} = \sqrt{\frac{p \sum_{s=1}^{p} (x_{is} - x_{js})^2 m_{ijs}}{\sum_{s=1}^{p} m_{ijs}}}, \quad \forall i, j, \qquad (4)$$

where $d_{ij}$ is the Euclidean distance between the $i$th and $j$th objects, $p$ is the total number of variables, $x_{is}$ is the value of the $i$th object on the $s$th variable, and $m_{ijs}$ is a weight equal to 0 if $x_{is}$ or $x_{js}$ is missing and 1 otherwise.
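Formulas (3) and (4) can be sketched as follows (function names are ours; for complete data they reduce to the usual covariance matrix and Euclidean distances):

```python
import numpy as np

def pairwise_cov(Y):
    """Covariance matrix of incomplete data in the spirit of Eq. (3): each
    s_ij is computed over the objects where both variables are observed."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    means = np.nanmean(Y, axis=0)          # per-variable mean over observed values
    S = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            w = ~np.isnan(Y[:, i]) & ~np.isnan(Y[:, j])   # weights w_ijk
            dev = (Y[w, i] - means[i]) * (Y[w, j] - means[j])
            S[i, j] = S[j, i] = dev.sum() / (w.sum() - 1)
    return S

def pairwise_euclid(X):
    """Euclidean distances of incomplete data in the spirit of Eq. (4):
    squared differences over jointly observed variables, rescaled by
    p / (number of jointly observed variables)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m = ~np.isnan(X[i]) & ~np.isnan(X[j])          # weights m_ijs
            ss = np.sum((X[i, m] - X[j, m]) ** 2)
            D[i, j] = D[j, i] = np.sqrt(p * ss / m.sum())
    return D
```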

For the imputation data, the covariance and proximity matrices are obtained by means of the biplot analysis of [6]. Suppose $\mathbf{S}^{(i)}$ is the covariance matrix and $\mathbf{D}^{(i)}$ the proximity matrix of the $i$th imputation data set $\mathbf{X}^{(i)}$. In the first step, we decompose $\mathbf{X}^{(i)} = \mathbf{U}\mathbf{L}\mathbf{W}'$ by the SVD and let $\mathbf{G} = \mathbf{U}\mathbf{L}^{\alpha}$ and $\mathbf{H} = \mathbf{W}\mathbf{L}^{1-\alpha}$, so that $\mathbf{X}^{(i)} = \mathbf{G}\mathbf{H}'$. Then $\mathbf{S}^{(i)}$ is obtained from $\mathbf{H}\mathbf{H}'$ (by choosing $\alpha = 0$), because $\mathbf{H}\mathbf{H}'$ is proportional to the covariance matrix of the data, and $\mathbf{D}^{(i)}$ is obtained from the Euclidean distances between the rows of $\mathbf{G}$ (by choosing $\alpha = 1$), because $(\mathbf{g}_h - \mathbf{g}_i)'(\mathbf{g}_h - \mathbf{g}_i)$ equals $(\mathbf{x}_h - \mathbf{x}_i)'(\mathbf{x}_h - \mathbf{x}_i)$ for all $h, i$, the squared Euclidean distance in the data.
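The factorization $\mathbf{X}^{(i)} = \mathbf{G}\mathbf{H}'$ can be sketched as below (a minimal illustration; the name `biplot_factors` is ours):

```python
import numpy as np

def biplot_factors(X, alpha):
    """Biplot factorization X = G H' with G = U L^alpha, H = W L^(1-alpha).

    alpha = 0: H H' equals X'X (proportional to the covariance structure);
    alpha = 1: row distances of G equal row distances of X."""
    U, l, Wt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    G = U * l**alpha          # scale left singular vectors by L^alpha
    H = Wt.T * l**(1 - alpha) # scale right singular vectors by L^(1-alpha)
    return G, H
```

With `alpha=1` the factor `G` preserves inter-object Euclidean distances exactly; with `alpha=0` the product `H @ H.T` reproduces `X.T @ X`.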


Because the covariance and proximity matrices are both in matrix form, we can use the goodness-of-fit of Procrustes provided by [2]. The approximation of the covariance matrix of the imputation data to that of the initial data is measured by (5):

$$\gamma\!\left(\mathbf{S}^{(i)}, \mathbf{S}\right) = \left( \sum_{k=1}^{r} \sigma_{k} \right)^{2}, \qquad (5)$$

where $\mathbf{S}^{(i)}$ and $\mathbf{S}$ are the covariance matrices of the imputation data and the initial data, respectively, and $r$ and $\sigma_k$ $(k = 1, 2, \ldots, r)$ are the rank and the singular values of $\mathbf{S}^{(i)\prime}_T \mathbf{S}_T$ (or $\mathbf{S}_T' \mathbf{S}^{(i)}_T$), with $\mathbf{S}_T$ and $\mathbf{S}^{(i)}_T$ denoting $\mathbf{S}$ and $\mathbf{S}^{(i)}$ after the translation-normalization procedure. The measure lies in the interval $[0, 1]$: if $\gamma(\mathbf{S}^{(i)}, \mathbf{S}) \approx 1$, then $\mathbf{S}^{(i)}$ gives a good approximation of the correlations among variables in the initial data; conversely, if $\gamma(\mathbf{S}^{(i)}, \mathbf{S}) \approx 0$, then the approximation is poor. Hence $\gamma(\mathbf{S}^{(i)}, \mathbf{S})$ can be used as the goodness-of-fit of the covariance matrix. We likewise compute $\gamma(\mathbf{D}^{(i)}, \mathbf{D})$, the goodness-of-fit of the proximity matrix of the imputation data.
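A sketch of measure (5), under the assumption that the translation-normalization procedure of [2] amounts to column-mean centering followed by scaling to unit Frobenius norm (our reading; the details are in [2]):

```python
import numpy as np

def procrustes_gof(A, B):
    """Goodness-of-fit sketch in the spirit of Eq. (5): center each matrix
    (translation), scale to unit Frobenius norm (normalization), then return
    the squared sum of singular values of B_T' A_T, a value in [0, 1]."""
    def tn(M):
        M = np.asarray(M, dtype=float)
        M = M - M.mean(axis=0)          # translation: remove column means
        return M / np.linalg.norm(M)    # normalization: unit Frobenius norm
    At, Bt = tn(A), tn(B)
    sigma = np.linalg.svd(Bt.T @ At, compute_uv=False)
    return float(sigma.sum() ** 2)
```

The measure equals 1 when the two configurations coincide, and it is invariant under a common rotation, as a Procrustes-type measure should be.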

3. Results and Discussion

Table 1 shows that the goodness-of-fit values of the covariance matrices obtained by (5) over the first $n$ principal components are all above 0.83, meaning that the covariance matrices give a good approximation of the correlations among variables in the 2016 EPI data. Table 2 shows that the goodness-of-fit values of the proximity matrices are all above 0.81, meaning that the proximity matrices give a good approximation of the dissimilarity measures among objects in the 2016 EPI data. Figure 1 visualizes the increasing goodness-of-fit values of Tables 1 and 2 as functions of the number of principal components.

Table 1. The goodness-of-fit of the covariance matrices from the first n principal components

  n    DFMI    Gabriel eigen   EM-SVD   Biplot imputation (s = 3)   Biplot imputation (s = 2)
  2    0.835   0.843           0.834    0.838                       0.837
  3    0.893   0.900           0.888    0.895                       0.893
  4    0.936   0.945           0.933    0.941                       0.937
  ...  ...     ...             ...      ...                         ...
  32   0.990   0.990           0.987    0.983                       0.983

Table 2. The goodness-of-fit of the proximity matrices from the first n principal components

  n    DFMI    Gabriel eigen   EM-SVD   Biplot imputation (s = 3)   Biplot imputation (s = 2)
  2    0.832   0.828           0.835    0.822                       0.818
  3    0.891   0.893           0.894    0.885                       0.884
  4    0.943   0.943           0.938    0.937                       0.935
  ...  ...     ...             ...      ...                         ...
  32   0.987   0.994           0.986    0.985                       0.982

The results in Tables 1 and 2 show that the goodness-of-fit values of the imputation data from the four imputation methods are quite similar in the first two principal components. Based on its simplicity, suppose we choose the result of biplot imputation with $s = 2$; we then obtain the two-dimensional representation given in Figure 2.

Figure 1. Visualization of (a) the goodness-of-fit of the covariance matrix and (b) the goodness-of-fit of the proximity matrix.

Figure 2 shows the objects plotted as points and the variables plotted as lines. An interesting property of the biplot with $\alpha = 0$ is that the lengths of the lines are proportional to the standard deviations of the variables, and the cosines of the angles between two lines represent the correlations between the corresponding variables in the 2016 EPI data. The visualization is satisfactory because the goodness-of-fit of the covariance matrix in the first two principal components is 0.837; that is, the first two principal components account for 83.7% of the total information on the correlations among variables in the 2016 EPI data, so the two-dimensional representation is a reasonably faithful representation of those correlations. In this biplot, the Euclidean distance between two points is proportional to the Mahalanobis distance between the corresponding objects in the 2016 EPI data. We cannot use the Mahalanobis distance as the proximity matrix here, however, because the covariance matrix of the 2016 EPI data is only positive semidefinite (singular), so the inverse that the Mahalanobis distance requires does not exist.

In the biplot with $\alpha = 1$, the properties relating to lines and points are different from those for $\alpha = 0$. With $\alpha = 1$, the Euclidean distance between two points in the biplot equals the Euclidean distance between the corresponding objects in the 2016 EPI data. The visualization of the objects is satisfactory because the goodness-of-fit of the proximity matrix in the first two principal components is 0.818; that is, the first two principal components account for 81.8% of the total information on the dissimilarity measures among objects in the 2016 EPI data, so the two-dimensional representation is reasonably faithful.

Figure 2. Biplot from the biplot imputation result with (a)   0 and
(b)   1.

4. Conclusions

In this paper, we have discussed a method to obtain the goodness-of-fit of the imputation data. The results show that the goodness-of-fit of the imputation data can be obtained from the goodness-of-fit of the covariance and proximity matrices. Based on its simplicity, biplot imputation is suggested for imputing the missing values of the 2016 EPI data.

References

[1] S. Arciniegas-Alarcon, M. Garcia-Pena, C. T. S. Dias and W. J. Krzanowski, Imputing missing values in multi-environment trials using the singular value decomposition: an empirical comparison, Commun. Biometry Crop Sci. 9(2) (2014), 54-70.
[2] T. Bakhtiar and Siswadi, On the symmetrical property of Procrustes measure of
distance, Int. J. Pure Appl. Math. 99(3) (2015), 315-324.
[3] A. L. Bello, Choosing among imputation techniques for incomplete multivariate
data: a simulation study, Comm. Statist. Theory Methods 22(3) (1993), 853-877.
[4] Environmental Performance Index, Framework and Indicator Scores, 2016. http://epi.yale.edu/sites/default/files/2016_epi_framework_indicator_scores_friendly.xls.
[5] Environmental Performance Index, Global Metrics for the Environment, 2016. http://epi.yale.edu/sites/default/files/2016EPI_Full_Report_opt.pdf.
[6] K. R. Gabriel, The biplot graphic display of matrices with application to principal
component analysis, Biometrika 58(3) (1971), 453-468.
[7] K. R. Gabriel, Le biplot-outil d’exploration de donnees multidimensionnelles,
Journal de la Societe Francaise de Statistique 143(4) (2002), 5-55.
[8] J. C. Gower, A general coefficient of similarity and some of its properties,
Biometrics 27(4) (1971), 857-871.
[9] W. J. Krzanowski, Cross-validation in principal component analysis, Biometrics
43(3) (1987), 575-584.
[10] X. L. Meng, Missing data: Dial M for ???, J. Amer. Statist. Assoc.
95(452) (2000), 1325-1330.
[11] P. O. Perry, Cross-validation for unsupervised learning, Department of Statistics, Stanford University, 2009.
[12] W. Yan, Biplot analysis of incomplete two-way data, Crop Sci. 53(1) (2013),
48-57.

Ridho Ananda: ananda.ridmate@gmail.com

Siswadi: siswadimathipb@gmail.com

Toni Bakhtiar: tbakhtiar@ipb.ac.id
