Abstract. Many real-world phenomena are better represented by non-precise data than by single-valued data. In fact, non-precise data reflect two sources of variability: the natural variability of the phenomenon and the variability, or uncertainty, induced by measurement errors or determined by specific experimental conditions. The latter source of variability is called imprecision. When information about the distribution of the imprecision is available, fuzzy data coding is used to represent it. In many cases, however, imprecise data are natively defined only by their minimum and maximum values; technical specifications, stock-market daily prices and survey data are some examples of such data. In these cases, interval data provide a suitable coding to take the imprecision into account. This paper aims at describing multiple imprecise data by means of a suitable Principal Component Analysis, based on a specific interval data coding that takes both sources of variation into account.
1 Introduction
Generally, in statistical analysis, we handle single-valued variables; in many cases, however, imprecise data represent a variable coding that better preserves the variables' information. This paper deals with variables that cannot be measured in a precise way. Therefore, in order to represent the vagueness and uncertainty of the data, we propose to adopt a set-valued coding for the generic variable $Y$, instead of the classical single-valued one. An interval $[y]$ is a coding able to represent a continuous and uniformly dense set $y \subset \mathbb{R}$, under the hypothesis that the distribution of $Y$ is uniform or unknown over the interval. Under these assumptions, any point in $y$ represents an admissible numerical coding of $Y$, and the interval numerical coding $[y]$ of $y$ is given by:

$[y] = [\min(y), \max(y)] = [\underline{y}, \overline{y}],$

where $\underline{y}$ and $\overline{y}$ indicate the interval lower bound and upper bound, respectively.
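As a minimal illustration of this interval coding (a sketch in Python; the helper name is ours, not from the paper), the bounds of the set of admissible values are the observed minimum and maximum:

```python
def interval_coding(admissible_values):
    """Code a set of admissible values of Y as [min(y), max(y)]."""
    y = list(admissible_values)
    return (min(y), max(y))

# Any admissible point lies inside the interval coding.
lo, hi = interval_coding([4.1, 3.8, 4.4, 4.0])
assert lo <= 4.0 <= hi
```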
174 Lauro and Palumbo
where $\inf[x] := \underline{x}$ and $\sup[x] := \overline{x}$ indicate the interval lower bound and upper bound, respectively.

A generic row interval vector $[x] = ([x]_1, \ldots, [x]_j, \ldots, [x]_p)$ corresponds to a $p$-dimensional box and is generally identified with the (nonempty) set of its points. The radius of a generic interval is defined as

$\mathrm{rad}([x]) = \Delta([x]) = \frac{1}{2}(\overline{x} - \underline{x}).$
We now introduce some basic definitions of Interval Arithmetic (IA), which allow us to define the mean interval.
The arithmetic operators in the IA framework are defined according to the following basic principle: let $[x]_i$ and $[x]_{i'}$ be two generic bounded intervals in $\mathbb{R}$ and let $x_i \in [x]_i$ and $x_{i'} \in [x]_{i'}$ be two generic values; if $[y] = [x]_i \diamond [x]_{i'}$, then $x_i \diamond x_{i'} = y \in [y]$, $\forall (x_i, x_{i'})$, where $\diamond$ indicates any generic operator.
The sum of $[x]_i$ and $[x]_{i'}$ is defined as:

$[x]_i + [x]_{i'} = [\underline{x}_i + \underline{x}_{i'},\; \overline{x}_i + \overline{x}_{i'}],$

where $[x]_i \subseteq \mathbb{R}$, $\forall i \in \{1, \ldots, n\}$.
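A sketch of this IA principle in Python (the function name is ours): the sum operates endpoint-wise, and the inclusion principle can be checked on sample points of the operands.

```python
def i_add(a, b):
    """Interval sum: endpoint-wise addition of the lower and upper bounds."""
    return (a[0] + b[0], a[1] + b[1])

x1, x2 = (1.0, 2.0), (-1.0, 3.0)
y = i_add(x1, x2)                       # (0.0, 5.0)
# Inclusion principle: for any points taken in the operands, their sum lies in [y].
assert y[0] <= 1.5 + 0.5 <= y[1]
```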
Interval matrices. An interval matrix is an $n \times p$ matrix $[X]$ whose entries $[x]_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}]$ ($i = 1, \ldots, n$; $j = 1, \ldots, p$) are intervals, and $X \in [X]$ is a generic single-valued data matrix satisfying $\underline{X} \le X \le \overline{X}$. The notation for boxes is adapted to interval matrices in the natural componentwise way.

The vertices matrix associated with the generic interval matrix $[X]$ will be denoted by $Z$; it has $n \cdot 2^p$ rows and $p$ columns.
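The construction of the vertices matrix can be sketched as follows (a minimal Python illustration; the helper name and sample data are ours):

```python
import itertools

def vertices_matrix(lower, upper):
    """Enumerate the vertices of every row box of [X]: each of the n rows
    yields 2**p vertices, so the result has n * 2**p rows and p columns."""
    rows = []
    for lo, hi in zip(lower, upper):
        p = len(lo)
        for mask in itertools.product((0, 1), repeat=p):
            rows.append([hi[j] if m else lo[j] for j, m in enumerate(mask)])
    return rows

Z = vertices_matrix([[0, 0], [1, 2]], [[1, 1], [3, 4]])
assert len(Z) == 2 * 2**2       # n * 2**p rows
```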
Standardization

Moving from (2), we define the standard deviation for interval-valued variables. Let $\sigma_j^2$ be the variance of the generic variable $[X]_j$: $\sigma_j = \sqrt{\sigma_j^2}$ is the standard deviation of $[X]_j$, and the square diagonal $p \times p$ matrix $\Sigma$ has generic term $\sigma_j$. The standardized interval matrix is $[Y] = \{X\Sigma^{-1}, \Delta([X])\Sigma^{-1}\}$, assuming $[X]$ to be centered and divided by $\sqrt{n}$.
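A sketch of this standardization step in NumPy. The variance formula of eq. (2) is not reproduced in this chunk, so as a plausible stand-in (an assumption, not the paper's definition) we take the per-column variance as the variance of the midpoints plus the mean squared radius:

```python
import numpy as np

def standardize(X_mid, X_rad):
    """Standardize midpoints and radii column-wise (variance formula assumed:
    var = var(midpoints) + mean(radii**2), a stand-in for the paper's eq. (2))."""
    C = np.asarray(X_mid, float)
    R = np.asarray(X_rad, float)
    n = C.shape[0]
    Cc = C - C.mean(axis=0)                       # centre the midpoints
    sigma = np.sqrt((Cc**2).mean(axis=0) + (R**2).mean(axis=0))
    return Cc / (sigma * np.sqrt(n)), R / (sigma * np.sqrt(n))

Yc, Yr = standardize([[1.0, 10.0], [3.0, 20.0]], [[0.5, 1.0], [0.5, 2.0]])
```

With this scaling, each standardized column carries total "variance" $1/n$, so the overall inertia sums to a constant across columns.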
Let us denote the correlation matrix by $R$:

where $(Y'\Delta([Y]))$ and $(\Delta([Y])'Y)$ have the same diagonal elements. A noteworthy aspect is given by the decomposition of the total inertia. In fact,
Principal Component Analysis for Non-Precise Data 177
where $u_m$ and $\lambda_m$ are defined under the usual orthonormality constraints.
Similarly to the PCA on midpoints, we solve the following eigensystem
to get the ranges PCA solutions:
with the same orthonormality constraints on $\lambda_m$ and $u_m$ as in eq. (6) and with $m = 1, \ldots, p$. Both midpoints and ranges PCA's admit an independent representation. Of course, they have different meanings and outline different aspects. The quantity $\sum_m (\lambda_m^{c} + \lambda_m^{r}) < p$, but it does not include the whole variability, because the residual inertia, given by the midpoints-radii interconnection, has not yet been taken into account.
where $\lambda_1^{c}$ and $\lambda_1^{r}$ represent the first eigenvalues related to the midpoints and radii, respectively. They express partial information; in fact, there is a residual variability, depending on the midpoints-radii connection, that cannot be explicitly taken into account.
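The two independent eigensystems can be sketched as follows (a minimal NumPy illustration of separate midpoints and ranges PCA's; the helper and sample data are ours):

```python
import numpy as np

def pca(M):
    """Eigendecomposition of M'M; the eigenvectors satisfy the usual
    orthonormality constraints, eigenvalues sorted in descending order."""
    vals, vecs = np.linalg.eigh(M.T @ M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 3))              # midpoints
C -= C.mean(axis=0)                      # centred
R = np.abs(rng.normal(size=(8, 3)))      # radii (nonnegative)

lam_c, u_c = pca(C)                      # midpoints PCA
lam_r, u_r = pca(R)                      # ranges PCA
assert np.allclose(u_c.T @ u_c, np.eye(3))   # orthonormality
```

Each solution is obtained independently; the residual inertia tied to the midpoints-radii interconnection is not captured by either decomposition alone.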
In spite of the role they assume in classical PCA, in MR-PCA the squared cosines have an important role in evaluating the achieved results. Squared cosines, also called "relative contributions", represent the amount of the original distances displayed on the factorial plane. From the classical PCA, we define these quantities as the ratio between the vector norms in the principal components space and the original norms computed in $\mathbb{R}^p$:
$\mathrm{SqCos}_i = \sum_{\alpha} \Big( \sum_{j=1}^{p} y_{i,j} u_{j,\alpha} \Big)^2 \Big/ \sum_{j=1}^{p} y_{i,j}^2,$

where $\alpha \in \{1, \ldots, p\}$ represents the set of eigenvectors with respect to which we intend to compute the relative contributions. It is obvious that $0 \le \mathrm{SqCos} \le 1$; in the case $\alpha = \{1, \ldots, p\}$, $\mathrm{SqCos} = 1$.
In the case of interval data, squared cosines are defined as:

$\mathrm{SqCos}_i = \sum_{\alpha} \big( |\psi^{c}_{i,\alpha}| + |\psi^{r}_{i,\alpha}| \big)^2 \Big/ \sum_{j=1}^{p} \big( |y^{c}_{i,j}| + |\mathrm{rad}([y])_{i,j}| \big)^2,$

where $\psi^{c}_{i,\alpha}$ and $\psi^{r}_{i,\alpha}$ are defined in (9) and are centered variables. Differently from the case of single-valued data, the condition $\alpha = \{1, \ldots, p\}$ does not ensure that $\mathrm{SqCos} = 1$. In most cases, we get squared cosines less than one, even if we consider the whole set of eigenvectors $(u_1, u_2, \ldots, u_p)$. Due to the effects of the rotation, it may also happen that $\mathrm{SqCos} > 1$; in such a case, the SqCos reveals that the rectangle associated with the element is oversized with respect to its original size.
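A sketch of the interval squared cosines in NumPy (variable names are ours; `psi_c` and `psi_r` stand for the midpoint and radius coordinates on the factorial axes):

```python
import numpy as np

def sq_cos(Yc, Yr, psi_c, psi_r, axes):
    """Interval squared cosines: (|midpoint| + |radius|) norms on the chosen
    axes over the corresponding original norms in R^p."""
    num = ((np.abs(psi_c[:, axes]) + np.abs(psi_r[:, axes]))**2).sum(axis=1)
    den = ((np.abs(Yc) + np.abs(Yr))**2).sum(axis=1)
    return num / den

Yc = np.array([[1.0, 2.0], [0.5, -1.0]])      # standardized midpoints
Yr = np.array([[0.2, 0.1], [0.3, 0.4]])       # standardized radii
U = np.linalg.qr(np.array([[1.0, 1.0], [1.0, -1.0]]))[0]   # an orthonormal basis
sc = sq_cos(Yc, Yr, Yc @ U, Yr @ U, axes=[0, 1])
# With interval data, SqCos need not equal 1 even when all p axes are used.
assert np.all(sc > 0)
```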
The radii rotation is obtained in the sense of a "least squares" analysis, and this rotation does not ensure that the total variance is completely represented by the principal components. A measure of goodness of fit allows us to evaluate the quality of the representation. We propose to adopt a generalization of the $R^2$ index, obtained as the ratio between the variance defined with respect to the principal components and the variance in the original $\mathbb{R}^p$ space. Variances are determined by the formula in (3).
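As a numeric illustration of this goodness-of-fit index (the total inertia value 4.0 is our inference from the residual 0.178 = 4.45% figures reported in the application section, not a value stated as such in the paper):

```python
def r2_index(total_inertia, residual_inertia):
    """Generalized R^2: share of the total variance represented by the PCs."""
    return 1.0 - residual_inertia / total_inertia

# Figures from the application section: residual 0.178, i.e. 4.45% of the total.
r2 = r2_index(4.0, 0.178)
assert abs(r2 - 0.9555) < 1e-9
```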
This section shows the results obtained by the method described in Section 3. Data are reported in Table 1 and refer to some characteristics describing eight different species of Italian peppers.
Partial correlation matrices for the Protein, Lipid and Glucide variables.
The correlation matrix, resulting from the element-wise sum of the partial
matrices, can be interpreted as a classical symmetric correlation matrix. It
has values equal to one on the main diagonal and values between -1 and 1
otherwise.
Global correlation matrix for the Protein, Lipid and Glucide variables.
Figure 1 shows the midpoints (a) and ranges (b) variables. Circles indicate the maximum norm that can be represented, determined according to the correlation decomposition. Considering the midpoints variables, in the present example the maximum variability is 0.612 (corresponding to the Lipid variable); this implies that the maximum variable length is $\sqrt{0.612} = 0.782$. As the two graphics represent a part of the total variance, radii are $\le 1$. The interpretation of the graphical results can be done following the usual rules adopted in the case of single-valued data, separately for midpoints and ranges. Figure 2 displays the initial solution (a) and the final solution (b) obtained after the rotation. In this case, the algorithm stopped after 4 iterations. The residual variance turned out to be 0.178, equivalent to 4.45% of the total inertia. This small residual variance indicates the good result obtained by the analysis. The percentage of inertia associated with the first two principal components is equal to 79.33%. In Table 2 we summarize the most important analytical results necessary for a correct interpretation. The first two columns refer to the SqCos with respect to the first two factors singly considered. The third one represents the quality of the
representation on the factorial plane spanned by the first two factors. Taking into account the SqCos, we observe that Grosso di Nocera and Cuban Nano have the highest values. The segment traced inside each rectangle represents the rotated range and indicates which variables have mainly contributed to the ranges' orientation. Referring to Grosso di Nocera and Cuban Nano, we observe that, with respect to the first factorial plane, their sizes and shapes were characterized by the same variables, but with opposite directions.
Sometimes, the complexity of interval data can generate unclear graphical representations when the number of statistical units is large, because the box representation, both in the original $\mathbb{R}^p$ variable space and, even more, in the $\mathbb{R}^2$ subspaces, can cause a severe overlapping of the statistical units, making the interpretation difficult.
                        Cos2                 Abs.Contr.
                    F1     F2    F1+F2      F1%     F2%
Corno di Bue       0.048  0.450  0.498      0.34    7.05
Cuban              0.632  0.068  0.700     12.50    6.33
Cuban Nano         0.979  0.029  1.008     46.09    1.65
Grosso di Nocera   0.358  0.670  1.028      4.25   22.92
Pimiento           0.396  0.019  0.414      5.48    0.25
Quadrato d'Asti    0.646  0.097  0.742     18.69   21.67
Sunnybrook         0.187  0.380  0.567      2.31   27.12
Yolo Wonder        0.704  0.171  0.875     10.34   13.01
Total                                     100.00  100.00
Table 2. SqCos and Absolute Contribution for the Italian Peppers dataset
Acknowledgments:
This paper was financially supported by the grant "Metodi statistici e tecniche di visualizzazione grafica per grandi basi di dati" (F. Palumbo, Università di Macerata, 2002/2003) and by the IST-2000-26341 Vitamin-S European project (C. Lauro, DMS Napoli, 2000/2003).