Abstract. Many real-world phenomena are better represented by non-precise data than by single-valued data. In fact, non-precise data reflect two sources of variability: the natural variability of the phenomenon and the variability, or uncertainty, induced by measurement errors or determined by specific experimental conditions. The latter source of variability is called imprecision. When information about the distribution of the imprecision is available, fuzzy data coding is used to represent it. In many cases, however, imprecise data are natively defined only by their minimum and maximum values; technical specifications, stock-market daily prices and survey data are some examples of such data. In these cases, interval data provide a suitable coding to take the imprecision into account. This paper aims at describing multiple imprecise data by means of a suitable Principal Component Analysis, based on a specific interval data coding that takes both sources of variation into account.
1 Introduction
Generally, in statistical analysis, we handle single-valued variables; in many cases, however, imprecise data represent a variable coding that better preserves the variables' information. This paper deals with variables that cannot be measured in a precise way. Therefore, in order to represent the vagueness and uncertainty of the data, we propose to adopt a set-valued coding for the generic variable $Y$, instead of the classical single-valued one. An interval $[y]$ is a coding able to represent a continuous and uniformly dense set $y \subset \mathbb{R}$, under the hypothesis that the distribution of $Y$ is uniform or unknown over the interval. Under these assumptions, any point in $y$ represents an admissible numerical coding of $Y$, and the interval numerical coding $[y]$ of $y$ is given by:

$[y] = [\min(y), \max(y)] = [\underline{y}, \overline{y}],$

where $\underline{y}$ and $\overline{y}$ indicate the interval lower bound and upper bound, respectively.
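As a minimal illustration of this interval coding (a sketch in Python; the helper name is ours, not from the paper), the bounds of the set of admissible values are the observed minimum and maximum:

```python
def interval_coding(admissible_values):
    """Code a set of admissible values of Y as [min(y), max(y)]."""
    y = list(admissible_values)
    return (min(y), max(y))

# Any admissible point lies inside the interval coding.
lo, hi = interval_coding([4.1, 3.8, 4.4, 4.0])
assert lo <= 4.0 <= hi
```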
174 Lauro and Palumbo
where $\inf[x] := \underline{x}$ and $\sup[x] := \overline{x}$ indicate the interval lower bound and upper bound, respectively.

A generic row interval vector $[x] = ([x]_1, \ldots, [x]_j, \ldots, [x]_p)$ corresponds to a $p$-dimensional box and is generally identified with the (nonempty) set of its points. The radius of a generic interval is defined as

$\mathrm{rad}([x]) = \Delta([x]) = \frac{1}{2}(\overline{x} - \underline{x}).$
We now introduce some basic definitions of Interval Arithmetic (IA), which allow us to define the mean interval.
The arithmetic operators in the IA framework are defined according to the following basic principle: let $[x]_i$ and $[x]_{i'}$ be two generic bounded intervals in $\mathbb{R}$ and let $x_i \in [x]_i$ and $x_{i'} \in [x]_{i'}$ be two generic values; if $[y] = [x]_i \diamond [x]_{i'}$, then $x_i \diamond x_{i'} = y \in [y]$, $\forall (x_i, x_{i'})$, where $\diamond$ indicates any generic operator.
The sum of $[x]_i$ and $[x]_{i'}$ is defined as:

$[x]_i + [x]_{i'} = [\underline{x}_i + \underline{x}_{i'},\; \overline{x}_i + \overline{x}_{i'}],$

where $[x]_i \subseteq \mathbb{R}$, $\forall i \in \{1, \ldots, n\}$.
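A sketch of this IA principle in Python (the function name is ours): the sum operates endpoint-wise, and the inclusion principle can be checked on sample points of the operands.

```python
def i_add(a, b):
    """Interval sum: endpoint-wise addition of the lower and upper bounds."""
    return (a[0] + b[0], a[1] + b[1])

x1, x2 = (1.0, 2.0), (-1.0, 3.0)
y = i_add(x1, x2)                       # (0.0, 5.0)
# Inclusion principle: for any points taken in the operands, their sum lies in [y].
assert y[0] <= 1.5 + 0.5 <= y[1]
```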
Interval matrices. An interval matrix is an $n \times p$ matrix $[X]$ whose entries $[x]_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}]$ ($i = 1, \ldots, n$; $j = 1, \ldots, p$) are intervals, and $X \in [X]$ is a generic single-valued data matrix satisfying $\underline{X} \le X \le \overline{X}$. The notation for boxes is adapted to interval matrices in the natural componentwise way.

The vertices matrix associated with the generic interval matrix $[X]$ will be denoted by $Z$; it has $n \cdot 2^p$ rows and $p$ columns.
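The construction of the vertices matrix can be sketched as follows (a minimal Python illustration; the helper name and sample data are ours):

```python
import itertools

def vertices_matrix(lower, upper):
    """Enumerate the vertices of every row box of [X]: each of the n rows
    yields 2**p vertices, so the result has n * 2**p rows and p columns."""
    rows = []
    for lo, hi in zip(lower, upper):
        p = len(lo)
        for mask in itertools.product((0, 1), repeat=p):
            rows.append([hi[j] if m else lo[j] for j, m in enumerate(mask)])
    return rows

Z = vertices_matrix([[0, 0], [1, 2]], [[1, 1], [3, 4]])
assert len(Z) == 2 * 2**2       # n * 2**p rows
```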
Standardization

Moving from (2), we define the standard deviation for interval-valued variables. Let $\sigma_j^2$ be the variance of the generic variable $[X]_j$: $\sigma_j = \sqrt{\sigma_j^2}$ is the standard deviation of $[X]_j$, and the square diagonal $p \times p$ matrix $\Sigma$ has generic term $\sigma_j$. The standardized interval matrix is $[Y] = \{X\Sigma^{-1}, \Delta([X])\Sigma^{-1}\}$, assuming $[X]$ to be centered and divided by $\sqrt{n}$.
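A sketch of this standardization step in NumPy. The variance formula of eq. (2) is not reproduced in this chunk, so as a plausible stand-in (an assumption, not the paper's definition) we take the per-column variance as the variance of the midpoints plus the mean squared radius:

```python
import numpy as np

def standardize(X_mid, X_rad):
    """Standardize midpoints and radii column-wise (variance formula assumed:
    var = var(midpoints) + mean(radii**2), a stand-in for the paper's eq. (2))."""
    C = np.asarray(X_mid, float)
    R = np.asarray(X_rad, float)
    n = C.shape[0]
    Cc = C - C.mean(axis=0)                       # centre the midpoints
    sigma = np.sqrt((Cc**2).mean(axis=0) + (R**2).mean(axis=0))
    return Cc / (sigma * np.sqrt(n)), R / (sigma * np.sqrt(n))

Yc, Yr = standardize([[1.0, 10.0], [3.0, 20.0]], [[0.5, 1.0], [0.5, 2.0]])
```

With this scaling, each standardized column carries total "variance" $1/n$, so the overall inertia sums to a constant across columns.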
Let us denote the correlation matrix by $R$:

where $(Y'\Delta([Y]))$ and $(\Delta([Y])'Y)$ have the same diagonal elements. A noteworthy aspect is given by the decomposition of the total inertia. In fact,
Principal Component Analysis for Non-Precise Data 177
where $u_m$ and $\lambda_m$ are defined under the usual orthonormality constraints.
Similarly to the PCA on midpoints, we solve the following eigensystem
to get the ranges PCA solutions:
with the same orthonormality constraints on $\lambda_m$ and $u_m$ as in eq. (6) and with $m = 1, \ldots, p$. Both midpoints and ranges PCA's admit an independent representation. Of course, they have different meanings and outline different aspects. The quantity $\sum_m (\lambda_m^{c} + \lambda_m^{r}) < p$, but it does not include the whole variability, because the residual inertia, given by the midpoints-radii interconnection, has not yet been taken into account.
where $\lambda_1^{c}$ and $\lambda_1^{r}$ represent the first eigenvalues related to the midpoints and radii, respectively. They express partial information; in fact, there is a residual variability, depending on the midpoints-radii connection, that cannot be explicitly taken into account.
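The two independent eigensystems can be sketched as follows (a minimal NumPy illustration of separate midpoints and ranges PCA's; the helper and sample data are ours):

```python
import numpy as np

def pca(M):
    """Eigendecomposition of M'M; the eigenvectors satisfy the usual
    orthonormality constraints, eigenvalues sorted in descending order."""
    vals, vecs = np.linalg.eigh(M.T @ M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 3))              # midpoints
C -= C.mean(axis=0)                      # centred
R = np.abs(rng.normal(size=(8, 3)))      # radii (nonnegative)

lam_c, u_c = pca(C)                      # midpoints PCA
lam_r, u_r = pca(R)                      # ranges PCA
assert np.allclose(u_c.T @ u_c, np.eye(3))   # orthonormality
```

Each solution is obtained independently; the residual inertia tied to the midpoints-radii interconnection is not captured by either decomposition alone.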
In spite of the role they assume in classical PCA, in MR-PCA the squared cosines have an important role in evaluating the achieved results. Squared cosines, also called "relative contributions", represent the amount of the original distances displayed on the factorial plane. From the classical PCA, we define these quantities as the ratio between the vector norms in the principal components space and the original norms computed in $\mathbb{R}^p$:
$\mathrm{SqCos}_i = \sum_{\alpha} \Big( \sum_{j=1}^{p} y_{i,j} u_{j,\alpha} \Big)^2 \Big/ \sum_{j=1}^{p} y_{i,j}^2,$

where $\alpha \in \{1, \ldots, p\}$ represents the set of eigenvectors with respect to which we intend to compute the relative contributions. It is obvious that $0 \le \mathrm{SqCos} \le 1$; in the case $\alpha = \{1, \ldots, p\}$, $\mathrm{SqCos} = 1$.
In the case of interval data, squared cosines are defined as:

$\mathrm{SqCos}_i = \sum_{\alpha} \big( |\psi^{c}_{i,\alpha}| + |\psi^{r}_{i,\alpha}| \big)^2 \Big/ \sum_{j=1}^{p} \big( |y^{c}_{i,j}| + |\mathrm{rad}([y])_{i,j}| \big)^2,$

where $\psi^{c}_{i,\alpha}$ and $\psi^{r}_{i,\alpha}$ are defined in (9) and are centered variables. Differently from the case of single-valued data, the condition $\alpha = \{1, \ldots, p\}$ does not ensure that $\mathrm{SqCos} = 1$. In most cases, we get squared cosines less than one, even if we consider the whole set of eigenvectors $(u_1, u_2, \ldots, u_p)$. Due to the effects of the rotation, it may also happen that $\mathrm{SqCos} > 1$; in such a case, the SqCos reveals that the rectangle associated with the element is oversized with respect to its original size.
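A sketch of the interval squared cosines in NumPy (variable names are ours; `psi_c` and `psi_r` stand for the midpoint and radius coordinates on the factorial axes):

```python
import numpy as np

def sq_cos(Yc, Yr, psi_c, psi_r, axes):
    """Interval squared cosines: (|midpoint| + |radius|) norms on the chosen
    axes over the corresponding original norms in R^p."""
    num = ((np.abs(psi_c[:, axes]) + np.abs(psi_r[:, axes]))**2).sum(axis=1)
    den = ((np.abs(Yc) + np.abs(Yr))**2).sum(axis=1)
    return num / den

Yc = np.array([[1.0, 2.0], [0.5, -1.0]])      # standardized midpoints
Yr = np.array([[0.2, 0.1], [0.3, 0.4]])       # standardized radii
U = np.linalg.qr(np.array([[1.0, 1.0], [1.0, -1.0]]))[0]   # an orthonormal basis
sc = sq_cos(Yc, Yr, Yc @ U, Yr @ U, axes=[0, 1])
# With interval data, SqCos need not equal 1 even when all p axes are used.
assert np.all(sc > 0)
```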
The radii rotation is obtained in the sense of a "least squares" analysis, and this rotation does not ensure that the total variance is completely represented by the principal components. A measure of goodness of fit allows us to evaluate the quality of the representation. We propose to adopt a generalization of the $R^2$ index, obtained as the ratio between the variance defined with respect to the principal components and the variance in the original $\mathbb{R}^p$ space. Variances are determined by the formula in (3).
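As a numeric illustration of this goodness-of-fit index (the total inertia value 4.0 is our inference from the residual 0.178 = 4.45% figures reported in the application section, not a value stated as such in the paper):

```python
def r2_index(total_inertia, residual_inertia):
    """Generalized R^2: share of the total variance represented by the PCs."""
    return 1.0 - residual_inertia / total_inertia

# Figures from the application section: residual 0.178, i.e. 4.45% of the total.
r2 = r2_index(4.0, 0.178)
assert abs(r2 - 0.9555) < 1e-9
```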
This section shows the results obtained by the method described in Section 3. Data are reported in Table 1 and refer to some characteristics describing eight different species of Italian peppers.
Partial correlation matrices for the Protein, Lipid and Glucide variables.
The correlation matrix, resulting from the element-wise sum of the partial
matrices, can be interpreted as a classical symmetric correlation matrix. It
has values equal to one on the main diagonal and values between -1 and 1
otherwise.
Global correlation matrix for the Protein, Lipid and Glucide variables.
Figure 1 shows the midpoints (a) and ranges (b) variables. Circles indicate the maximum norm that can be represented, determined according to the correlation decomposition. Considering the midpoints variables, in the present example the maximum variability is 0.612 (corresponding to the Lipid variable); this implies that the maximum variable length is $\sqrt{0.612} = 0.782$. As the two graphics represent a part of the total variance, radii are $\le 1$. The interpretation of the graphical results can be done following the usual rules adopted in the case of single-valued data, separately for midpoints and ranges. Figure 2 displays the initial solution (a) and the final solution (b) obtained after the rotation. In this case, the algorithm stopped after 4 iterations. The residual variance turned out to be 0.178, equivalent to 4.45% of the total inertia. This small residual variance indicates the good result obtained by the analysis. The percentage of inertia associated with the first two principal components is equal to 79.33%. In Table 2 we summarize the most important analytical results necessary for a correct interpretation. The first two columns refer to the SqCos with respect to the first two factors singly considered. The third one represents the quality of the
representation on the factorial plane spanned by the first two factors. Taking into account the SqCos, we observe that Grosso di Nocera and Cuban Nano have the highest values. The segment traced inside each rectangle represents the rotated range and indicates which variables have mainly contributed to the ranges' orientation. Referring to Grosso di Nocera and Cuban Nano, we observe that, with respect to the first factorial plane, their sizes and shapes were characterized by the same variables, but with opposite directions.
Sometimes, the complexity of interval data can generate unclear graphical representations when the number of statistical units is large, because the box representation, both in the original $\mathbb{R}^p$ variable space and, even more, in the $\mathbb{R}^2$ subspaces, can cause a severe overlapping of the statistical units, making the interpretation difficult.
                        Cos2                 Abs.Contr.
                    F1     F2    F1+F2      F1%     F2%
Corno di Bue       0.048  0.450  0.498      0.34    7.05
Cuban              0.632  0.068  0.700     12.50    6.33
Cuban Nano         0.979  0.029  1.008     46.09    1.65
Grosso di Nocera   0.358  0.670  1.028      4.25   22.92
Pimiento           0.396  0.019  0.414      5.48    0.25
Quadrato d'Asti    0.646  0.097  0.742     18.69   21.67
Sunnybrook         0.187  0.380  0.567      2.31   27.12
Yolo Wonder        0.704  0.171  0.875     10.34   13.01
Total                                     100.00  100.00
Table 2. SqCos and Absolute Contribution for the Italian Peppers dataset
Acknowledgments:
This paper was financially supported by the grant "Metodi statistici e tecniche di visualizzazione grafica per grandi basi di dati" (F. Palumbo, Università di Macerata, 2002/2003) and by the IST-2000-26341 Vitamin-S European project (C. Lauro, DMS Napoli, 2000/2003).