
Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining

Hairong Qi, Tsei-Wei Wang, J. Douglas Birdwell
University of Tennessee, Knoxville, TN 37996, USA
Previous data mining activities have mostly focused on mining a centralized database. One major problem with a centralized database is its limited scalability. Because of the distributed nature of many businesses and the exponentially increasing amount of data generated from numerous sources, a distributed database becomes an attractive alternative. The challenge in distributed data mining is how to learn as much knowledge from distributed databases as we do from a centralized database without consuming too much communication bandwidth. Both unsupervised classification (clustering) and supervised classification are common practices in data mining applications, where dimensionality reduction is a necessary step. Principal component analysis is a popular technique used in dimensionality reduction. This paper develops a distributed principal component analysis algorithm which derives the global principal components from distributed databases based on the integration of local covariance matrices. We prove that for homogeneous databases, the algorithm derives global principal components that are exactly the same as those calculated from a centralized database. We also provide a quantitative measurement of the error introduced in the reconstructed global principal components when the databases are heterogeneous.

I INTRODUCTION
Data mining is a technology that deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. In an information society where we are drowning in information but starved for knowledge [8], data mining provides an effective means to analyze uncontrolled and unorganized data and turn them into meaningful knowledge. The development of different data mining technologies has been spurred since the early 90s. Grossman [4] classified data mining systems into three generations: the first generation develops single algorithms or collections of data mining algorithms to mine vector-valued data. The second generation supports mining of larger datasets and datasets in higher dimensions. It also includes developing data mining schema and data mining languages to integrate mining into database management systems.

The third generation provides distributed data mining in a transparent fashion. Currently available commercial data mining systems mainly belong to the first generation. With the advances in computer networking and information technology, new challenges are brought to the data mining community, which we summarize as follows:

1) Large datasets with increased complexity (high dimension);
2) New data types, including object-valued attributes, unstructured data (textual data, images, etc.), and semi-structured data (html-tagged data);
3) Geographically distributed data locations with heterogeneous data schema;
4) Dynamic environments with data items updated in real time; and
5) Progressive data mining, which returns quick, partial, or approximate results that can be fine-tuned later in support of more active interaction between the user and the data mining system.

The focus of previous data mining research has been on a centralized database. One major problem with a centralized database is its limited scalability. On the other hand, many databases nowadays tend to be maintained in a distributed manner, not only because many businesses have a distributed nature, but also because growth can be sustained more gracefully in a distributed system. This paper discusses the problem of distributed data mining (DDM) from geographically distributed data locations, with databases being either homogeneous or heterogeneous.

Data mining in distributed systems can be carried out in two different fashions. In the first, data from distributed locations are transferred to a central processing center, where the distributed databases are combined into a data warehouse before any further processing is done. During this process, large amounts of data are moved through the network. A second framework is to carry out local data mining first. Global knowledge can then be derived by integrating the partial knowledge obtained from the local databases. It is expected that by integrating knowledge instead of data, network bandwidth can be saved and the computational load can be more evenly distributed. Since the partial knowledge only reflects properties of the local database, how to integrate these pieces of partial knowledge into global knowledge that represents the characteristics of the overall data collection remains a problem. Guo et al. argued in [5] that in distributed classification problems, the classification error of a global model should, at worst, be the same as the average classification error of the local models and, at best, be lower than the error of the non-distributed model learned on the same domain.

Popularly used data mining techniques include association rule discovery [11], clustering (unsupervised classification), and supervised classification. With the growth of distributed databases, distributed approaches to all three techniques have been developed since the early 90s. Chan and Stolfo proposed a distributed meta-learning algorithm based on the JAM system [2], which is one of the earliest distributed data mining systems developed. JAM [10] stands for Java Agents for Meta-learning. It is a multi-agent framework that carries out meta-learning for fraud detection in banking systems and intrusion detection for network security. In the distributed meta-learning system, classifiers are first derived from different training datasets using different classification algorithms. These base classifiers are then collected or combined by another learning process, the meta-learning process, to

generate a meta-classifier that integrates the separately learned classifiers. Guo and Sutiwaraphun proposed a similar approach named distributed classification with knowledge probing (DCKP) [5]. The difference between DCKP and meta-learning lies in the second learning phase and the form of the final results. In DCKP, the second learning phase is performed on a probing set whose class values are the combinations of predictions from the base classifiers. The result is one descriptive model at the base level rather than the meta level. The performance reported from the empirical studies of both approaches varies from dataset to dataset. Most of the time, the distributed approach performs worse than the non-distributed approach. Recently, there has been significant progress in DDM, and there are approaches dealing with massive datasets that do better than the non-distributed learned model [9]. Kargupta et al. [7] proposed collective data mining (CDM) to learn a function which approximates the actual relationship between data attributes by inductive learning. The key idea of CDM is to represent this function as a weighted summation over an orthonormal basis. Each local dataset generates its own weights corresponding to the same basis. Cross terms in the function can be solved when local weights are collected at a central site. Kargupta et al. also studied distributed clustering using collective principal component analysis (PCA) [6]. Collective PCA has the same objective as global PCA. However, in collective PCA, local principal components, as well as sampled data items from the local datasets, need to be sent to a central site in order to derive the global principal components that can be applied to all datasets. In global PCA, no data items from the local databases are needed in the derivation of the global principal components. Except for the CDM approach proposed by Kargupta, most current DDM methods deal only with homogeneous databases. Almost all DDM algorithms need to transfer some data items from the local databases in order to derive the global model. The objective of global PCA is to derive the exact or high-precision global model, from homogeneous or heterogeneous databases respectively, without the transfer of any local data items.

II PRINCIPAL COMPONENT ANALYSIS


Principal component analysis (PCA) is a popular technique for dimensionality reduction, which, in turn, is a necessary step in classification [3]. It constructs a representation of the data with a set of orthogonal basis vectors that are the eigenvectors of the covariance matrix generated from the data, and which can also be derived from the singular value decomposition. By projecting the data onto the dominant eigenvectors, the dimension of the original dataset can be reduced with little loss of information. In the PCA-relevant literature, PCA is often presented using the eigenvalue/eigenvector decomposition of the covariance matrix. But in efficient computation related to PCA, it is the singular value decomposition (SVD) of the data matrix that is used. The relationship between the eigen-decomposition of the covariance matrix and the SVD of the data matrix itself is presented below to make the connection. In this paper, the eigen-decomposition of the covariance matrix and the SVD of the data matrix

are used interchangeably.

Let X be the data repository with m records of dimension d (m ≫ d). Assume the dataset is mean-centered so that E[X] = 0. A modern PCA method is based on finding the singular values and orthonormal singular vectors of the matrix X, as shown in Eq. 1,

    X = U Σ V^T    (1)

where U and V are the left and right singular vectors of X, and Σ is a diagonal matrix with positive singular values σ_1, σ_2, ..., σ_d (d = rank(X), assuming d < m) along the diagonal, arranged in descending order. Using the covariance matrix to calculate the eigenvectors, let C = E[X^T X] represent the covariance matrix of X. Then the right singular vectors contained in V of Eq. 1 are the same as the normalized eigenvectors of the covariance matrix C (Eq. 2). In addition, if the nonzero eigenvalues λ_k of C are arranged in descending order, then the kth singular value of X is equal to the square root of the kth eigenvalue of C, that is, σ_k = √λ_k, and

    C = V Σ² V^T    (2)
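To make the connection concrete, the following minimal numpy sketch (synthetic data; not from the paper) computes the SVD of a mean-centered matrix X and checks that the right singular vectors and squared singular values match the eigenvectors and eigenvalues of C = X^T X, as stated in Eqs. 1 and 2:

```python
import numpy as np

# Eqs. 1-2: PCA via SVD of a mean-centered data matrix X (m >> d), synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # m = 500 records of dimension d = 5
X = X - X.mean(axis=0)                        # mean-center so that E[X] = 0

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # Eq. 1: X = U diag(s) V^T

C = X.T @ X                                   # scatter/covariance matrix used in Eq. 2
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]                # sort eigenvalues/eigenvectors in descending order

print(np.allclose(s**2, lam))                 # sigma_k = sqrt(lambda_k)
print(np.allclose(np.abs(Vt), np.abs(V.T)))   # right singular vectors = eigenvectors (up to sign)

k = 2
X_reduced = X @ Vt[:k].T                      # project onto the k dominant components
```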

This paper presents an algorithm to calculate the global principal components from distributed databases by transferring only the eigenvectors and eigenvalues (or the singular vectors and the singular values) of the local covariance matrices instead of the original datasets. We assume the dimension of the data (d) is much smaller than the number of data samples (m). We prove that for homogeneous databases, the global principal components derived are exactly the same as those derived from a centralized data warehouse. We also quantitatively measure the error introduced in the integrated global principal components when the databases are heterogeneous.

III GLOBAL PCA FOR DISTRIBUTED HOMOGENEOUS DATABASES


We first derive the global PCA algorithm assuming the distributed databases are homogeneous and the dimension of the dataset is much smaller than the number of data samples (d ≪ m). Let X_{m×d} and Y_{p×d} be two distributed databases of the same dimension (d). X is a matrix of m rows and d columns, and Y of p rows and d columns. The number of columns is in fact the dimension of the data samples in the database, and the number of rows is the number of data samples. Assume that m, p ≫ d. We present Lemma 1 and provide the proof.

Lemma 1: If matrices X_{m×d} and Y_{p×d} are mean-centered, that is, E[X] = E[Y] = 0, then

    eig[(m - 1) cov(X) + (p - 1) cov(Y)] = eig[(m + p - 1) cov([X; Y])]    (3)

where eig[A] denotes the eigenvectors and eigenvalues of matrix A, and [X; Y] denotes the (m + p) × d matrix formed by stacking X on top of Y.

Proof: Since E[X] = 0 and E[Y] = 0, we have E[[X; Y]] = 0.

According to the definition of the covariance matrix,

    cov(X) = (1/(m - 1)) Σ_{k=1}^{m} (X_k - E[X])^T (X_k - E[X]) = (1/(m - 1)) X^T X,

and similarly,

    cov(Y) = (1/(p - 1)) Y^T Y.

Therefore, the left hand side of Eq. 3 is

    (m - 1) cov(X) + (p - 1) cov(Y) = X^T X + Y^T Y.

The right hand side of Eq. 3 can be derived in a similar way:

    cov([X; Y]) = (1/(m + p - 1)) [X; Y]^T [X; Y] = (1/(m + p - 1)) (X^T X + Y^T Y).

Therefore, the right hand side of Eq. 3 also equals

    (m + p - 1) · (1/(m + p - 1)) (X^T X + Y^T Y) = X^T X + Y^T Y.  ∎
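The identity in Lemma 1 is also easy to verify numerically. The short sketch below (synthetic data; illustrative only) builds the two sides of Eq. 3 and confirms they are the same matrix, and hence have the same eigenvectors and eigenvalues:

```python
import numpy as np

# Numerical check of Lemma 1 (Eq. 3) for two mean-centered blocks X (m x d) and Y (p x d).
rng = np.random.default_rng(1)
m, p, d = 200, 300, 4
X = rng.normal(size=(m, d)); X -= X.mean(axis=0)
Y = rng.normal(size=(p, d)); Y -= Y.mean(axis=0)

cov = lambda A: A.T @ A / (A.shape[0] - 1)    # cov(A) = A^T A / (n - 1) for mean-centered A

lhs = (m - 1) * cov(X) + (p - 1) * cov(Y)     # left-hand side of Eq. 3
Z = np.vstack([X, Y])                         # the stacked database [X; Y]
rhs = (m + p - 1) * cov(Z)                    # right-hand side of Eq. 3

print(np.allclose(lhs, rhs))                  # True: equal matrices, hence identical eig[.]
```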

We can extend Lemma 1 to the case of multiple distributed databases. Let X_i represent the local database of dimension d, with i the index of the distributed databases, i = 1, ..., r. The global principal components can be derived by Eq. 4,

    eig[((Σ_{i=1}^{r} m_i) - 1) cov([X_1; ...; X_r])] = eig[(m_1 - 1) cov(X_1) + ... + (m_r - 1) cov(X_r)]    (4)

where m_i (i = 1, ..., r) is the number of samples in the distributed database i. Eq. 4 says that the principal components of the combined matrix (the global principal components) can be derived from the distributed databases through the covariance matrices, assuming the distributed databases are homogeneous. We have shown in Eq. 1 and Eq. 2 that the covariance matrix can be calculated if the singular values and singular vectors are known.

Based on the above analysis, we design the global principal component derivation algorithm for distributed homogeneous databases. Let X_i denote the local dataset. X_i has m_i samples and is of dimension d. i = 1, ..., r is the index of the different dataset locations. The algorithm is carried out in two steps:

Step 1: At each local site, calculate the singular values and the singular vectors of the local database X_i using the SVD method, as shown in Eq. 5,

    X_i = U_i Σ_i V_i^T    (5)

where, according to Eq. 2, the columns of V_i are the eigenvectors of the covariance matrix of X_i, and the singular values σ_i = (σ_{i1}, σ_{i2}, ..., σ_{id}) along the diagonal of Σ_i are the square roots of the eigenvalues of the covariance matrix of X_i.

Step 2: Transfer the singular values (Σ_i), the singular vectors (V_i), and the number of data samples m_i from the local site (i) to another site (j = i + 1). That is, transfer the covariance matrix and the number of data samples from the local site to another site. Let the knowledge transfer process be carried out serially. Reconstruct the covariance matrix based on data from sites i and j as shown in Eq. 6,

    C_{i+j} = (1/(m_i + m_j - 1)) ((m_i - 1) C_i + (m_j - 1) C_j)
            = (1/(m_i + m_j - 1)) ((m_i - 1) V_i Σ_i² V_i^T + (m_j - 1) V_j Σ_j² V_j^T)    (6)

where C_i and C_j are the covariance matrices of X_i and X_j respectively, and C_{i+j} is the covariance matrix of [X_i; X_j]. Step 2 continues until C_{1+2+...+r} is generated. We can then use the SVD method to derive the global principal components of the distributed datasets. This process is also illustrated in Fig. 1.
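The two steps can be sketched for two sites as follows. This is an illustrative numpy example with synthetic site data, not the authors' implementation; only (V_i, σ_i, m_i) are used at the receiving site, as in Eq. 6:

```python
import numpy as np

# Steps 1-2 for two homogeneous sites (synthetic data; illustrative only).
rng = np.random.default_rng(2)
d = 6
X1 = rng.normal(size=(400, d)); X1 -= X1.mean(axis=0)    # local dataset at site 1
X2 = rng.normal(size=(250, d)); X2 -= X2.mean(axis=0)    # local dataset at site 2

def local_summary(X):
    """Step 1: local SVD; only (V_i, sigma_i, m_i) will leave the site."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T, s, X.shape[0]

(V1, s1, m1), (V2, s2, m2) = local_summary(X1), local_summary(X2)

# Step 2 (at the receiving site): rebuild each local covariance as V_i diag(s_i^2) V_i^T / (m_i - 1),
# then merge the two covariances as in the first line of Eq. 6.
C1 = V1 @ np.diag(s1**2) @ V1.T / (m1 - 1)
C2 = V2 @ np.diag(s2**2) @ V2.T / (m2 - 1)
C12 = ((m1 - 1) * C1 + (m2 - 1) * C2) / (m1 + m2 - 1)

# Sanity check against the covariance of the stacked data, which is never transferred.
Z = np.vstack([X1, X2])
print(np.allclose(C12, Z.T @ Z / (m1 + m2 - 1)))          # True for homogeneous databases
```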

    
Figure 1: Illustration of global PCA for distributed homogeneous databases.

During this learning process, only the singular vectors, the singular values, and the number of samples in each dataset are transferred. Yet, this process generates exactly the same global principal components as if they were calculated from a centralized database.

Algorithm 1: Global PCA in distributed homogeneous databases.
Data: local datasets X_i of size m_i × d, i = 1, ..., r, where r is the number of distributed datasets.
Result: global principal components V_G.
% The following steps can be done in parallel at the local sites:
  Mean-center X_i such that E[X_i] = O (a 1 × d zero vector);
  SVD: X_i = U_i Σ_i V_i^T, where the non-zero singular values along the diagonal of Σ_i form the vector σ_i;
% The following steps can be done serially or hierarchically:
  i = 1;
  while i < r do
      transfer V_i and σ_i to where X_{j=i+1} is located;
      reconstruct the covariance matrix C_{i+j} using Eq. 6;
      increase i by 1;
  end
  C_G = C_{1+...+r};
  C_G = V_G Σ_G² V_G^T;
  V_G contains the global principal components.

The amount of data transferred in the proposed approach is on the order of O(rd²), compared to the transfer of the original local databases, which is on the order of O(rmd), where m ≫ d. Step 2 can also be carried out hierarchically. Figure 2 shows the two different integration scenarios.

Figure 2: Two scenarios can be used to integrate local singular vectors and singular values: (a) sequential; (b) hierarchical. The proposed approach is described in detail in Algorithm 1.
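A compact sketch of Algorithm 1 under the sequential scenario of Fig. 2(a) is given below. The function name global_pca and the synthetic sites are illustrative assumptions, not the authors' code; the merge step applies Eq. 6 as a running sum over the r sites, and the result is compared against centralized PCA on the stacked data:

```python
import numpy as np

def global_pca(local_datasets):
    """Algorithm 1 (serial integration): derive global PCs from mean-centered local datasets X_i."""
    C_acc, n_acc = None, 0
    for X in local_datasets:
        # At the local site: SVD, then ship only (V_i, sigma_i, m_i).
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        m = X.shape[0]
        C_local = Vt.T @ np.diag(s**2) @ Vt / (m - 1)         # cov(X_i) rebuilt from V_i, sigma_i
        # At the receiving site: merge covariances as in Eq. 6, kept as a running sum.
        if C_acc is None:
            C_acc, n_acc = C_local, m
        else:
            C_acc = ((n_acc - 1) * C_acc + (m - 1) * C_local) / (n_acc + m - 1)
            n_acc += m
    # Global principal components: eigenvectors of the accumulated covariance C_G.
    eigvals, eigvecs = np.linalg.eigh(C_acc)
    return eigvecs[:, np.argsort(eigvals)[::-1]]              # V_G, sorted by decreasing eigenvalue

# Usage: the result matches centralized PCA on the stacked (never transferred) data.
rng = np.random.default_rng(3)
sites = [rng.normal(size=(m, 5)) for m in (300, 200, 400)]
sites = [X - X.mean(axis=0) for X in sites]                   # homogeneous, mean-centered sites
V_G = global_pca(sites)
_, _, Vt_central = np.linalg.svd(np.vstack(sites), full_matrices=False)
print(np.allclose(np.abs(V_G.T), np.abs(Vt_central), atol=1e-6))   # True up to sign
```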

IV GLOBAL PCA IN DISTRIBUTED HETEROGENEOUS DATABASES


The global PCA algorithm presented in Sec. III can also be used if the databases are heterogeneous, except that the global principal components are no longer exact if we only transfer the singular values and singular vectors, as shown in Eq. 7. Assume again that X and Y are two databases at two different locations, and that the structures of X and Y are not the same. The dataset X is of dimension d_X, and Y of dimension d_Y. X and Y are related through a public key. For example, different departments of a hospital might use different databases to record patient information. Patients' personal data can be organized in one table and patients' diagnostic histories might be stored in another table. However, the patient ID is unique and is used as a public key to link the two tables. Without loss of generality, assume X and Y have the same number of samples. That is, X is an m × d_X matrix, and Y an m × d_Y matrix. The combined covariance matrix of [X Y] can then be calculated as in Eq. 7,

    cov([X Y]) = (1/(m - 1)) [X Y]^T [X Y]
               = (1/(m - 1)) [ X^T X    X^T Y
                               Y^T X    Y^T Y ]
               = (1/(m - 1)) [ (m - 1) cov(X)    X^T Y
                               Y^T X             (m - 1) cov(Y) ]    (7)

where [X Y] denotes the m × (d_X + d_Y) matrix formed by joining X and Y column-wise through the public key.

If we assume data are transferred from where X is located to where Y is located, then the error introduced in the global principal components (Eq. 7) actually comes from the estimation of X, since Y is the local dataset. Therefore, the key problem in deriving the global principal components from distributed heterogeneous databases is how to estimate a local dataset when the computation is carried out away from that local site. According to the SVD algorithm, X can be decomposed as in Eq. 8,

    X = u_1 σ_1 v_1^T + u_2 σ_2 v_2^T + ... + u_j σ_j v_j^T + ... + u_d σ_d v_d^T    (8)

where u_j is the jth column vector of the left singular matrix of X, v_j the jth column vector of the right singular matrix of X, and σ_j the jth singular value of X. The singular values are arranged in descending order, i.e., σ_1 > σ_2 > ... > σ_j > ... > σ_d. Usually, the first component in Eq. 8 (u_1 σ_1 v_1^T) contains most of the information of X and thus would be a good estimate of X. In other words, besides transferring the singular vectors and singular values, we also transfer u_1, the first column vector of the left singular matrix of X. The loss of information incurred by estimating X using only the first component in Eq. 8 can be formulated as in Eq. 9,

    (Σ_{j=2}^{d} σ_j) / (Σ_{j=1}^{d} σ_j)    (9)

Therefore, the amount of data transferred among heterogeneous databases is on the order of O(rd² + m). The more u_j's are transferred, the more accurate the estimate of X, but the more data need to be transferred as well.

Applying the above analysis to Eq. 7, we have

    cov([X̂ Y]) = (1/(m - 1)) [ (m - 1) cov(X̂)    X̂^T Y
                                Y^T X̂             (m - 1) cov(Y) ]

where X̂ = Σ_{i=1}^{t} u_i σ_i v_i^T approximates X and t is the number of u_j's transferred. The loss of information is then calculated by

    (Σ_{j=t+1}^{d} σ_j) / (Σ_{j=1}^{d} σ_j)
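As an illustration of the heterogeneous case, the sketch below (synthetic X and Y; illustrative names, not the authors' code) reconstructs a rank-t estimate of X from the transferred u_j's, evaluates the information loss, and assembles the combined covariance of Eq. 7:

```python
import numpy as np

# Heterogeneous case: the site holding Y rebuilds a rank-t estimate of X from t transferred terms.
rng = np.random.default_rng(4)
m, dX, dY, t = 500, 8, 5, 2
X = rng.normal(size=(m, dX)); X -= X.mean(axis=0)      # remote dataset (never transferred whole)
Y = rng.normal(size=(m, dY)); Y -= Y.mean(axis=0)      # local dataset, row-aligned via the public key

# At the site of X: SVD (Eq. 8); transfer V, sigma, and the first t columns of U.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = U[:, :t] @ np.diag(s[:t]) @ Vt[:t]             # rank-t estimate of X

loss = s[t:].sum() / s.sum()                           # information loss, Eq. 9 generalized to t terms

# Combined covariance of [X Y] with X replaced by its estimate (Eq. 7).
cov_X_hat = X_hat.T @ X_hat / (m - 1)
cov_Y = Y.T @ Y / (m - 1)
C = np.block([[(m - 1) * cov_X_hat, X_hat.T @ Y],
              [Y.T @ X_hat,         (m - 1) * cov_Y]]) / (m - 1)

print(f"information loss: {loss:.2f}, combined covariance shape: {C.shape}")
```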

V EXPERIMENTS AND RESULTS


The experiments are done on three data sets (Abalone, Pageblocks, and Mfeat) from the UCI Machine Learning Repository [1]. We use Abalone and Pageblocks to simulate the homogeneous distributed environment, and Mfeat the heterogeneous distributed environment. The details of all data sets are shown in Table 1. We adopt two metrics to evaluate the performance of global PCA: the classification accuracy, and the Euclidean distance between the major components derived from the global PCA and a local PCA respectively. For the purpose of simplification, we choose the minimum distance classifier, where a sample is assigned to a class if its distance to the class mean is the minimum. In each subset, 30% of the data are used as the training set, and the other 70% as the test set.

Data Set     Nr. of Attributes   Nr. of Classes   Nr. of Samples
Abalone      8                   29               4177
Pageblocks   10                  5                5473
Mfeat        646                 10               2000

Table 1: Details of data sets used in the simulation.

Here, we outline the PCA and classification processes designed in our experiments given the distributed data sets (including both the training and test sets). Assume X_i is a subset at location i, X_i^Tr the training set, and X_i^Te the test set, where X_i = X_i^Tr ∪ X_i^Te.

Step 1: Apply PCA on X_i^Tr to derive the principal components and use those components which keep most of the information as P_i. A parameter indicating the allowed information loss is used to control how many components need to be used. In all the experiments, an information loss ratio of 10% is used.

Step 2: Project both X_i^Tr and X_i^Te onto P_i to reduce the dimension of the original data set, and obtain PX_i^Tr and PX_i^Te respectively.

Step 3: Use PX_i^Tr and PX_i^Te as the local model for local classification. The local classification accuracies are averaged and compared to the classification accuracy derived from the global PCA.

Step 4: Calculate the distance between the major component in P_i and that in the global principal components for performance evaluation.
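The per-site evaluation loop (Steps 1-4) might look as follows. The helper names, the synthetic 30%/70% split, and the random labels are illustrative assumptions; the 10% information-loss threshold and the minimum distance classifier follow the description above:

```python
import numpy as np

def pca_components(X_train_centered, max_loss=0.10):
    """Step 1: keep the leading components whose discarded singular values amount to < max_loss."""
    _, s, Vt = np.linalg.svd(X_train_centered, full_matrices=False)
    loss = 1.0 - np.cumsum(s) / s.sum()                # information loss after keeping k components
    k = int(np.argmax(loss <= max_loss)) + 1
    return Vt[:k].T                                    # P_i: d x k projection matrix

def min_distance_classify(train_proj, train_labels, test_proj):
    """Step 3: assign each test sample to the class whose mean is closest."""
    classes = np.unique(train_labels)
    means = np.stack([train_proj[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_proj[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Usage on one local subset (synthetic stand-in for a 30%/70% train/test split).
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8)); y = rng.integers(0, 3, size=200)
n_tr = int(0.3 * len(X))
mu = X[:n_tr].mean(axis=0)
P = pca_components(X[:n_tr] - mu)                             # Step 1: local principal components P_i
train_p, test_p = (X[:n_tr] - mu) @ P, (X[n_tr:] - mu) @ P    # Step 2: project train and test sets
pred = min_distance_classify(train_p, y[:n_tr], test_p)       # Step 3: local classification
print(f"local accuracy: {(pred == y[n_tr:]).mean():.3f}")     # Step 4 compares against the global PCA
```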

A Global PCA for Distributed Homogeneous Databases


Abalone is a data set used to predict the age of abalone from physical measurements. It contains 4177 samples from 29 classes. We randomly divide the whole data set into 50 homogeneous subsets of the same size. All the subsets have the same number of attributes (or features). Pageblocks is a data set used to classify all the blocks of the page layout of a document that has been detected by a segmentation process. It has 5473 samples from 5 classes. We also randomly divide this data set into 50 homogeneous subsets. Figures 3 and 4 show the performance comparisons with respect to the classification accuracy and the Euclidean distance on the Abalone data set. We observe (Fig. 3) that even though some of the local classification accuracies are higher than the accuracy using the global PCA, the average local accuracy (0.8705) is 7.8% lower than the global classification accuracy (0.9444). Similar patterns can be observed in Figs. 5 and 6, which are results generated from the Pageblocks data set. For this data set, the global classification accuracy (0.7545) is 23% higher than the averaged local classification accuracy (0.6135).

B Global PCA for Distributed Heterogeneous Databases


Mfeat is a data set that consists of features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. Six different feature selection algorithms are applied and the features are saved in six data files:

1. mfeat-fac: 216 profile correlations
2. mfeat-fou: 76 Fourier coefficients of the character
3. mfeat-kar: 64 Karhunen-Loève coefficients
4. mfeat-mor: 6 morphological features
5. mfeat-pix: 240 pixel averages in 2 × 3 windows
6. mfeat-zer: 47 Zernike moments

Each data file has 2000 samples, and corresponding samples in different feature sets (files) correspond to the same original character. We use these six feature files to simulate a distributed heterogeneous environment. Figure 7 shows a comparison between the global and local classifications. Notice that the global classification accuracy is calculated under the assumption that no information is lost, that is, all the u_j's are transferred and the local data set is accurately regenerated.


Figure 3: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Abalone).

Figure 4: Euclidean distance between the major components derived from the global PCA and the local PCA (Abalone).


Figure 5: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Pageblocks).

Figure 6: Euclidean distance between the major components derived from the global PCA and the local PCA (Pageblocks).


Figure 7: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components (calculated based on all features). The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 6 subsets (Mfeat).

However, in real applications, transferring all the u_j's is very inefficient, since it consumes a tremendous amount of network bandwidth and computing resources. Figure 8 shows the trade-off between the classification accuracy and the number of u_j's transferred between local data sets. We use

    (Σ_{j=t+1}^{d} σ_j) / (Σ_{j=1}^{d} σ_j)

to calculate the information loss, where t is the number of u_j's transferred. We observe that when only one u_j is transferred, the information loss is about 40%, but the classification accuracy is, interestingly, slightly higher than that calculated with all u_j's transferred. As the number of transferred u_j's increases to 10 and 20, the information loss drops to about 15% and 10% respectively, but the classification accuracy does not change; in fact, it converges to the accuracy derived when all u_j's are transferred. Figure 8 is a good example of the effectiveness of the first component (u_1 σ_1 v_1^T) in approximating the original data set (Eq. 8).


Figure 8: Effect of the number of left singular vectors (u_j) transferred. Top-left: information loss vs. t. Bottom-left: Euclidean distance between the major component derived from the global PCA with t u_j's transferred and the major component derived from the global PCA with all u_j's transferred. Bottom-right: classification accuracy vs. t.


VI CONCLUSION
This paper discusses the problem of distributed data mining. It develops an algorithm to derive the global principal components from distributed databases by transferring mainly the singular vectors and singular values of the local datasets. When the databases are homogeneous, the derived global principal components are exactly the same as those calculated from a centralized database. When the databases are heterogeneous, the global principal components are no longer exact under the same algorithm. We quantitatively analyze the error introduced with respect to different amounts of local data transferred.

References
[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.

[2] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Workshop on Knowledge Discovery in Databases, pages 227-240. AAAI, 1993.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2001.

[4] R. L. Grossman. Data mining: challenges and opportunities for data mining during the next decade. http://www.lac.uic.edu, May 1997.

[5] Y. Guo and J. Sutiwaraphun. Advances in Distributed Data Mining, chapter 1: Distributed Classification with Knowledge Probing, pages 1-25. AAAI, 2001.

[6] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Under consideration for publication in Knowledge and Information Systems, 2000.

[7] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Advances in Distributed Data Mining, chapter: Collective data mining: a new perspective toward distributed data mining. AAAI Press, 2002. Submitted for publication.

[8] J. Naisbitt and P. Aburdene. Megatrends 2000: Ten New Directions for the 1990s. Morrow, New York, 1990.

[9] F. J. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.

[10] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Proceedings Third International Conference on Knowledge Discovery and Data Mining, pages 74-81, Menlo Park, CA, 1997. AAAI Press.

[11] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14-25, October-December 1999.

