
International Journal of Advances in Engineering & Technology, May 2013.

IJAET ISSN: 2231-1963

EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS
Nada Badr and Noureldien A. Noureldien
Department of Computer Science, University of Science and Technology, Omdurman, Sudan

ABSTRACT
Intrusion detection has grasped the attention of both commercial institutions and the academic research community. In this paper PCA (Principal Components Analysis) is utilized as an unsupervised technique to detect multivariate outliers on a dataset of one hour duration. PCA is sensitive to outliers since it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The results obtained from the experiments show that PCA generates a high false alarm rate due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method undergoes.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is to say, PCA analyzes data sets that represent observations described by several dependent variables that are intercorrelated. PCA is one of the best known and most used multivariate exploratory analysis techniques [5]. Several robust competitors to the classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of the sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. The minimum covariance determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the estimate of scatter is their covariance matrix. Another robust method for principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data is maximized. In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP, by applying PCA to the Abilene dataset and comparing its outlier detection performance to that of MCD and PP. The rest of this paper is organized as follows. Section 2 is an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. In Section 5 the experiment results are shown; conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of studies have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced by Lakhina [13], whereby principal components analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise. Labib [2] utilized PCA for reducing the dimension of the traffic data and for visualizing and identifying attacks. Bouzida et al. [7] presented a performance study of two machine learning algorithms, namely nearest neighbors and decision trees, when used on traffic data with or without PCA. They discovered that when PCA is applied to the KDD99 dataset to reduce the dimension of the data, the algorithms' learning speed is improved while accuracy remains the same. Terrell [9] used principal components analysis on features of aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry [10] proposed the use of singular value decomposition and wavelet transforms for detecting anomalies in self-similar network traffic data. Wang [12] proposed an anomaly intrusion detection model based on PCA for monitoring network behaviors. The model utilizes PCA to reduce the dimensions of historical data and to build the normal profile, as represented by the first few principal components. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold. Shyu [4] proposed an anomaly detection scheme based on robust principal components analysis. Two classifiers were implemented to detect anomalies: one based on the major components that capture most of the variations in the data, and the second based on the minor components or residuals. A new observation is considered an outlier or anomalous when the sum of squares of the weighted principal components exceeds the threshold in either of the two classifiers. Lakhina [6] applied principal components analysis to Origin-Destination (OD) flow traffic; the traffic is separated into normal and anomalous subspaces by projecting the data onto the resulting principal components one at a time, ordered from high to low. Principal components (PCs) are added to the normal subspace as long as a predefined threshold is not exceeded; when the threshold is exceeded, that PC and all subsequent PCs are added to the anomalous subspace. New OD flow traffic is projected onto the anomalous subspace, and an anomaly is flagged if the value of the squared prediction error, or Q-statistic, exceeds a predefined limit.

PCA is thus widely used to identify lower-dimensional structure in data and is commonly applied to high-dimensional data. PCA represents the data by a small number of components that account for the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis, cluster analysis, etc. In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sensitive to outlying observations and render PCA unreliable when outliers are encountered.

III. CLASSICAL PCA MODEL

The PCA detection model detects outliers by projecting the observations of the dataset on the newly computed axes, known as PCs. The outliers detected by the PCA method are of two types: outliers detected by the major PCs, and outliers detected by the minor PCs. The basic goals of PCA [5] are to extract the important information from the data set, to compress the size of the data set by keeping only this important information, and to simplify the description of the data and analyze the structure of the observations and variables (finding patterns of similarity and difference). To achieve these goals PCA calculates new variables from the original variables, called Principal Components (PCs). The computed variables are linear combinations of the original variables (chosen to maximize the variance of the projected observations) and are uncorrelated. The first computed PCs, called the major PCs, have the largest inertia (total variance in the data set), while the later computed PCs, called the minor PCs, have the greatest residual inertia and are orthogonal to the first principal components. The Principal Components define orthogonal directions in the space of observations. In other words, PCA just makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.


3.1 PCA Advantages
PCA's common advantages are:

3.1.1 Exploratory Data Analysis
PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones). From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data). According to most studies [8][11], PCA detects two types of outliers: type 1, outliers that inflate variance, which are detected by the major PCs; and type 2, outliers that violate structure, which are detected by the minor PCs.

3.1.2 Data Reduction Technique
All multivariate techniques are prone to the bias-variance tradeoff, which states that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. PCA is better than other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

3.1.3 Low Computational Requirement
PCA needs low computational effort since its algorithm consists of simple calculations.

3.2 PCA Disadvantages


It may be noted that PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation, and that most of the information is contained in those directions where the input data variance is maximum. As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task). From the above, the following disadvantages of PCA are concluded.

3.2.1 Dependence on Linear Algebra
PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a richer data description.

3.2.2 Smallest Principal Components Receive No Attention in Statistical Techniques
The lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance in the data, the smallest principal components only contain the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection.

3.2.3 High False Alarms
Principal components are sensitive to outliers, since the principal components are determined by their directions and are calculated from classical estimators such as the classical mean and the classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method can be affected by outliers so severely that the PCA model cannot detect all the actually deviating observations; this is known as the masking effect. In addition, some good data points might even appear to be outliers, which is known as the swamping effect. Masking and swamping cause PCA to generate a high false alarm rate. To reduce this high false alarm rate, the use of robust estimators was proposed, since outlying points are less likely to enter into the calculation of robust estimators. The well-known PCA robustification methods are the Minimum Covariance Determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is b_n = (n - h + 1)/n; hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data is maximized. PP is applied where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires the number of dimensions of the dataset not to exceed 50. Principal Component Analysis (PCA) is an example of the PP approach, because both search for directions with maximal dispersion of the data projected on them; but instead of using the variance as the measure of dispersion, PP uses a robust scale estimator [4].
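As a small numerical illustration of the relation between h and the breakdown value (our sketch, not from the paper; scikit-learn's MinCovDet is assumed as an off-the-shelf MCD implementation, and the data is simulated):

```python
# Sketch: the subset size h (h > n/2) sets the MCD breakdown value
# b_n = (n - h + 1)/n, the fraction of outliers the estimator resists.
import numpy as np
from sklearn.covariance import MinCovDet

n, p = 144, 12  # shape of the traffic matrix used later in the paper
X = np.random.default_rng(0).normal(size=(n, p))

for h in (73, 101, 130):
    b_n = (n - h + 1) / n
    mcd = MinCovDet(support_fraction=h / n, random_state=0).fit(X)
    print(f"h={h}: breakdown value {b_n:.2f}, "
          f"det of MCD scatter {np.linalg.det(mcd.covariance_):.3g}")
```

Smaller h gives a higher breakdown value (more robustness) at the cost of statistical efficiency, which is why h is a tuning choice rather than a fixed constant.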

V. EXPERIMENTS AND RESULTS

In this section we show how we test PCA and its robustification methods, MCD and PP, on a dataset. The data used consists of OD (Origin-Destination) flows collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix of 2004-03-01, which is the traffic matrix Yin Zhang built from the Abilene network. The dataset is available in offline mode, where it is extracted from the offline traffic matrix.

5.1 PCA on Dataset


At first, the dataset (the traffic matrix) is arranged into the data matrix X, whose rows represent observations and whose columns represent variables or dimensions:

X_(144x12) = [ x_{1,1} ... x_{1,12} ; ... ; x_{144,1} ... x_{144,12} ]

The following steps are considered in applying the PCA method to the dataset.

Center the dataset to have zero mean. The mean vector is calculated from the following equation:

x̄ = (1/n) Σ_{i=1}^{n} x_i    (1)

and subtracted from each dimension. The product of this step is the centered data matrix Y, which has the same size as the original dataset:

Y(i,j) = X(i,j) - x̄(j)    (2)

The covariance matrix is calculated from the following equation:

C = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)^T    (3)

Find the eigenvectors and eigenvalues of the covariance matrix, where the eigenvalues are the diagonal elements of Λ, using the eigen-decomposition in equation (4):

C E = E Λ    (4)

where E is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. The eigenvalues are ordered in decreasing order and the eigenvectors are sorted according to the ordered eigenvalues; the sorted eigenvector matrix is the loadings matrix. Finally, the scores matrix (the dataset projected on the principal components), which expresses the relations between the principal components and the observations, is calculated from the following equation:

S_(n,p) = Y_(n,p) E_(p,p)    (5)
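A minimal NumPy sketch of steps (1)-(5); the 144x12 matrix is simulated here, since the Abilene extract itself is not reproduced:

```python
# Sketch of the classical PCA steps; equation numbers refer to the text.
import numpy as np

X = np.random.default_rng(1).normal(size=(144, 12))  # stand-in data matrix

mean = X.mean(axis=0)                 # eq. (1): mean vector
Y = X - mean                          # eq. (2): centered data matrix
C = (Y.T @ Y) / (len(Y) - 1)          # eq. (3): covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eq. (4): eigen-decomposition
order = np.argsort(eigvals)[::-1]     # decreasing eigenvalue order
eigvals, loadings = eigvals[order], eigvecs[:, order]  # loadings matrix
scores = Y @ loadings                 # eq. (5): scores matrix
```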

The 97.5% tolerance ellipse is then applied to the bivariate datasets (data projected on the first PCs, and data projected on the minor PCs) to reveal outliers automatically. The ellipse is defined by the data points whose distance from the center equals the square root of the 97.5% quantile of the chi-square distribution with 2 degrees of freedom:

d = sqrt( χ²_{2,0.975} )    (6)
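A minimal sketch of this cutoff (our code; the bivariate score pairs are simulated, and SciPy's chi2 supplies the quantile):

```python
# Sketch of eq. (6): flag score pairs whose squared Mahalanobis distance
# exceeds the chi-square 97.5% quantile with 2 degrees of freedom.
import numpy as np
from scipy.stats import chi2

scores2 = np.random.default_rng(2).normal(size=(144, 2))  # e.g. (PC1, PC2)

center = scores2.mean(axis=0)
cov = np.cov(scores2, rowvar=False)
diff = scores2 - center
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)  # squared distances

outliers = np.flatnonzero(d2 > chi2.ppf(0.975, df=2))  # outside the ellipse
```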

The screeplot is studied, and the first and second principal components account for 98% of the total variance of the dataset, so the first two principal components are retained to represent the dataset as a whole. Figure (1) shows the screeplot; the plot of the data projected onto the first two principal components, in order to reveal the outliers in the dataset visually, is shown in figure (2).
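The retention rule itself is easy to express; a brief sketch with simulated eigenvalues:

```python
# Sketch: retain the smallest number k of leading PCs whose cumulative
# explained variance reaches 98%, as done with the screeplot above.
import numpy as np

eigvals = np.sort(np.random.default_rng(5).exponential(size=12))[::-1] ** 3
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.98)) + 1
print(f"retain the first {k} PCs ({explained[k - 1]:.1%} of total variance)")
```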
Figure 1: PCA Screeplot
Figure 2: PCA Visual Outliers

Figure (3) shows the tolerance ellipse on the major PCs, and figures (4) and (5) show, respectively, the visual recording of outliers from scatter plots of the data projected on the minor principal components, and the outliers detected by the minor principal components tuned by the tolerance ellipse.
Figure 3: PCA Tolerance Ellipse
Figure 4: PCA Type 2 Outliers

Figure 5: PCA Tuned Minor PCs

5.2 MCD on Dataset


Testing the robust MCD (Minimum Covariance Determinant) estimator yields a robust location measure T_mcd and a robust dispersion matrix Σ_mcd. The following steps are applied to test MCD on the dataset in order to reach the robust principal components. The robust distance measure is calculated from the formula:

R_i = (x_i - T_mcd(X))^T Σ_mcd(X)^(-1) (x_i - T_mcd(X)),  for i = 1, ..., n    (7)

From the robust covariance matrix Σ_mcd the following are calculated: the robust eigenvalues, as a diagonal matrix, as in equation (4) with n replaced by h; and the robust eigenvectors, forming the loadings matrix, as in equation (5). The robust scores matrix is calculated in the following form:

S_(h,p) = Y_(h,p) E_(p,p)    (8)

The robust screeplot, retaining the first two robust principal components, which account for over 98% of the total variance, is shown in figure (6). Figures (7) and (8) show, respectively, the visual recording of outliers from scatter plots of the data projected on the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures (9) and (10) show the visual recording of outliers from scatter plots of the data projected on the robust minor principal components, and the outliers detected by the robust minor principal components tuned by the tolerance ellipse, respectively.
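A hedged sketch of these steps, assuming scikit-learn's MinCovDet as the MCD implementation and a simulated data matrix in place of the Abilene extract:

```python
# Sketch of MCD-based robust PCA; equation numbers refer to the text.
import numpy as np
from sklearn.covariance import MinCovDet

X = np.random.default_rng(4).normal(size=(144, 12))  # stand-in dataset

mcd = MinCovDet(random_state=0).fit(X)
T_mcd, S_mcd = mcd.location_, mcd.covariance_  # robust location and scatter

# Eq. (7): squared robust distances (x_i - T)' inv(Sigma) (x_i - T).
R = mcd.mahalanobis(X)

# Robust eigenanalysis of the MCD scatter, then robust scores as in eq. (8).
eigvals, eigvecs = np.linalg.eigh(S_mcd)
order = np.argsort(eigvals)[::-1]
eigvals, loadings = eigvals[order], eigvecs[:, order]
robust_scores = (X - T_mcd) @ loadings
```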
Figure 6: MCD Screeplot
Figure 7: MCD Visual Outliers

Figure 8: MCD Tolerance Ellipse
Figure 9: MCD Type 2 Outliers


Figure 10: MCD Tuned Minor PCs

5.3 Projection Pursuit on Dataset


Testing the projection pursuit method on the dataset involves the following steps.

Center the data matrix X(n,p) around the L1-median to reach the centered data matrix Y(n,p):

Y_(n,p) = X_(n,p) - L1(X)    (9)

where L1(X) is a highly robust estimator of multivariate data location that resists up to 50% of outliers [11].

Construct the candidate directions p_i as the normalized rows of the centered data matrix; in this process the data is first reduced to its rank through a singular value decomposition (SVD) (equations (10)-(12)).

Project the dataset on all possible directions:

z_i = Y p_i^T    (13)

Calculate the robust scale estimator Qn for all the projections and find the direction that maximizes the Qn estimator:

q = argmax_{p_i} Qn(Y p_i^T)    (14)

Qn is a scale estimator; essentially it is the first quartile of all pairwise distances between two data points [5]. These steps yield the robust eigenvectors (PCs), and the squared values of the robust scale estimator are the robust eigenvalues.

Project all the data on the selected direction q to obtain the robust principal component scores:

t = Y q    (15)

Update the data matrix by its orthogonal complement:

Y ← Y (I - q q^T)    (16)
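A rough sketch of the procedure (not the authors' implementation) in the spirit of Croux & Ruiz-Gazen [3]: candidate directions are the normalized centered observations, a simple Qn stand-in (the first quartile of all pairwise distances, cf. [5]) scores each projection, and the data is deflated after each extracted direction. The coordinate-wise median below is our stand-in for the L1-median:

```python
# Sketch of PP-based robust PCA; equation numbers refer to the text.
import numpy as np

def qn_scale(z):
    # Stand-in for Qn: first quartile of pairwise absolute differences.
    d = np.abs(z[:, None] - z[None, :])
    return np.quantile(d[np.triu_indices(len(z), k=1)], 0.25)

def pp_pca(X, n_components=2):
    Y = X - np.median(X, axis=0)              # eq. (9), with a median stand-in
    components, eigvals = [], []
    for _ in range(n_components):
        norms = np.linalg.norm(Y, axis=1, keepdims=True)
        dirs = Y / np.maximum(norms, 1e-12)   # candidate directions (10)-(12)
        proj = Y @ dirs.T                     # eq. (13): all projections
        best = np.argmax([qn_scale(proj[:, j]) for j in range(dirs.shape[0])])
        q = dirs[best]                        # eq. (14): direction maximizing Qn
        components.append(q)
        eigvals.append(qn_scale(Y @ q) ** 2)  # robust eigenvalue
        Y = Y - np.outer(Y @ q, q)            # eq. (16): deflate the data
    return np.array(components), np.array(eigvals)

X = np.random.default_rng(3).normal(size=(144, 12))
pcs, robust_eigvals = pp_pca(X)
robust_scores = (X - np.median(X, axis=0)) @ pcs.T  # eq. (15): robust scores
```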

Project all the data on the orthogonal complement and repeat the search for the next direction (17). The plot of the data projected on the first two robust principal components, used to detect outliers visually, is shown in figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in figure (12). Figures (13) and (14) show, respectively, the plot of the data projected on the minor robust principal components to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse.
Figure 11: PP Visual Outliers
Figure 12: PP Tolerance Ellipse


Figure 13: PP Type 2 Outliers
Figure 14: PP Tuned Minor PCs

5.4 Results
Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The results of the MCD and PP methods reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with a slight difference, since only 12 dimensions are used in the dataset.
Table 1: Outliers Detection

PCA outliers          MCD outliers          PP outliers           Masking   Swamping
(major & minor PCs)   (major & minor PCs)   (major & minor PCs)
66                    66                    66                    No        No
99                    99                    99                    No        No
100                   100                   100                   No        No
116                   116                   116                   No        No
117                   117                   117                   No        No
118                   118                   118                   No        No
119                   119                   119                   No        No
120                   120                   120                   No        No
129                   129                   129                   No        No
131                   131                   131                   No        No
135                   135                   135                   No        No
Normal                Normal                69                    Yes       No
Normal                Normal                70                    Yes       No
71                    Normal                Normal                No        Yes
76                    Normal                Normal                No        Yes
81                    Normal                Normal                No        Yes
101                   Normal                Normal                No        Yes
104                   Normal                Normal                No        Yes
111                   Normal                Normal                No        Yes
144                   Normal                Normal                No        Yes
Normal                84                    Normal                Yes       No
Normal                96                    Normal                Yes       No
Normal                97                    97                    Yes       No
Normal                98                    98                    Yes       No

VI. CONCLUSION AND FUTURE WORK

The study has examined the performance of PCA and its robustification methods (MCD, PP) for intrusion detection by presenting the bi-plots and extracting outlying observations that are very different from the bulk of the data. The study showed that the tuned (tolerance ellipse) results are identical to the visualized ones. The study attributes the PCA false alarms to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with a slight difference in type 2 outliers, since these are considered a source of noise. Our future work will consider applying the hybrid method (ROBPCA), which uses PP as a dimension reduction technique and MCD as a robust measure, for further performance improvement, and applying a dynamic robust PCA model with regard to online intrusion detection.

REFERENCES
[1]. Abilene TMs, collected by Y. Zhang. www.cs.utexas.edu/yzhang/research, visited on 13/07/2012.
[2]. Khalid Labib and V. Rao Vemuri, "An application of principal components analysis to the detection and visualization of computer network attacks". Annals of Telecommunications, pages 218-234, 2005.
[3]. C. Croux and A. Ruiz-Gazen, "A fast algorithm for robust principal components based on projection pursuit". COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, pp. 211-217.
[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A novel anomaly detection scheme based on principal component classifier". In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03).
[5]. J. Edward Jackson, "A User's Guide to Principal Components". Wiley-Interscience, 1st edition, 2003.
[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot, "Diagnosing network-wide traffic anomalies". In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 2004.
[7]. Yacine Bouzida, Frédéric Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault, "Efficient intrusion detection using principal component analysis". La Londe, France, June 2004.
[8]. R. Gnanadesikan, "Methods for Statistical Data Analysis of Multivariate Observations". Wiley-Interscience, New York, 2nd edition, 1997.
[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Zhu, and A. Nobel, "Multivariate SVD analysis for network anomaly detection". In Proceedings of the ACM SIGCOMM Conference, 2005.
[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati, "Network traffic analysis using singular value decomposition and multiscale transforms". Information Sciences: An International Journal, 2007.

[11]. I. T. Jolliffe, "Principal Component Analysis". Springer Series in Statistics, Springer, New York, 2nd edition, 2002.
[12]. Wei Wang, Xiaohong Guan, and Xiangliang Zhang, "Processing of massive audit data streams for real-time anomaly intrusion detection". Computer Communications, Elsevier, 2008.
[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows". In Proceedings of SIGMETRICS, New York, NY, USA, 2004.

AUTHORS BIOGRAPHIES
Nada Badr earned her B.Sc. in Mathematical and Computer Sciences at the University of Gezira, Sudan. She received her M.Sc. in Computer Science at the University of Science and Technology. She is pursuing her Ph.D. in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor in Computer Science, Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his Ph.D. in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.
