
Data Processing for Outliers Detection

Silvia Cateni and Valentina Colla


PERCRO - Istituto di Tecnologie della Comunicazione, dell'Informazione e della Percezione, Scuola Superiore S. Anna, Pisa, Italy

Introduction

Outlier detection is an important branch of data pre-processing and data mining, as this stage is required in the elaboration and mining of data coming from many application fields such as industrial processes, transportation, ecology, public safety and climatology. Outliers are data which can be considered anomalous due to several causes (e.g. erroneous measurements or anomalous process conditions). Outlier detection techniques are used, for instance, to minimize the influence of outliers on the final model to be developed, or as a preliminary pre-processing stage before the information conveyed by a signal is elaborated. On the other hand, in many applications, such as network intrusion, medical diagnosis or fraud detection, outliers are more interesting than the common samples, and outlier detection techniques are used to search for them. Traditional outlier detection methods can be classified into four main approaches: distance-based, density-based, clustering-based and distribution-based. Each of these approaches presents advantages and limitations, thus in recent years many contributions have been proposed to overcome them and improve the quality of the data. Classical methods are often not suitable to treat some particular databases, therefore recent studies have been conducted on outlier detection for these kinds of datasets. In particular, a high number of contributions based on artificial intelligence, genetic algorithms and image processing have been proposed in order to develop new efficient outlier detection methods that can be suitable in many different applications. This chapter is organized as follows: in Section 2 an introduction to outlier detection definitions and potential applications is proposed. Section 3 presents a review of traditional outlier detection methods, while in Section 4 some outlier detection techniques based on particular data representations are discussed. In Section 5 recent approaches that are capable of outperforming the widely adopted traditional methods are described, and Section 6 introduces the application of outlier detection methods to the image processing area. Section 7 illustrates the results obtained on a synthetic case study and, finally, Section 8 provides some concluding remarks.

Outlier Detection: Definitions and Applications

An outlier in a dataset is a measurement that differs markedly from the other values. The classical definition of outlier is due to Hawkins (Hawkins, 1980), who defines an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". Another definition is given by Barnett and Lewis (Barnett & Lewis, 1994), who define an outlier as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data". Aggarwal and Yu (Aggarwal & Yu, 2001) state that "outliers may be considered as noise points lying outside of a set of defined clusters or alternatively outliers may be considered as the points that lie outside of the set of clusters but also are separated from the noise". The application fields of outlier detection include fraud detection, weather prediction, fault diagnosis, detection of novelties in images (e.g. for robot neotaxis or surveillance systems), motion segmentation, satellite image analysis, medical condition monitoring and others (Hodge, 2004).

Classical Methods

The main traditional approaches to outlier detection can be classified into four categories: distance-based, density-based, clustering-based and distribution-based.

3.1 Distance-Based Method

The distance-based outlier method was presented in (Knorr & Ng, 1998), where the definition of outlier becomes: "An object O in a dataset T is a DB(p,D)-outlier if at least fraction p of the objects in T lie at a distance greater than D from O". The parameter p is the minimum fraction of objects that must lie outside an outlier's D-neighborhood. In several approaches the Mahalanobis distance is used as outlying degree (Matsumoto, 2007). The Mahalanobis distance (Mahalanobis, 1936) is defined as in equation (1):

D_M(x) = \sqrt{(x - \mu)^T C^{-1} (x - \mu)}    (1)

where x is the data vector, \mu is the center of mass of the dataset and C is the covariance matrix. The Mahalanobis distance can thus be read as the distance between each point and the center of mass of the data; if the covariance matrix is the identity matrix, it reduces to the Euclidean distance. Data points that are located far away from the center of mass are detected as outliers.
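As an illustration, the following minimal Python sketch flags as outliers the points whose Mahalanobis distance from the center of mass exceeds a threshold; the threshold value and the synthetic data are assumptions made here purely for demonstration.

```python
import numpy as np

def mahalanobis_outliers(X, threshold=3.0):
    """Flag points whose Mahalanobis distance from the center of mass
    exceeds a user-chosen threshold (the value 3.0 is an assumption)."""
    mu = X.mean(axis=0)                      # center of mass of the dataset
    C = np.cov(X, rowvar=False)              # covariance matrix
    C_inv = np.linalg.pinv(C)                # pseudo-inverse for robustness
    diff = X - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, C_inv, diff))  # D_M per point
    return d, d > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    X[:5] += 8.0                             # inject a few artificial outliers
    distances, is_outlier = mahalanobis_outliers(X)
    print(np.where(is_outlier)[0])           # indices of detected outliers
```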

3.2 Density-Based Method

The density-based approaches calculate the density of the data and consider as outliers the points that lie in regions with low density. An important contribution was given by Breunig et al. (Breunig et al., 2000), who assigned an index value named Local Outlier Factor (LOF) to each object on the basis of the local density of its neighborhood: a high LOF indicates that the considered object is an outlier. The following definitions are necessary to understand the LOF method:

- k-distance of an instance x: the distance d(x,y) between two instances x and y belonging to dataset D such that a) for at least k instances (k is a positive integer) y' \in D - \{x\} it holds that d(x,y') \le d(x,y), and b) for at most k-1 instances y' \in D - \{x\} it holds that d(x,y') < d(x,y).

- k-distance neighborhood of an instance x: defined in (2), it includes the instances whose distance from x is not greater than the k-distance:

N_{k-distance(x)}(x) = \{ q \in D - \{x\} : d(x,q) \le k-distance(x) \}    (2)

where the objects q are called the k-nearest neighbors of x.

- reachability distance of an instance x with respect to the instance y: if k is a natural number, the reachability distance of object x with respect to object y is defined as:

reach-dist_k(x,y) = \max\{ k-distance(y), d(x,y) \}    (3)

- local reachability density of an instance x: the inverse of the average reachability distance based on the MinPts-nearest neighbors of x. MinPts is an important parameter required by the LOF algorithm, representing the number of nearest neighbors used in defining the local neighborhood of the instance:

lrd_{MinPts}(x) = \left[ \sum_{o \in N_{MinPts}(x)} reach-dist_{MinPts}(x,o) / |N_{MinPts}(x)| \right]^{-1}    (4)

Finally, the LOF is defined as:

LOF_{MinPts}(x) = \left[ \sum_{o \in N_{MinPts}(x)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(x)} \right] / |N_{MinPts}(x)|    (5)

i.e. the LOF is the average of the ratios between the local reachability density of each of the MinPts-nearest neighbors of x and the local reachability density of x itself. The LOF is an outlier degree and is used to decide whether an object is an outlier or not. When LOF assumes a value close to 1, x is comparable to its neighbors, the region is quite dense and the considered object is not an outlier; on the other hand, a LOF value significantly greater than 1 indicates that x is an outlier.
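As an illustration, scikit-learn provides an implementation of the LOF algorithm; the following minimal sketch assumes a numeric dataset stored as a NumPy array, and the choice n_neighbors=20 (playing the role of MinPts) is an arbitrary assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:5] += 6.0                                   # a few isolated points

# n_neighbors plays the role of MinPts; the decision threshold applied by
# fit_predict is a library default and thus a design assumption here.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 = outlier, +1 = inlier
lof_scores = -lof.negative_outlier_factor_     # LOF values; >> 1 means outlier

print(np.where(labels == -1)[0])
```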

3.3 Clustering-Based Method

Clustering-based methods consider as outliers the objects that do not belong to any cluster after a suitable clustering operation. A common variation relies on the use of a fuzzy model: fuzzy clustering assigns a membership degree to each sample for each cluster. The best-known fuzzy clustering algorithm is the Fuzzy C-Means (FCM). FCM is an unsupervised clustering algorithm due to Dunn (1974) and is based on the minimization of an objective function defined as the weighted sum of squared errors within groups, as described in the following equation:

J_m(U, V; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^m \| x_k - v_i \|^2    (6)

where V = (v_1, v_2, ..., v_c) is the vector of the cluster centers, u_{ik} is the grade of membership of datum x_k \in X to the cluster i and m > 1 is a weighting exponent controlling the fuzziness. When a stable condition is reached the iteration stops, and a point is associated to the cluster for which its membership value is maximal.
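As an illustration, the following minimal sketch implements the standard FCM update equations and then flags as outlier candidates the points that remain far from every cluster center; the number of clusters, the distance threshold and the synthetic data are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal FCM sketch: alternate the membership and center updates
    that minimize the objective (6)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # center update
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))                # standard membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2)),
               [[4.0, 20.0]]])                     # one point far from both groups
U, V = fuzzy_c_means(X, c=2)
# Points far from every center belong to no cluster: outlier candidates.
d_best = np.linalg.norm(X - V[U.argmax(axis=1)], axis=1)
print(np.where(d_best > 3.0)[0])                   # threshold 3.0 is an assumption
```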

3.4 Distribution-Based Method

The distribution-based approaches use a standard distribution to fit the dataset, and outliers are detected on the basis of the resulting probability distribution. A fundamental limit of this approach is that it requires an a priori knowledge of the probability distribution of the data. For many applications such knowledge is not available or obtainable; moreover the computational cost of fitting the data with common distributions (such as, for instance, the Gaussian, Log-Normal (Aitchison & Brown, 1957), Gumbel (Castillo, 1988) or Weibull (Canfield et al., 1981) distributions) can be considerable. A well known method belonging to this approach was proposed by Grubbs (Grubbs, 1969). The Grubbs test detects outliers if the data distribution can be approximated by a Gaussian function, by computing the following statistic:

G = \max_i |x_i - \mu| / \sigma    (7)

where \mu is the mean value of the data and \sigma is their standard deviation. If the variable G is greater than a tabulated critical value, then the sample corresponding to the maximum normalized distance from the mean value is considered an outlier. The Rosner test (Rosner, 1983) is a generalization of the Grubbs test and is used to find multiple outliers. In the Rosner test the parameter J, corresponding to the maximum number of possible outliers, must be fixed. Then the data are ranked in ascending order. Let \mu_0 and \sigma_0 be, respectively, the mean value and the standard deviation of the initial dataset. The sample x_0 farthest from \mu_0 is deleted from the data, and the mean value and the standard deviation (\mu_1, \sigma_1) are computed on the remaining data. This process is repeated until J extreme samples have been removed. Finally the following statistic is calculated and compared to a critical tabulated value (Gilbert, 1987):

R_J = | x_{J-1} - \mu_{J-1} | / \sigma_{J-1}    (8)

If R_J is higher than or equal to the critical value, then the J selected samples are considered outliers; otherwise the test is repeated on the previously removed samples with R_i = | x_{i-1} - \mu_{i-1} | / \sigma_{i-1} for decreasing i. If for some i the statistic R_i is at least equal to the critical value, then the samples x_k for 0 \le k < i are actually outliers; otherwise there are no outliers.
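As an illustration, the following sketch computes the Grubbs statistic (7) and compares it with the usual t-distribution approximation of the tabulated critical value; the significance level alpha = 0.05 and the example data are assumptions.

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs test: returns the index of the most suspect sample
    and whether G exceeds the critical value at significance level alpha."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)   # normalized distances from the mean
    G = z.max()
    # Critical value via the standard t-distribution formula.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return int(z.argmax()), G > G_crit

data = np.concatenate([np.random.default_rng(0).normal(0, 1, 50), [9.0]])
idx, is_outlier = grubbs_test(data)
print(idx, is_outlier)                         # the injected value 9.0 is flagged
```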

Outlier Detection Based on Particular Data Representations

A functional dependency (Ramakrishnan & Gehrke, 2002) is a relationship between attributes of a given dataset, i.e. for each sample the value of an attribute y can be calculated by exploiting the values of some other attributes (x_1, x_2, ..., x_n) in the form y = f(x_1, x_2, ..., x_n), and all the records (x_1, x_2, ..., x_n, y), also called tuples, should respect such a relation. When the functional dependency f is unknown, algorithms exist to discover it (Huhtala et al., 1999; Kivinen & Mannila, 1992). Quasi-functional dependencies (Huhtala et al., 1999; Bruno & Garza, 2007) are relationships that are not satisfied by all the tuples. Some methods base outlier detection on the respect of quasi-functional dependencies, as they label as outliers the few tuples which deviate from the common functional behavior; a minimal sketch of this idea is given below. The use of quasi-functional dependencies to detect anomalies was introduced in (Apiletti et al., 2006) and subsequently improved in (Bruno & Garza, 2007). However, both the above cited methods are limited to databases which do not contain time information. Temporal databases, on the other hand, contain attributes which vary over time, with the temporal aspects embedded in them (Date et al., 2002); they include all database applications that require some aspect of time when organizing their information. The main difference between non-temporal and temporal databases is that non-temporal databases consider the data stored at a single time instant, i.e. without considering past and future database states, while temporal databases attach a time period to the data. Temporal databases are widely used in several applications (Papadakis et al., 2006; Weekes et al., 2002; Chundi et al., 2009; Wu & Chen, 2009). Bruno & Garza (Bruno & Garza, 2010) introduced a new outlier detection method which is suitable for temporal databases. These authors address the outlier detection problem as part of the data mining process, by defining the temporal quasi-functional dependency (i.e. a quasi-functional dependency that varies through time), and present the so-called Temporal Outlier Detection (TOD) algorithm. In practice, the proposed approach extracts the temporal association rules from the database and then combines them to discover temporal quasi-functional dependencies. Association rules express the pattern knowledge existing in a given dataset, i.e. association rule mining is a technique for discovering data dependencies (Liang et al., 2005); temporal association rules extend the association rule concept by attaching time information to the antecedent and the consequent (Bruno & Garza, 2010). The algorithm extracts quasi-functional dependencies with a dependency degree higher than (or equal to) a user-specified threshold. Then, for each temporal quasi-functional dependency, a set of data is selected to be deleted in order to change the temporal quasi-functional dependency into a potential temporal functional dependency. The removed data are defined as outliers.
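The following minimal sketch illustrates the core idea of quasi-functional-dependency-based outlier detection on a toy, non-temporal table: the tuples contradicting an otherwise almost-exact dependency dept -> site are flagged. It is not the TOD algorithm itself; the attribute names and the 0.9 dependency-degree threshold are assumptions.

```python
from collections import Counter, defaultdict

def quasi_fd_outliers(rows, x_keys, y_key, min_degree=0.9):
    """Group tuples by the antecedent attributes x_keys; when the majority
    value of y_key covers at least min_degree of a group, the remaining
    tuples violate the quasi-functional dependency and are flagged."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[k] for k in x_keys)].append(i)
    outliers = []
    for idx in groups.values():
        counts = Counter(rows[i][y_key] for i in idx)
        majority_value, majority_count = counts.most_common(1)[0]
        if majority_count / len(idx) >= min_degree:
            outliers.extend(i for i in idx if rows[i][y_key] != majority_value)
    return outliers

rows = ([{"dept": "A", "site": 1}] * 19 + [{"dept": "A", "site": 2}]
        + [{"dept": "B", "site": 3}] * 10)
print(quasi_fd_outliers(rows, x_keys=["dept"], y_key="site"))   # -> [19]
```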
Another particular data representation that has been frequently used for outlier detection is the one based on rough sets. The rough set concept is based on the assumption that each observation of the universe is associated with a specified amount of information. Objects with the same information are indiscernible; a set that can be expressed as a union of indiscernible observations is referred to as a crisp set, otherwise the set is imprecise, or rough. Rough set theory was introduced by Pawlak (Pawlak, 1982; Pawlak, 1991; Pawlak et al., 1995) and is of interest in the study of intelligent systems characterized by incomplete and insufficient information. Several works demonstrate the importance of the rough set approach, especially in the fields of machine learning and data mining (Lin & Gereone, 1996; Pawlak et al., 1995; Skowron & Rauszer, 1992; Yao et al., 2003). In rough set data models, information is organized in a table called an information system; if some attributes derive from a classification operation, the table is also called a decision system. A rough set, in contrast to a crisp set, cannot be exactly characterized by the available information and is therefore described by a lower approximation, an upper approximation and a boundary region. The lower approximation is also called the positive region, while the complement of the upper approximation is called the negative region. The lower approximation includes all the observations that certainly belong to the considered concept, while the upper approximation includes the observations which possibly belong to it; the difference between the two approximations is the boundary region. Figure 1 shows an example of rough set.

Figure 1. An example of rough set
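As an illustration, the following minimal sketch computes the lower approximation, upper approximation and boundary region of a target set of objects, assuming that the indiscernibility relation is induced by exact equality on the chosen condition attributes; the toy table is an assumption.

```python
from collections import defaultdict

def approximations(objects, condition, target):
    """Lower/upper approximations of a target set of object ids under the
    indiscernibility relation induced by the condition attributes."""
    classes = defaultdict(set)
    for oid, attrs in objects.items():
        classes[tuple(attrs[a] for a in condition)].add(oid)
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= target:          # entirely inside the concept
            lower |= eq_class
        if eq_class & target:           # possibly inside the concept
            upper |= eq_class
    return lower, upper, upper - lower  # third value: boundary region

objects = {1: {"color": "red"}, 2: {"color": "red"},
           3: {"color": "blue"}, 4: {"color": "blue"}}
print(approximations(objects, ["color"], {1, 2, 3}))
# -> ({1, 2}, {1, 2, 3, 4}, {3, 4})
```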

Jang et al. (Jang et al., 2009) propose to combine rough set theory and outlier detection through two different approaches: a sequence-based outlier detection in information systems of rough set theory, and a classical distance-based outlier detection method applied to rough sets. The definition of sequence-based outliers in an information system is inspired by Hawkins' definition (Hawkins, 1980), and the basic idea builds on the work by Skowron and Synak (Skowron & Synak, 2004), who introduced the basic concepts for approximate information exchange using information granules. The basic idea is as follows. An information system is defined by a quadruple IS = (U, A, V, f), where U is a non-empty set of observations, A is a non-empty set of attributes, V is the union of the attribute value domains and f is an information function which links one value of each attribute to each observation included in U. For each datum x belonging to U, if x differs (on the basis of some characteristic) from the other objects in U, it is labeled as an outlier with respect to IS. The second approach applies a traditional distance-based outlier detection method to rough sets in order to calculate the distance between two objects in an information system. To this aim it is necessary to use a distance metric suitable for nominal attributes. An appropriate distance function for nominal attributes, called the Value Difference Metric (VDM), was introduced by Stanfill & Waltz (Stanfill & Waltz, 1986). The value difference metric between two objects x and y is defined as follows:

VDM(x, y) = \sum_f d_f(x_f, y_f)    (9)

where f indexes the features, x_f is the value of object x on feature f, y_f is the value of object y on feature f and d_f is a per-feature distance between the two values.
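As an illustration, in the commonly used class-conditional form of VDM the per-feature distance d_f compares how the two nominal values distribute over the decision classes. The following sketch (with exponent q = 1, an assumption) shows the computation for a single feature; VDM(x,y) in (9) is then the sum of such distances over all features.

```python
from collections import Counter, defaultdict

def vdm_feature_distance(values, classes, v1, v2, q=1):
    """d_f(v1, v2) = sum_c |P(c | f=v1) - P(c | f=v2)|^q  (class-conditional VDM)."""
    by_value = defaultdict(Counter)
    for v, c in zip(values, classes):
        by_value[v][c] += 1
    n1, n2 = sum(by_value[v1].values()), sum(by_value[v2].values())
    return sum(abs(by_value[v1][c] / n1 - by_value[v2][c] / n2) ** q
               for c in set(classes))

feature = ["red", "red", "red", "blue", "blue", "green"]
labels  = ["ok",  "ok",  "bad", "bad",  "bad",  "ok"]
print(vdm_feature_distance(feature, labels, "red", "blue"))
# red: P(ok)=2/3, P(bad)=1/3; blue: P(ok)=0, P(bad)=1 -> 2/3 + 2/3 = 4/3
```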

Recent Artificial Intelligence-based Approaches to Outlier Detection

Artificial Intelligence (AI) is a branch of computer science aiming at providing machines with a sort of intelligence similar to the one characterizing living beings. Many definitions of AI can actually be found in the literature: in particular, Russell & Norvig (Russell & Norvig, 2003) define an intelligent agent as a system that perceives its environment and takes actions that maximize its chances of success. Nowadays the term AI is widely used to indicate a variety of methods and techniques, such as neural networks, fuzzy logic and genetic algorithms. In recent years, the ever increasing application of AI techniques has led many researchers to evaluate the possibility of exploiting some of them for outlier detection. Thus many works have been proposed, either to improve already existing methods or to introduce new algorithms.

5.1 Support Vector Machine-based Methods

The SVM algorithm, introduced by Vapnik (Vapnik, 1995), is essentially a binary classification algorithm, although it has been extended to multi-class problems. The data belonging to the different classes need to be separated by a hyperplane, but they are not always well separable: to overcome this, the data are mapped to a feature space with higher dimensionality, where separation through hyperplanes is easier. The SVM classifier is widely used in many disciplines because it achieves high accuracy and is able to deal with high-dimensional data (Ben-Hur & Weston, 2010). SVM-based methodologies have been widely used for outlier detection, for instance in (Tax & Juszczak, 2002; Guo et al., 2008; Peng et al., 2010; Zhang et al., 2008), because they do not require a priori knowledge about any kind of statistical model, can be applied to data with high dimensionality and provide an optimal solution by maximizing the margin of the decision boundary. A modification of the SVM algorithm suitable for outlier detection was proposed by Scholkopf (Scholkopf et al., 2001), who suggested a way to adapt the SVM method to one-class classification problems. One-class SVM is an unsupervised algorithm which maps input data into a high dimensional feature space and, through several iterations, finds the hyperplane which best separates the training samples from the origin. In practice, the one-class SVM is a standard two-class SVM where all training samples belong to the first class and the origin is the only member of the second class. The one-class SVM maps the data into a feature space through an appropriate kernel function, the most popular choices being linear, polynomial, Gaussian and sigmoidal functions, and separates the mapped vectors from the origin with maximum margin. An advantage of one-class SVM for outlier detection is its high True Positive Rate (TPR), i.e. the probability of correctly detecting the outliers; its disadvantage is an equally high False Positive Rate (FPR), i.e. the probability of misclassifying as outliers samples which are not outliers.
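As an illustration, the one-class SVM is available in scikit-learn; in the following minimal sketch the parameter nu, which bounds the fraction of training points treated as outliers, is the main knob trading TPR against FPR, and its value (as well as the synthetic data) is an assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)),      # normal operating data
               rng.uniform(-6, 6, (5, 2))])    # a few anomalous points

# nu bounds the fraction of training points treated as outliers; 0.05 here
# is an assumption, not a recommended setting.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X)
pred = clf.predict(X)                          # +1 = inlier, -1 = outlier
print(np.where(pred == -1)[0])                 # indices flagged as outliers
```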

To solve this problem, Tian & Gu (Tian & Gu, 2010) proposed a novel one-class model which combines one-class SVM and Particle Swarm Optimization (PSO) algorithms (Kennedy & Eberhart, 1995; Shi & Eberhart, 1998). The PSO algorithm is inspired by the social behavior of insects, birds and fish: it optimizes a given problem by iteratively improving a population of candidate solutions. This algorithm has been successfully applied to a wide variety of problems and has performance comparable to genetic algorithms. In this approach the PSO algorithm is used to identify the optimal SVM parameters, obtaining a high detection rate with a low FPR. The combination of the SVM classifier and the PSO algorithm means that outliers are effectively detected through the optimization of the classifier, which is built through a suitable parameter selection and boundary movement strategy. The results show that the proposed approach improves the robustness of the overall decision and achieves the best compromise between TPR and FPR. Other recent examples of outlier detection treated as a one-class learning problem are presented in (Schweizer & Moura, 2000; Miller & Browning, 2003; Scholkopf et al., 2001; Banerjee et al., 2006; Campbell & Bennet, 2001; Ratsch et al., 2002; Markou & Singh, 2006; Han & Cho, 2006; Abe et al., 2006), and several other applications exploit SVM-based techniques to detect outliers with satisfactory results (Davy et al., 2006; Zhang et al., 2009; King et al., 2002; Gardner et al., 2006; Eskin et al., 2002; Lazarevic et al., 2003; Giacinto et al., 2008; Roberts & Tarassenko, 1994; Tax & Duin, 1999; Tax & Duin, 2004).

5.2 Fuzzy Logic-based Methods

Fuzzy Logic (FL) is connected with the theory of fuzzy sets, which deals with classes of objects with unsharp boundaries, where a single object can simultaneously belong to different sets with different degrees of membership. A Fuzzy Inference System (FIS) (Ross, 2004) calculates the mapping from a given input to an output by using fuzzy logic. The input variables are mapped into sets of membership functions called "fuzzy sets", and the process of converting a crisp value to a fuzzy value is named "fuzzification". The process of fuzzy inference involves membership functions (MF), i.e. curves that define how each point in the input space is mapped to a membership value (or degree of membership) between 0 and 1, fuzzy logic operators (and, or, not) and if-then rules. The rule results are mapped into membership functions and combined to give a crisp answer; this last process is called "defuzzification". In recent years a novel interesting approach, called Fuzzy Rough Semi-Supervised Outlier Detection (FRSSOD) (Xue et al., 2010), was proposed. This approach combines the Semi-Supervised Outlier Detection method (SSOD) proposed by Gao et al. (Gao et al., 2006) with a clustering method introduced by Hu and Yu (Hu & Yu, 2005), named Fuzzy Rough C-Means clustering (FRCM); the method therefore naturally belongs to the clustering-based approaches. The proposed method integrates the advantages of SSOD and FRCM and decides only whether the points on the boundary can be considered as outliers. In order to fully understand FRSSOD, the SSOD method and the FRCM approach must be known as well; a brief description of these approaches is provided in the following. Many outlier detection methods are unsupervised algorithms (Breunig et al., 2000; Jin et al., 2001; Eskin et al., 2002), and the unsupervised methods often have a high FPR and a low TPR. Supervised detection methods have been introduced in order to improve the algorithm performance (Marsland, 2001; Lazarevic, 2003; Markou, 2006), but the collection of a large amount of labeled training data can be quite difficult. For these reasons, semi-supervised outlier detection methods (Li et al., 2007; Zhang et al., 2005; Gao et al., 2006; Xue & Liu, 2009) have recently been presented. SSOD uses both unlabeled and labeled data, thus improving accuracy without the need for a high number of labeled data. Let X = {x_1, x_2, ..., x_n} be a set of data points drawn from R^m. The first l points of X, with l < n, are labeled with a null value if the selected point is an outlier and a unitary value otherwise. The objective of the method is to predict whether a point is an outlier or not. Let us suppose that normal data form K clusters and that outliers are not included in any cluster. It is necessary to find an n x K matrix T = {t_ih, i = 1, 2, ..., n; h = 1, 2, ..., K}, where t_ih has a unitary value if x_i belongs to the cluster C_h. The optimization problem consists in minimizing the following function:

(10)

where c_h is the center of cluster C_h, dist represents the Euclidean distance and \lambda_1, \lambda_2 are adjusting parameters.

The first term is inherited from the k-means clustering objective function; as only normal points are partitioned into clusters, outliers are not included in this term. The second term constrains the number of outliers not to become too large. The third term maintains consistency between the labeling produced by the method and the existing labels. The minimization of the above-defined objective function leads to pointing out as outliers the points that do not belong to any cluster. FRCM was introduced by Hu & Yu (Hu & Yu, 2005) as a combination of the Fuzzy C-means and Rough C-means (RCM) methods. The Fuzzy C-means method partitions the data points around cluster centers: a fuzzy membership in the range 0-1 is assigned to each point for every cluster, so that each object belongs to some or all of the clusters with some fuzzy degree; the results depend on the initialization of the cluster centers (see Subsection 3.3). In the RCM method the concept of C-means clustering is combined with the concept of rough set (already treated in Section 4), i.e. each cluster is seen as a rough set which has a lower approximation region, an upper approximation region and a boundary region. The upper approximation region of a cluster includes samples which are also members of other clusters, i.e. RCM classifies the object space into three parts: lower approximation, boundary and negative region. The main difference between rough clustering and classical clustering lies in the fact that in rough clustering a sample can be a member of more than one cluster, which allows overlaps between clusters. In particular, Lingras (Lingras & West, 2004) assumes the following properties:

- A datum can be a member of only one lower approximation.
- The lower approximation of a given cluster must be a subset of its upper approximation.
- If a datum is not a member of any lower approximation, then it is a member of two or more upper approximations.
- Data in a boundary region are uncertain data and are assigned to at least two upper approximations.

RCM has many advantages and its applicability extends to several fields characterized by uncertain information granulation. FRCM combines the advantages of fuzzy set theory and rough set theory and integrates the fuzzy membership value of each sample to the lower approximation and boundary area of a cluster. FRCM can be formulated as follows. Let X = {x_1, x_2, ..., x_n} be a set of data points and let \underline{C}_k and \overline{C}_k be, respectively, the lower and upper approximation of a cluster; B(C_k) = \overline{C}_k - \underline{C}_k is the boundary area, c = {c_1, c_2, ..., c_K} is the vector of the K cluster centers and u = {u_ik} are the memberships collected in an n x K matrix. FRCM partitions the data into two classes, a lower approximation region and a boundary region, and only the objects belonging to the boundary region are fuzzified. FRCM amounts to the optimization of the following function:

(11)

FRSSOD exploits both the above-described methods and combines the two approaches into a novel one. Let X = {x_1, x_2, ..., x_n} be a set of data points and let Y be a subset of X formed by l < n elements. The first l points are labeled through a sort of membership index y_i \in {0,1}, where y_i = 0 means that the point x_i is considered an outlier. Normal points form C clusters, while outliers do not belong to any cluster, and each normal point belongs to a cluster with a membership degree: if such membership degree has a small value, the associated point will be considered as an outlier. The optimization problem is defined as:

(12)

where \lambda_1 and \lambda_2 are adjusting positive parameters, applied to make the three terms compete with each other, while m is a fuzziness weighting exponent (m > 1). As only normal points are partitioned into clusters (following the idea of the SSOD approach), outliers do not contribute to the first term. The second term avoids the detection of an extremely large number of outliers. The third term preserves consistency of the produced labeling with the existing labels and penalizes mislabeled points. FRSSOD not only uses unlabeled and labeled data but also integrates fuzzy and rough set theory; therefore it can be applied to many fields characterized by fuzzy information granulation or in which decisions cannot be taken under certain conditions. The experimental results show that FRSSOD has many advantages over SSOD, as it improves outlier detection accuracy and reduces the false alarm rate under the guidance of the labeled points. On the other hand, the performance of FRSSOD depends on the selection of the number of clusters and of the adjusting parameters \lambda_1 and \lambda_2. Fuzzy logic has also been applied to outlier detection as a tool to combine different outlier detection methods, in an attempt to exploit the advantages of each of them while overcoming their drawbacks; such an approach does not belong to a single category but spans several categories of outlier detection methods. Cateni et al. (Cateni et al., 2009) proposed a novel method based on fuzzy logic theory, which is a substantial improvement of a first attempt previously proposed by the same authors (Cateni et al., 2007; Cateni et al., 2008) and combines a distance-based method, a density-based method, a clustering-based method and a distribution-based method. This method does not require any a priori assumption on the data and is able to detect outliers without the need for preliminary statistical analyses or parameter tuning; therefore the approach can be adopted even by inexperienced users. For each sample, four features are calculated by using the most popular outlier detection techniques (see Section 3): the Mahalanobis distance (Mahalanobis, 1936), a membership value evaluated through the fuzzy C-means technique (Bezdek, 1981; Dunn, 1974), the local outlier factor (Breunig et al., 2000) and the result of the Grubbs test (Grubbs, 1969). Noticeably, the fuzzy C-means algorithm requires the number of clusters to be known a priori, while in this case such number is automatically calculated: the clustering-based feature is obtained through the fuzzy C-means algorithm combined with the validity measure based on inter- and intra-cluster distances proposed by Ray and Turi (Ray & Turi, 1999). This approach calculates the distance between each point and its cluster center to decide whether the clusters are compact. Two measures are defined: the intra-cluster distance, i.e. the average distance between each point and its cluster center, and the inter-cluster distance, i.e. the minimum distance between cluster centers. To determine the optimal number of clusters, the intra-cluster distance must be minimized while the inter-cluster distance must be maximized; their ratio, named validity measure, is defined as follows:

validity = intra-distance / inter-distance    (13)

and the optimal number of clusters is obtained by minimizing the validity measure (13). The four features are fed as inputs to a fuzzy inference system (FIS) (Ross, 2004), which provides as output an index in the range (0,1) representing the likelihood that the selected sample is an outlier. The adopted FIS is of the Mamdani type (Mamdani, 1974).
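As an illustration, the following minimal sketch selects the number of clusters by minimizing a Ray-Turi-style validity index; plain k-means is used here for simplicity instead of fuzzy C-means, and the candidate range of cluster numbers is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def ray_turi_validity(X, labels, centers):
    """validity = intra / inter: mean squared distance of points to their own
    center, divided by the minimum squared distance between cluster centers."""
    intra = np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
    inter = min(np.sum((a - b) ** 2)
                for i, a in enumerate(centers) for b in centers[i + 1:])
    return intra / inter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (40, 2)) for m in ((0, 0), (5, 0), (0, 5))])
scores = {}
for k in range(2, 7):                              # candidate cluster numbers
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = ray_turi_validity(X, km.labels_, km.cluster_centers_)
print(min(scores, key=scores.get))                 # expected optimum: 3
```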
Figure 2 depicts a scheme of the proposed method. The method has been tested in an industrial context and the results show that this approach outperforms the traditional techniques.

Figure 2: Block diagram of the fuzzy logic-based outlier detection method

5.3 Genetic Algorithm-based Methods

Genetic Algorithms (GA) belong to the wider class of evolutionary optimization methods: their main feature is the attempt to mimic the evolution of living organisms through generations; this natural process is simulated in order to progressively build a solution to a given problem which is optimal under one (or sometimes more than one) arbitrary criterion. A set of possible solutions to the considered optimization problem is organized into a population of candidate solutions, which is evolved by means of the GA engine. At each generation of the GA, the goodness of each candidate solution is evaluated through a performance measure usually named fitness function: the individuals with higher fitness are used to build the population of the subsequent generation. The best candidates not only survive but are also combined in order to generate new (and hopefully better) individuals, as happens in natural evolution. The GA population is evolved, generation by generation, until an arbitrary stop condition is achieved, which typically involves the attainment of a particularly high fitness value by one of the candidates or by part of the population, the completion of a predetermined number of generations, or the protracted evolution of the population without any improvement in the goodness of the candidates. Tolvi (Tolvi, 2004) proposed an application of GA to outlier detection based on a statistical approach. The problem of associating the data to the best possible model is faced by first finding any outliers in the data; a number of initial candidate models are selected and examined. Tolvi also discusses two nuisances in outlier detection, namely smearing and masking: smearing means that the presence of an outlier causes other normal observations to be misclassified as outliers, while masking means that an outlier prevents another datum from being correctly classified as an outlier by an outlier detection method. In (Tolvi, 2004) an outlier detection method for linear regression modeling is treated: GAs are used for outlier detection while avoiding the potential problems of smearing and masking, and simultaneously the problem of variable selection is addressed. The motivation for treating the two problems together (i.e. outlier detection and variable selection) lies in the fact that the choice of the variables to select can affect the outlier detection and vice-versa (Chatterjee & Hadi, 1988). Potential outliers can be included in the linear regression model by means of dummy variables: a dummy variable is here a binary indicator which is unitary for outlier samples and null for non-outlier samples, so that the aim of the proposed approach becomes the selection of the best model among the models obtained with all possible combinations of dummy variables. The outlier detection is based on the use of information criteria, in particular the Bayesian Information Criterion (BIC) (Schwarz, 1978). Schwarz introduced the BIC to serve as an approximation to a transformation of the Bayesian posterior probability of a candidate model; its computation depends on the model complexity, i.e. on the number of parameters of the selected model. Let us suppose that X = {x_1, x_2, ..., x_N} is the dataset to be modeled and M = {M_1, M_2, ..., M_k} are the candidate parametric models. Let L(X, M) be the maximized likelihood function for each model; the BIC is defined as:

BIC = \log L(X, M) - \frac{1}{2} \lambda \log(N)    (14)

where \lambda is the number of parameters in the model M. A model with a high value of this criterion, which corresponds to having few parameters and small residuals, is selected by the GA. The proposed GA starts with a randomly generated population of 40 individuals in each generation; each individual contains genes with value zero with probability 0.9 and genes with unitary value with probability 0.1. The algorithm becomes faster when preliminary information about which samples are potential outliers is added and, although the paper treats linear regression models, the method is also suitable for other statistical models. Other recent applications of GA to outlier detection (Aggarwal & Yu, 2001; Yan et al., 2004; Bandyopadhyay & Santra, 2008) can be found in the literature.
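As an illustration, the following minimal sketch scores one candidate model, i.e. a linear regression with a chosen set of outlier dummies, through a BIC of the form (14); the Gaussian log-likelihood, the parameter count and the example data are assumptions.

```python
import numpy as np

def bic_linear_model(X, y, outlier_idx):
    """BIC = log L - (lambda/2) log N for an OLS fit in which each suspected
    outlier receives its own dummy column (absorbing its residual)."""
    N = len(y)
    D = np.zeros((N, len(outlier_idx)))
    D[outlier_idx, np.arange(len(outlier_idx))] = 1.0   # one dummy per outlier
    A = np.column_stack([np.ones(N), X, D])             # intercept + regressors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma2 = np.sum((y - A @ beta) ** 2) / N
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian ML value
    n_params = A.shape[1] + 1                               # betas + sigma^2
    return log_lik - 0.5 * n_params * np.log(N)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)
y[10] += 15.0                                              # one injected outlier
print(bic_linear_model(x[:, None], y, []))                 # no dummies
print(bic_linear_model(x[:, None], y, [10]))               # higher BIC expected
```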

Outlier Detection in Image Processing

Outlier detection is also an important tool in image processing and analysis. In an image, an outlier can appear when the image changes over time, or it can be represented by regions which are anomalous with respect to the rest of a quasi-static image (i.e. an image with very small variations through time). Outliers can be due to motion, to the insertion of anomalous objects or to instrumentation errors. The outlier detection process is a fundamental pre-processing tool in many image analysis applications, such as satellite imagery, spectroscopy, mammographic imaging or video surveillance (Chandola et al., 2009). Often in image processing the data present both spatial and temporal characteristics, and outlier detection is an important task in identifying false matches. Malpica et al. (Malpica et al., 2008) propose an innovative technique for outlier detection in hyperspectral images. A hyperspectral image is a digital image where each element (pixel) has an associated electromagnetic spectrum; it can also be seen as a cube of data (called a hypercube). Due to the high number of bands, the large amount of data can be redundant, and the most interesting information is difficult to extract because of the high dimensionality of the data themselves. After detection, anomalous points can be retained, because they contain interesting information, or they can be discarded. The authors propose a method based on Projection Pursuit (PP) (Friedman & Tukey, 1974; Kruskal, 1969) to detect possible anomalies. This technique relies on one or more linear combinations of the original features, chosen with the aim of maximizing an index representing an interestingness measure. The results show that the PP technique can detect both groups of outliers and isolated outliers; the proposed algorithm was applied to AHS and HYDICE hyperspectral imageries. The common Principal Component Analysis (PCA) (Jolliffe, 2002) is a special case of PP: in PCA the data reduction is performed by choosing the linear combinations of the considered variables which maximize the variance of the projected data, i.e. the interestingness index is the variance. An important contribution towards a new perspective on PCA-based approaches is given by Ding and He (Ding & He, 2004). A method based on PCA to reduce dimensionality and detect outliers in hyperspectral imagery is treated in (Goovaerts et al., 2005), while in (Saha et al., 2009) PCA-based outlier detection is also used to automate snake contours for object detection. A minimal sketch of the PCA-based idea is given below.
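The following sketch scores samples by their reconstruction error outside the principal subspace, which is one common way (among several) to turn PCA into an outlier detector; the number of retained components, the percentile threshold and the synthetic data are assumptions.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=3):
    """Project onto the top principal components and measure how badly each
    sample is reconstructed; anomalies tend to have large residuals."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components]                       # principal directions
    residual = Xc - (Xc @ P.T) @ P              # part outside the PCA subspace
    return np.linalg.norm(residual, axis=1)

rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))  # low-rank "bands"
spectra[:3] += rng.normal(0, 5, (3, 50))        # anomalous pixels
err = pca_reconstruction_error(spectra, n_components=3)
print(np.where(err > np.percentile(err, 98))[0])  # roughly indices 0, 1, 2
```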

Snakes are deformable models used to estimate the boundary of an object whose shape is partially unknown; an example of the use of a snake is shown in Figure 3.

Figure 3: Example of snake contour

The deformable curve is a sort of elastic curve which is able to approximate the considered image features. A novel method for active contour models (or snakes) is proposed by Chan (Chan, 2001); it is an interesting approach because the proposed model is able to detect objects whose boundaries are not necessarily defined by the gradient. In this research area, outliers are features which do not lie on the object boundary. In (Nascimento, 2005) an algorithm for the detection of object boundaries in the presence of outliers is proposed: a deformable contour (as in snakes) approximates the object boundary through the Maximum A Posteriori (MAP) estimation method (Abrantes & Marques, 1996) using Expectation Maximization (EM) (McLachlan & Krishnan, 1997). Dashti et al. (Dashti et al., 2010) proposed the ET-DRN method to understand the relationships between the objects in a given dataset. Its hierarchical clustering procedure includes the Euler algorithm to assign objects to clusters, a GA to increase the density between objects within each cluster and, finally, the Kullback-Leibler divergence to calculate the dissimilarity of the clusters. Objects are considered in high dimensionality and are examined as objects of a digital geometry, so that it is possible to build a sensible mathematical structure where outliers are clearly detectable. Silveira et al. (Silveira et al., 2008) proposed a new method which classifies image features as valid or invalid (i.e. outliers) by organizing edge points into connected segments (the so-called strokes). An adaptive stopping force, which allows the contour to bridge the invalid features and stop at the valid ones, is applied. A confidence degree is then assigned to each stroke during the evolution process, and the weights are given by the probability that a stroke is valid.

Case Study Using Synthetic Data

In order to show how the different classical methods and a recently proposed method work, an example using synthetic data is presented. The created database includes 100 samples of a random variable whose probability density function derives from the composition of two Gaussian functions, as shown in Figure 4. 10 outliers, indicated with red circles in Figure 4, have been included in the database; a sketch of how such a dataset can be generated is given below.
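Since the exact parameters of the original dataset are not reported, the following generation sketch is purely illustrative: the mixture weights, the means, the standard deviations and the outlier placement are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two-Gaussian mixture for the 100 normal samples (parameters assumed).
component = rng.random(100) < 0.6
normal = np.where(component,
                  rng.normal(0.0, 1.0, 100),    # first Gaussian mode
                  rng.normal(6.0, 1.5, 100))    # second Gaussian mode

# 10 outliers placed far from both modes (placement assumed).
outliers = rng.uniform(12.0, 20.0, 10)

data = np.concatenate([normal, outliers])
labels = np.array([0] * 100 + [1] * 10)         # 1 marks the injected outliers
```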

Figure 4: The synthetic dataset and, on the left, the distribution of the data that are not outliers.

Four classical outlier detection methods have been applied to this database: a distance-based approach, where the Mahalanobis distance is exploited; a density-based approach based on the LOF algorithm; a clustering-based method which exploits the Fuzzy C-means as clustering algorithm; and, finally, a distribution-based approach using the Rosner algorithm. Moreover, on the same database an AI-based technique has also been tested, namely the one proposed in (Cateni et al., 2009). The results of these tests are reported in Table 1.
Approach                                  Outliers detected (%)
Distance-based (Mahalanobis distance)     30% (A - E - I)
Density-based (LOF)                       30% (G - H - L)
Clustering-based (Fuzzy C-means)          30% (B - D - F)
Distribution-based (Rosner's test)        70% (A - C - E - G - H - I - L)
AI-based (Fuzzy approach)                 100% (A - B - C - D - E - F - G - H - I - L)

Table 1: Test results of some outlier detection techniques on the synthetic database of Fig. 4.

The results put into evidence the particular features of the tested algorithms. In particular, the distance-based approach is capable of pointing out the outliers that mostly differ from the mean value, while the density-based approach detects only the outliers that are isolated from the data. The clustering-based approach finds isolated outliers after a clustering operation. Finally, the distribution-based approach considers as outliers those points that deviate from the fitted model; in this example the distribution-based method works quite well because the initial dataset is created from two Gaussian distributions. The fuzzy-based approach, which combines the classical methods, outperforms all the traditional techniques, as it exploits all their capabilities while compensating for their weaknesses.

Conclusion

A survey of outlier detection methods has been proposed. Both traditional approaches and their recent enhancements, as well as some interesting applications, have been presented and discussed. Finally, a case study based on a synthetic database has been proposed with the purpose of showing how the different methods work. The conclusion is that the potential and efficiency of an outlier detection method strongly depend on the kind and distribution of the data being processed. For instance, clustering-based methods are very effective if the data are strongly clustered, while distribution-based methods can work quite well if the hypotheses that they require on the data distribution are correct, which means that they can be applied only when some a priori knowledge of the data distribution is available. If no information is available on the data to be processed and/or if the data features can change through time in an unpredictable way, then probably the best solution is to try different methods and/or to apply a combination of several outlier detection methods based on different principles. Fuzzy logic can provide a powerful tool to automatically perform such a combination, but other combination procedures are possible as well.

References
Abe, N., Zadrozny, B. & Langford, J. Outlier detection by active learning. Proc. ACM SIGKDD '06, 2006, pp. 504-509.
Abrantes, A. & Marques, J. A class of constrained clustering algorithms for object boundary detection. IEEE Trans. Image Processing, vol. 5, no. 11, Nov. 1996, pp. 1507-1521.
Aggarwal, C.C. & Yu, P.S. Outlier detection for high dimensional data. Proc. ACM SIGMOD Conference, 2001, pp. 37-47.
Aitchison, J. & Brown, J.A.C. The Lognormal Distribution. Cambridge University Press, Cambridge, UK, 1957.
Apiletti, D., Baralis, E., Bruno, G. & Ficarra, E. Data cleaning and semantic improvement in biological databases. Journal of Integrative Bioinformatics, 3 (2), 2006, pp. 1-11.
Bandyopadhyay, S. & Santra, S. A genetic approach for efficient outlier detection in projected space. Pattern Recognition, 41, 2008, pp. 1338-1349.
Banerjee, A., Burlina, P. & Diehl, C. A support vector method for anomaly detection in hyperspectral imagery. IEEE Trans. Geoscience and Remote Sensing, vol. 44, no. 8, 2006, pp. 2282-2291.
Barnett, V. & Lewis, T. Outliers in Statistical Data, 3rd ed. John Wiley & Sons, New York, 1994.
Ben-Hur, A. & Weston, J. A user's guide to Support Vector Machines. Methods in Molecular Biology, 609, 2010, pp. 223-239.
Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
Breunig, M.M., Kriegel, H.P., Ng, R.T. & Sander, J. LOF: identifying density-based local outliers. Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data, ACM, New York, NY, USA, June 2000, vol. 29, issue 2, pp. 93-104.
Bruno, G. & Garza, P. TOD: temporal outlier detection by using quasi-functional temporal dependencies. Data & Knowledge Engineering, 69, 2010, pp. 619-639.
Bruno, G., Garza, P., Quintarelli, E. & Rossato, R. Anomaly detection through quasi-functional dependency analysis. Journal of Digital Information Management, 5 (4), 2007, pp. 191-200.
Campbell, C. & Bennet, K.P. A linear programming approach to novelty detection. Advances in Neural Information Processing Systems, 13, 2001, pp. 395-401.
Canfield, R.V., Taillie, C., Patil, G.P. & Baldessari, B.A. Extreme value theory with applications to hydrology. In Statistical Distributions in Scientific Work, vol. 6, Reidel Publishing Company, Dordrecht, Holland, 1981, pp. 35-49.
Castillo, E. Extreme Value Theory in Engineering. Academic Press, New York, 1988.
Cateni, S., Colla, V. & Vannucci, M. A fuzzy logic based method for outliers detection. Proc. 25th IASTED Int. Conf. on Artificial Intelligence and Applications (AIA 2007), Innsbruck, Austria, 2007, pp. 561-566.
Cateni, S., Colla, V. & Vannucci, M. Outlier detection methods for industrial applications. In Advances in Robotics, Automation and Control, I-Tech Education and Publishing KG, Croatia, October 2008.
Cateni, S., Colla, V. & Vannucci, M. A fuzzy system for combining different outliers detection methods. Proc. 25th IASTED Int. Conf. on Artificial Intelligence and Applications (AIA 2009), Innsbruck, Austria, 16-18 February 2009.
Chan, T.F. Active contours without edges. IEEE Trans. Image Processing, vol. 10, no. 2, February 2001.
Chandola, V., Banerjee, A. & Kumar, V. Anomaly detection: a survey. ACM Computing Surveys, September 2009.
Chatterjee, S. & Hadi, A.S. Sensitivity Analysis in Linear Regression. Wiley, New York, 1988.
Chaudhuri, P. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, vol. 91, no. 434, 1996, pp. 862-872.
Chundi, P., Subramaniam, M. & Vasireddy, D.K. An approach for temporal analysis of email data based on segmentation. Data and Knowledge Engineering, 68 (11), 2009, pp. 1253-1270.
Dashti, H.T., Kloc, M.E., Simas, T., Ribeiro, R.A. & Assadi, A.H. Introduction of empirical topology in construction of relationship networks of informative objects. IFIP Advances in Information and Communication Technology, Springer, 2010.
Date, C.J., Darwen, H. & Lorentzos, N. Temporal Data & the Relational Model, 1st ed. (The Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann, 2002, ISBN 1-55860-855-9.
Davy, M., Desobry, F., Gretton, A. & Doncarli, C. An online support vector machine for abnormal events detection. Signal Processing, 86 (8), 2006, pp. 2009-2025.
Ding, C. & He, X. Principal component analysis and effective K-means clustering. Proc. SDM, 2004, pp. 497-501.
Dunn, J.C. Some recent investigations of a new fuzzy partition algorithm and its application to pattern classification problems. Journal of Cybernetics, 4, 1974, pp. 1-15.
Eskin, E., Arnold, A., Prerau, M., Portnoy, L. & Stolfo, S. A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In Data Mining for Security Applications, vol. 19, 2002.
Friedman, J.H. & Tukey, J.W. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Computers, C-23 (9), 1974, pp. 881-890.
Gao, J., Cheng, H. & Tan, P.N. Semi-supervised outlier detection. Proc. 2006 ACM Symposium on Applied Computing, ACM Press, 2006, pp. 635-636.
Gardner, A.B., Krieger, A.M., Vachtsevanos, G. & Litt, B. One-class novelty detection for seizure analysis from intracranial EEG. Journal of Machine Learning Research, 7, 2006, pp. 1025-1044.
Giacinto, G., Perdisci, R., Del Rio, M. & Roli, F. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion, 9 (1), 2008, pp. 69-82.
Gilbert, R.O. Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, 1987.
Goovaerts, P., Jacqueza, G.M. & Marcus, A. Geostatistical and local cluster analysis of high resolution hyperspectral imagery for detection of anomalies. Remote Sensing of Environment, 95, 2005, pp. 351-367.
Grubbs, F.E. Procedures for detecting outlying observations in samples. Technometrics, 11, 1969, pp. 1-21.
Guo, S.M., Chen, L.C. & Tsai, J.S.H. A boundary method for outlier detection based on support vector domain description. Pattern Recognition, 42, 2009, pp. 77-83.
Han, S.-J. & Cho, S.-B. Evolutionary neural networks for anomaly detection based on the behavior of a program. IEEE Trans. Systems, Man, and Cybernetics B, vol. 36, no. 3, 2006, pp. 559-570.
Hawkins, D. Identification of Outliers. Chapman and Hall, London, 1980.
Hodge, V.J. A Survey of Outlier Detection Methodologies. Kluwer Academic Publishers, Netherlands, January 2004.
Hu, Q. & Yu, D. An improved clustering algorithm for information granulation. Proc. 2nd Int. Conf. on Fuzzy Systems and Knowledge Discovery (FSKD '05), LNCS vol. 3613, Springer-Verlag, Changsha, China, 2005, pp. 494-504.
Huhtala, Y., Kärkkäinen, J., Porkka, P. & Toivonen, H. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42 (2), 1999, pp. 100-111.
Jang, F., Sui, Y. & Cao, C. Some issues about outlier detection in rough set theory. Expert Systems with Applications, 36, 2009, pp. 4680-4687.
Jin, W., Tung, A.K.H. & Han, J. Mining top-n local outliers in large databases. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2001, pp. 293-298.
Jolliffe, I. Principal Component Analysis. Springer, New York, 2002.
Kennedy, J. & Eberhart, R. Particle swarm optimization. Proc. IEEE Int. Conf. on Neural Networks, vol. IV, 1995, pp. 1942-1948.
King, S.P., King, D.M., Astley, K., Tarassenko, L., Hayton, P. & Utete, S. The use of novelty detection techniques for monitoring high-integrity plant. Proc. 2002 Int. Conf. on Control Applications, Cancun, Mexico, vol. 1, 2002, pp. 221-226.
Kivinen, J. & Mannila, H. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149 (1), 1992, pp. 129-149.
Knorr, E.M. & Ng, R. Algorithms for mining distance-based outliers in large datasets. Proc. VLDB, 1998, pp. 392-403.
Kruskal, J.B. Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new index of condensation. In R.C. Milton & J.A. Nelder (Eds.), Statistical Computation, Academic Press, New York, 1969, pp. 427-440.
Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A. & Srivastava, J. A comparative study of anomaly detection schemes in network intrusion detection. Proc. 3rd SIAM Conf. on Data Mining, San Francisco, vol. 3, 2003.
Li, B., Fang, L. & Guo, L. A novel data mining method for network anomaly detection based on transductive scheme. In Advances in Neural Networks, LNCS vol. 4491, Springer, Berlin, 2007, pp. 1286-1292.
Liang, Z., Ximming, T., Lin, L. & Wenliang, J. Temporal association rule mining based on a T a-priori algorithm and its typical application. Proc. Int. Symposium on Spatio-Temporal Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion, 2005.
Lin, T.Y. & Gereone, N. Rough Sets and Data Mining: Analysis of Imprecise Data. Kluwer Academic, Dordrecht, 1996.
Lingras, P. & West, C. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems, vol. 23, no. 1, July 2004, pp. 5-16.
MacQueen, J. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, 1967, pp. 281-297.
Mahalanobis, P.C. On the generalized distance in statistics. Proc. National Institute of Sciences of India, 1936, pp. 49-55.
Malpica, J.A., Rejas, J.C. & Alonso, M.C. A projection pursuit algorithm for anomaly detection in hyperspectral imagery. Pattern Recognition, 41, 2008, pp. 3313-3327.
Mamdani, E.H. Application of fuzzy algorithms for control of simple dynamic plant. Proc. IEE, no. 121, 1974, pp. 298-316.
Marsland, S. On-line Novelty Detection Through Self-organisation, with Application to Inspection Robotics. Ph.D. Thesis, Faculty of Science and Engineering, University of Manchester, UK, 2001.
Markou, M. & Singh, S. A neural network-based novelty detection for image sequence analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 10, 2006, pp. 1664-1677.
Matsumoto, S., Kamei, Y. & Monden, A. Comparison of outlier detection methods in fault-proneness models. Proc. 1st Int. Symposium on Empirical Software Engineering and Measurement (ESEM 2007), September 2007, pp. 461-463.
McLachlan, G.J. & Krishnan, T. The EM Algorithm and Extensions. Wiley, New York, 1997.
Miller, D.J. & Browning, J. A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 11, November 2003, pp. 1468-1483.
Nascimento, J.C. Adaptive snakes using the EM algorithm. IEEE Trans. Image Processing, vol. 14, no. 11, 2005, pp. 1678-1686.
Papadakis, N., Antoniou, G. & Plexousakis, D. The ramification problem in temporal databases: changing beliefs about the past. Data and Knowledge Engineering, 59 (2), 2006, pp. 379-434.
Pawlak, Z. Rough sets. International Journal of Computer and Information Sciences, 11, 1982, pp. 341-356.
Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991.
Pawlak, Z., Grzymala-Busse, J.W., Slowinski, R. & Ziarko, W. Rough sets. Communications of the ACM, 38 (11), 1995, pp. 89-95.
Peng, X., Chen, J. & Shen, H. Outlier detection method based on SVM and its application in copper-matte converting. IEEE, ISBN 978-1-4244-5181-4, 2010.
Ramakrishnan, R. & Gehrke, J. Database Management Systems. McGraw-Hill, 2002.
Ratsch, G., Mika, S., Scholkopf, B. & Muller, K. Constructing boosting algorithms from SVMs: an application to one-class classification. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 9, September 2002, pp. 1184-1199.
Ray, S. & Turi, R.H. Determination of number of clusters in k-means clustering and application in colour image segmentation. Proc. 4th Int. Conf. on Advances in Pattern Recognition and Digital Techniques (ICAPRDT '99), Calcutta, India, 27-29 December 1999, pp. 137-143.
Roberts, S. & Tarassenko, L. A probabilistic resource allocating network for novelty detection. Neural Computation, vol. 6, no. 2, 1994, pp. 270-284.
Rosner, B. Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25, 1983, pp. 165-172.
Ross, T.J. Fuzzy Logic with Engineering Applications. John Wiley & Sons Ltd, England, 2004.
Russell, S.J. & Norvig, P. Artificial Intelligence: A Modern Approach, 2nd ed. Prentice Hall, Upper Saddle River, New Jersey, 2003.
Saha, B.N., Ray, N. & Zhang, H. Snake validation: a PCA-based outlier detection method. IEEE Signal Processing Letters, vol. 16, no. 6, 2009.
Schwarz, G. Estimating the dimension of a model. The Annals of Statistics, 6, 1978, pp. 461-464.
Schweizer, S.M. & Moura, J.M.F. Hyperspectral imagery: clutter adaptation in anomaly detection. IEEE Trans. Information Theory, vol. 46, no. 5, August 2000, pp. 1855-1871.
Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J. & Williamson, R.C. Estimating the support of a high-dimensional distribution. Neural Computation, vol. 13, no. 7, 2001, pp. 1443-1471.
Shannon, C.E. The mathematical theory of communication. Bell System Technical Journal, 27 (3-4), 1948, pp. 373-423.
Shi, Y. & Eberhart, R.C. A modified particle swarm optimizer. Proc. IEEE Int. Conf. on Evolutionary Computation, 1998, pp. 69-73.
Silveira, M., Nascimento, J.C. & Marques, J.S. Level set segmentation with outlier rejection. Proc. IEEE ICIP, 2008.
Skowron, A. & Rauszer, C. The discernibility matrices and functions in information systems. In Handbook of Applications and Advances of Rough Set Theory, vol. 11, Kluwer Academic Publishers, Dordrecht, 1992, pp. 331-362.
Skowron, A. & Synak, P. Reasoning in information maps. Fundamenta Informaticae, 59, 2004, pp. 241-259.
Stanfill, C. & Waltz, D. Toward memory-based reasoning. Communications of the ACM, 29 (12), 1986, pp. 1213-1228.
Tax, D.M.J. & Duin, R.P.W. Support vector domain description. Pattern Recognition Letters, 20 (11-13), 1999, pp. 1191-1199.
Tax, D.M.J. & Duin, R.P.W. Support vector data description. Machine Learning, 54, 2004, pp. 45-66.
Tax, D.M.J. & Juszczak, P. Kernel whitening for one-class classification. LNCS vol. 2388, Springer, Berlin, 2002, pp. 40-52.
Theodoridis, S. & Koutroumbas, K. Pattern Recognition, 3rd ed. Academic Press, San Diego, 2006.
Tian, J. & Gu, H. Anomaly detection combining one-class SVMs and particle swarm optimization algorithms. Nonlinear Dynamics, 61, Springer, 2010, pp. 303-310.
Tolvi, J. Genetic algorithms for outlier detection and variable selection in linear regression models. Soft Computing, 8, Springer-Verlag, 2004, pp. 527-533.
Vapnik, V. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
Weekes, C.D., Vose, J.M., Lynch, J.C., Weisenburger, D.D., Bierman, M.M., Greiner, T., Bociek, G., Enke, C., Bast, M., Chan, W.C. & Armitage, J.O. Hodgkin's disease in the elderly: improved treatment outcome with a doxorubicin-containing regimen. Journal of Clinical Oncology, 20 (4), 2002, pp. 1087-1093.
Wu, S.Y. & Chen, Y.L. Discovering hybrid temporal patterns from sequences consisting of point- and interval-based events. Data and Knowledge Engineering, 68 (11), 2009, pp. 1309-1330.
Xue, Z. & Liu, S. Rough-based semi-supervised outlier detection. Proc. 6th Int. Conf. on Fuzzy Systems and Knowledge Discovery, 2009, pp. 520-524.
Xue, Z., Shang, Y. & Feng, S. Semi-supervised outlier detection based on fuzzy rough C-means clustering. Mathematics and Computers in Simulation, 80, 2010, pp. 2011-2021.
Yan, C., Chen, G. & Shen, Y. Outlier analysis for gene expression data. Journal of Computer Science and Technology, 19 (1), 2004, pp. 13-21.
Yao, Y.Y., Zhao, Y. & Maguire, R.B. Explanation oriented association mining using rough set theory. Proc. 9th Int. Conf. on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, China, 2003, pp. 165-172.
Zhang, D., Gatica-Perez, D., Bengio, S. & McCowan, I. Semi-supervised adapted HMMs for unusual event detection. Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR '05), IEEE Press, June 2005, vol. 1, pp. 611-618.
Zhang, Y., Meratnia, N. & Havinga, P.J.M. Outlier Detection Techniques for Wireless Sensor Networks: A Survey. Technical Report TR-CTIT-08-59, Centre for Telematics and Information Technology, University of Twente, Enschede, 2008, ISSN 1381-3625.
Zhang, Y., Liu, X.D., Xie, F.D. & Li, K.Q. Fault classifier of rotating machinery based on weighted support vector data description. Expert Systems with Applications, 36 (4), 2009, pp. 7928-7932.
