
OUTLIER ANALYSIS

Outlier Analysis
Outlier: a data object that is grossly different from,
or inconsistent with, the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers may be valuable patterns
Fraud detection
Customized marketing
Medical Analysis
Outlier Mining
Given n data points and k, the expected number of
outliers, find the top k most dissimilar objects
Define inconsistent data

Residuals in Regression

Difficulties: multi-dimensional data, non-numeric data


Mine the outliers

Visualization based methods

Not applicable to cyclic plots, high-dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach

Statistical Distribution-based Outlier detection


Assumes data follows a probability distribution and
uses discordancy test
Discordancy testing
Working hypothesis $H: o_i \in F$, where $i = 1, 2, \ldots, n$
Test verifies whether an object $o_i$ is significantly
different from F
Significance probability $SP(v_i) = \mathrm{Prob}(T > v_i)$, where $v_i$ is
the value of the test statistic T for $o_i$
If SP is small, $o_i$ is discordant: the working
hypothesis is rejected and the alternative hypothesis
that $o_i$ comes from another distribution model G
is adopted
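A minimal sketch of a discordancy test, under loudly stated assumptions: F is taken to be a normal distribution, the test statistic T is the absolute standardized deviation from the sample mean, and the threshold `alpha` is a hypothetical choice. None of these is fixed by the slides; other distributions and statistics are equally valid.

```python
import math
from statistics import mean, stdev


def significance_probabilities(values):
    """Return SP(v_i) = Prob(T > v_i) for every value, assuming F is normal."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing deviates from F.
        return [1.0 for _ in values]
    sps = []
    for x in values:
        v = abs(x - mu) / sigma             # v_i: value of the statistic for o_i
        sp = math.erfc(v / math.sqrt(2))    # two-sided normal tail, Prob(|Z| > v)
        sps.append(sp)
    return sps


def discordant(values, alpha=0.05):
    """Objects whose significance probability is small (below alpha)."""
    sps = significance_probabilities(values)
    return [x for x, sp in zip(values, sps) if sp < alpha]
```

Because the mean and standard deviation are estimated from the same sample, one extreme value inflates both; this is one reason the choice of test matters in practice.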
Alternative distributions
Inherent alternative distribution

Alternative hypothesis: all objects arise from another distribution G
Mixture alternative distribution

Discordant values are not outliers in F but
contaminants from G; alternative hypothesis $\bar{H}: o_i \in (1 - \lambda)F + \lambda G$,
where $i = 1, 2, \ldots, n$
Slippage alternative distribution

Some objects are independent observations from a modified version of F (different
parameters)
Procedures for detecting Outliers
Block procedures

Either all suspect objects are treated as outliers or all are accepted as consistent


Consecutive Procedures

Inside-out procedure: the object least likely to be an outlier is tested first
If it is an outlier, all more extreme values are also considered outliers
Disadvantages of Statistical Approach

Tests are for single attributes


Data distribution may not be known

Distance based Outlier Detection


Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T
such that at least a fraction p of the objects in T
lies at a distance greater than D from O
Such an object does not have enough neighbours
Avoids the excessive computation of fitting statistical
models
If an object o is an outlier according to a
discordancy test, then o is also a DB(p, D)-outlier for
some p and D
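The DB(p, D) definition can be sketched directly as a brute-force check. This is only an illustration: the function name `db_outliers`, the use of Euclidean distance, and the representation of objects as coordinate tuples are assumptions, not part of the slides.

```python
import math


def db_outliers(points, p, D):
    """Return every object O such that at least a fraction p of the
    objects in the data set lie at a distance greater than D from O."""
    n = len(points)
    outliers = []
    for o in points:
        # Count objects farther than D from o (o itself is at distance 0).
        far = sum(1 for q in points if math.dist(o, q) > D)
        if far >= p * n:
            outliers.append(o)
    return outliers
```

For example, with four points on a unit square and one point at (10, 10), calling `db_outliers(points, 0.75, 3.0)` flags only the distant point. The cost is quadratic in the number of objects, which is what the algorithms that follow try to reduce.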
Index based Algorithm
Uses multi-dimensional indexing structures such
as k-d trees and R-trees
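M: maximum number of objects within the dmin-neighborhood
of an outlier (dmin is the distance threshold D)
Once M+1 neighbours of o are found, o is not an
outlier
Worst-case complexity: $O(n^2 k)$ apart from index construction, where k is the dimensionality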
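The early-termination rule (stop as soon as M + 1 neighbours are found) can be illustrated as follows. As a simplification, the multi-dimensional index is replaced here by a plain linear scan; a real index-based version would query a k-d tree or R-tree for objects within dmin. The function name and the choice not to count o as its own neighbour are assumptions of this sketch.

```python
import math


def is_db_outlier(o_index, points, dmin, M):
    """Return False as soon as more than M neighbours of points[o_index]
    are found within distance dmin; otherwise return True."""
    o = points[o_index]
    neighbours = 0
    for j, q in enumerate(points):
        if j == o_index:
            continue  # assumption: o is not counted as its own neighbour
        if math.dist(o, q) <= dmin:
            neighbours += 1
            if neighbours > M:  # M + 1 neighbours found: o is not an outlier
                return False
    return True
```

The early exit means that objects in dense regions are dismissed after inspecting only a few neighbours, while true outliers still require a full search.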
Nested loop algorithm
Avoids index construction
Tries to minimize I/Os
Divides memory buffer space into two halves and
data set into several logical blocks
Cell based Algorithm
Complexity: $O(c^k + n)$, where c is a constant depending on the number of
cells and k is the dimensionality
Data space is partitioned into cells with side length $\mathit{dmin} / (2\sqrt{k})$
Two layers surround each cell
First layer: one cell thick
Second layer: $\lceil 2\sqrt{k} - 1 \rceil$ cells thick
Algorithm processes cells instead of objects


Maintains three counts: cell_count,
cell_+_1_layer_count, cell_+_2_layers_count
An object in a cell can be an outlier only if
cell_+_1_layer_count <= M; if not, no objects in
the cell are outliers
If cell_+_2_layers_count <= M, then all objects in
the cell are outliers
If it is > M, some objects in the cell may be outliers
Object-by-object processing has to be done for such cells
Density based Outlier detection


Previous methods assume data are uniformly
distributed
Data may have different density distributions
Difficulty in choosing dmin
Local outlier: an object is a local outlier if it is outlying relative to its local
neighbourhood, particularly wrt the density of the
neighbourhood
o2 is a local outlier wrt C2; o1 is also an outlier;
none of the objects in C1 are treated as outliers
Considers the degree to which an object is an outlier
Local outlier factor (LOF): the degree depends on how
isolated the object is wrt its surroundings

The k-distance of an object p is the maximal distance
that p gets from its k-nearest neighbors: the distance $d(p, o)$
to an object $o \in D$ such that
there are at least k objects $o' \in D \setminus \{p\}$ that are as close as
or closer to p than o: $d(p, o') \le d(p, o)$
there are at most k-1 objects $o'' \in D \setminus \{p\}$ that are closer to p
than o: $d(p, o'') < d(p, o)$
k-distance neighborhood of p
contains every object whose distance from p is not
greater than the MinPts (k)-distance of p
The reachability distance of an object p with respect
to object o is defined as
$\mathrm{reach\_dist}_{\mathrm{MinPts}}(p, o) = \max\{\mathrm{MinPts\text{-}distance}(o),\ d(p, o)\}$

Complexity : O(n log n)


Local reachability density of p is the inverse of the
average reachability distance of p with respect to its MinPts-nearest neighbors.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average, over p's MinPts-nearest neighbors o, of the ratio of the local
reachability density of o to that of p.
LOF is higher for outliers and close to 1 for objects deep inside a cluster
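The definitions of k-distance, reachability distance, local reachability density and LOF can be combined into a short brute-force sketch. It assumes Euclidean distance, more than `min_pts` objects in the data set, and no group of more than `min_pts` identical points (which would make a density infinite). All function names are illustrative.

```python
import math


def k_distance_neighbourhood(i, points, k):
    """Return (k-distance of points[i], indices in its k-distance neighbourhood)."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    k_dist = dists[k - 1][0]
    # Ties at the k-distance are all included, so there may be more than k.
    neighbours = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbours


def lof_scores(points, min_pts):
    """Compute the local outlier factor of every object."""
    n = len(points)
    info = [k_distance_neighbourhood(i, points, min_pts) for i in range(n)]

    def reach_dist(p, o):
        # max{ MinPts-distance(o), d(p, o) }
        return max(info[o][0], math.dist(points[p], points[o]))

    # Local reachability density: inverse of the average reachability distance.
    lrd = []
    for p in range(n):
        neigh = info[p][1]
        lrd.append(len(neigh) / sum(reach_dist(p, o) for o in neigh))

    # LOF: average ratio of the neighbours' density to the object's own density.
    lof = []
    for p in range(n):
        neigh = info[p][1]
        lof.append(sum(lrd[o] / lrd[p] for o in neigh) / len(neigh))
    return lof
```

On four points forming a unit square plus one point at (10, 10), with `min_pts=2`, the square corners score 1 and the distant point scores far above 1. This brute-force version is quadratic; the lower complexity quoted for LOF relies on an index for the neighbour queries.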
Deviation based Outlier detection
Identifies outliers by examining the main
characteristics of objects in a group
Objects that deviate from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can
distinguish unusual objects from among a series
of supposedly like objects
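Given a data set D, a sequence of subsets $\{D_1, D_2, \ldots, D_m\}$
is built such that $D_{j-1} \subset D_j$, where $D_j \subseteq D$; dissimilarities
are assessed between subsets in the sequence
Exception set: smallest subset of objects whose
removal results in the greatest reduction of
dissimilarity in the residual set
Dissimilarity function (e.g. variance): $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the n values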


Smoothing factor: Assesses how much the
dissimilarity can be reduced by removing the
subset from the original set of objects
Can be repeated with different orderings of the subsets to reduce the influence of input order
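The interplay of the dissimilarity function and the smoothing factor can be illustrated on one-dimensional numeric data. This sketch uses variance as the dissimilarity function and, as an assumption, scales the reduction by the cardinality of the remaining set. The candidate subsets are supplied by the caller instead of being generated as a nested sequence, which is a simplification, and all function names are hypothetical.

```python
def dissimilarity(values):
    """Variance of the values: (1/n) * sum((x_i - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def smoothing_factor(data, subset):
    """How much the dissimilarity drops when `subset` is removed from `data`,
    scaled by the size of what remains (scaling is an assumption here).
    `subset` must be a proper sub-collection of `data`."""
    remaining = list(data)
    for x in subset:
        remaining.remove(x)  # removes one occurrence per element
    return len(remaining) * (dissimilarity(data) - dissimilarity(remaining))


def exception_set(data, candidates):
    """Candidate subset with the largest smoothing factor."""
    return max(candidates, key=lambda s: smoothing_factor(data, s))
```

For `data = [1, 2, 3, 2, 1, 100]` and candidates `[[1], [100], [2, 3]]`, removing `[100]` yields by far the largest reduction in variance, so it is returned as the exception set.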
OLAP Data Cube technique
Uses data cubes to identify regions of anomalies
A cell value in a cube is an exception if it differs
significantly from an expected value
Visual cues guide the user to exceptional cells
User may drill down on cells flagged as exceptions
