Data Mining-Outlier Analysis

OUTLIER ANALYSIS
Outlier Analysis
Outlier data objects that are grossly different from
or inconsistent with the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers maybe valuable patterns
Fraud detection
Customized marketing
Medical Analysis
Outlier Mining
Given n data points and k expected number of
outliers find the top k dissimilar objects
Define inconsistent data
Residuals in Regression
Difficulties Multi-dimensional data, nonnumeric data

Mine the outliers
Visualization based methods
Not applicable to cyclic plots, high

dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach
Statistical Distribution-based Outlier detection

Assumes data follows a probability distribution and
uses discordancy test
Discordancy testing
Working hypothesis H: oi F i=1,2,..n
Test verifies whether an object oi is significantly
different from F
Significance probability SP(vi) = Prob(T>vi)
IF SP is small oi is discordant and working
hypothesis is rejected and alternate hypothesis
that oi comes from another distribution model G
is adopted
Alternative distributions
Inherent alternative distribution
Alternative hypothesis: All objects arise from

another distribution G
Mixture alternative distribution
Discordant values are not outliers but

contaminants from G H: oi (1G
i=1,2,..n
Slippage alternative distribution
Some Objects are independent observations

from a modified version of F (different
parameters)
Procedures for detecting Outliers
Block procedures
All are outliers or all are consistent

Consecutive Procedures
Inside-out procedure: Least likely object is

tested first
If it is an outlier more extreme values are

also considered as outliers
Disadvantages of Statistical Approach
Tests are for single attributes

Data distribution may not be known
Distance based Outlier Detection

Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T
such that at least a fraction p of the objects in T
lies at a distance greater than D from O
Object does not have enough neighbours
Avoids excessive computation of Statistical
models
If an object is an outlier according to a
discordancy test then o is DB(p, D) outlier for
some p and D
Index based Algorithm
Uses multi-dimensional indexing structures such
as k-d trees and R-trees
M maximum number of objects within dmin
neighborhood
Once M+1 neighbours are found o is not an
outlier
2
O(n k) apart from index construction
Nested loop algorithm
Avoids index construction
Tries to minimize I/Os
Divides memory buffer space into two halves and
data set into several logical blocks
Cell based Algorithm
k
Complexity : O(c +n) c- depends on number of
cells ; k dimensionality
Data space is partitioned into cells: dmin / 2k
Two layers surround each cell
First layer One cell thick
Second layer kAlgorithm processes cells instead of objects

Maintains three counts: cell_count,
cell_+_1_layer_count, cell_+_2_layers_count
An object in a cell is an outlier if
cell_+_1_layer_count <= M, if not, no objects in
the cell are outliers
If cell_+_2_layers_count, <= M then all objects in
cell Outliers
If > M some may be outliers
Object by object processing has to be done
Density based Outlier detection

Previous methods assume data are uniformly
distributed
Data may have different density distributions
Difficulty in choosing dmin
Local Outlier if its outlying relative to its local
neighbourhood particularly wrt the density of the
neighborhood
O2 is a local outlier wrt C2; o1 is also an outlier;
none of the objects in C1 are treated as outliers
Considers degree to which an object is an outlier
Local Outlier factor degree depends on how
isolated the object is wrt its surroundings
The k-distance of an object p is the maximal distance

that p gets from its k-nearest neighbors d(p, o)
there are at least k objects in D that are as close as
or closer to p than o; for k o d(p, o) <= d(p, o)
there are at most k-1 objects that are closer to p
than o; for k-1 o d(p, o) < d(p, o)
k-distance neighborhood
contains every object whose distance is not

greater than the MinPts (k)-distance of p
The reachability distance of an object p with respect
to object o, is defined as reach_distMinPts(p, o) = max
{ MinPts-distance(o), d(p, o) }
Complexity : O(n log n)

Local reachability density of p is the inverse of the
average reachability density based on the MinPtsnearest neighbors of p.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average of the ratio of the local
reachability density of p and those of ps MinPtsnearest neighbors.
LOF is higher for outliers
Deviation based Outlier detection
Identifies outliers by examining the main
characteristics of objects in a group
Objects that deviate from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can
distinguish unusual objects from among a series
of supposedly like objects
Sequential exception technique
Given a data set D a sequence of subsets {D1, D2,
..Dm} is built such that Dj-1 Dj; Dissimilarities
are assessed between subsets in the sequence
Exception Set Smallest subset of objects whose
removal results in greatest reduction of
dissimilarity
Dissimilarity function 1/n i=1 n (xi-x)2

Smoothing factor: Assesses how much the
dissimilarity can be reduced by removing the
subset from the original set of objects
Can be repeated to avoid the influence of order
OLAP Data Cube technique
Uses data cubes to identify regions of anomalies
A cell value in a cube is an exception if it differs
significantly from an expected value
Visualization effects guide user
May drill down

Data Mining-Outlier Analysis

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining-Outlier Analysis

Uploaded by

Copyright:

Available Formats

OUTLIER ANALYSIS

Difficulties Multi-dimensional data, nonnumeric data

Visualization based methods

Not applicable to cyclic plots, high

Statistical Distribution-based Outlier detection

Alternative hypothesis: All objects arise from

Discordant values are not outliers but

Some Objects are independent observations

All are outliers or all are consistent

Inside-out procedure: Least likely object is

If it is an outlier more extreme values are

Tests are for single attributes

Distance based Outlier Detection

First layer One cell thick

Second layer kAlgorithm processes cells instead of objects

If > M some may be outliers

Object by object processing has to be done

Density based Outlier detection

The k-distance of an object p is the maximal distance

contains every object whose distance is not

Complexity : O(n log n)

Dissimilarity function 1/n i=1 n (xi-x)2

You might also like