
Outlier mining using N-gram technique with compression models

Swati Vashisht
Assistant Professor
DITSE Gr. Noida
Shubhi Gupta
Assistant Professor
DITSE Gr. Noida
Atul Mani
Assistant Professor
RKGE Ghaziabad
Abstract
Outlier mining is concerned with data objects that do not comply with the general behavior or model of the data, i.e. objects that are either different from or inconsistent with the remaining set of data. Outliers can be detected with the N-gram technique, but this technique needs a large amount of storage space for its metadata and data dictionary. A number of compression models are used for compressing text and images, e.g. the context tree weighting method, the Burrows-Wheeler transform (a block-sorting preprocessing step that makes compression more efficient), LZ77, LZ78 and LZW. In this paper we propose applying a compression technique (the LZW compression model) to this data, which makes the algorithm more efficient in terms of storage space.
Keywords: Outliers, Compression, N-gram technique.
Introduction
Outliers can be caused by measurement or execution errors. For example, an employee salary displayed as a negative value could be caused by a program's default setting for an unrecorded salary. Alternatively, outliers may be the result of inherent data variability. The salaries of the executive officers of a company could naturally stand out as outliers among the salaries of the other employees in the firm.
Outlier detection and analysis is an interesting data mining task, referred to as outlier mining, that has many real-life applications in different domains. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. In contrast to traditional data mining tasks, which aim to find general patterns applicable to a large amount of data, outlier detection aims to find the rare data whose behavior is exceptional when compared with the rest of the data.
Outlier mining can be described as follows: given a set of n data points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
N-Grams Technique
N-gram techniques have been studied and used in many information retrieval tasks. They have been applied in different domains such as language identification [5], document categorization and comparison, robust handling of noisy text, and many other natural language processing applications [3]. N-gram-based systems succeed because strings are decomposed into smaller parts, so that errors (misspellings, typographical errors and errors arising from optical character recognition (OCR)) affect only a limited number of those parts rather than the whole word. The number of n-grams (higher order n-grams) common to two strings is a measure of the similarity between the words. This measure is resistant to a large variety of textual errors.
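The similarity measure described above can be illustrated with a minimal sketch. The function names `char_ngrams` and `shared_ngrams` and the choice of n = 3 are assumptions made for illustration only; the paper does not prescribe a particular implementation.

```python
def char_ngrams(text, n):
    """Return the set of contiguous character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(a, b, n=3):
    """Count the distinct n-grams that strings `a` and `b` have in common."""
    return len(char_ngrams(a, n) & char_ngrams(b, n))


# A one-letter misspelling still shares most of its trigrams with the
# correct word, whereas exact word matching would treat the two strings
# as unrelated. An unrelated word shares few or no trigrams.
print(shared_ngrams("intelligence", "inteligence"))
print(shared_ngrams("intelligence", "compression"))
```

Because the count is taken over sets, a repeated n-gram is counted once; a frequency-weighted variant would be an equally reasonable choice.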
N-gram techniques have been used for a number of text processing tasks. The n-grams of a string of length k are the contiguous substrings of size n obtained by slicing the string. N-gram systems suffer from large memory requirements because of the huge number of n-gram vectors resulting from the slicing. For example, a string of length k has k-n+1 possible n-grams, ignoring all n-grams with trailing or preceding blanks. This problem notwithstanding, n-gram systems have some advantages over full word matching in web page categorization for the following reasons:
(1) A large number of documents are posted on the web without going through any form of thorough error checking, because checking might be too costly or the documents may be time dependent. Such documents may contain a significant number of errors, making full word matching less efficient. In such cases, n-gram techniques offer a more efficient and effective means of comparing strings because n-gram systems support partial matching of strings.
(2) All n-grams have the same length, which gives this technique a great advantage over words, which may have different lengths.
N-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters (e.g., letters of alphabets) are modeled for different languages [7]. However, the term n-gram can also be used for any set of characters occurring in a string, not necessarily adjacent ones (e.g., an n-gram consisting of the second, third and fifth characters in a string). The n-grams of a string of length k are obtained by sliding a window of size n over the string, starting at the first position and moving the window one position at a time until it reaches the end of the string. The characters that appear in the window at each position form one n-gram of that string. For example, the string "intelligence" has the 5-grams "intel", "ntell", "telli", "ellig", "llige", "ligen", "igenc", "gence". The number of possible n-grams resulting from a string of length k is (k-n+1), where n is the size of the n-gram.
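The sliding-window procedure can be sketched in a few lines. The helper name `sliding_ngrams` is hypothetical; the sketch only reproduces the "intelligence" example and the (k-n+1) count stated in the text.

```python
def sliding_ngrams(text, n):
    """Slide a window of size n over `text`, one position at a time."""
    # range() is empty when n > len(text), so short strings yield no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word = "intelligence"
grams = sliding_ngrams(word, 5)
print(grams)
# The number of n-grams should equal k - n + 1 for a string of length k.
print(len(grams) == len(word) - 5 + 1)
```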
A study of different n-gram sizes reveals that n-grams that are too short tend to capture similarities between words that are due to factors other than semantic relatedness. Two words may share a significant number of common n-grams and therefore seem related when they are not. Similarly, n-grams that are too long fail to capture similarities between similar but different words; they behave like a full word match. N-grams of reasonable length, however, determine similarity between different but related words better than n-grams of shorter length.
Data compression is the process of encoding data so that it takes less storage space or less transmission time than it would if it were not compressed. A code is a mapping from input symbols to sequences of bits. The process of converting an input sequence into a binary sequence by replacing each symbol in the message with its corresponding code word is called encoding. When the purpose of encoding is to produce a compact representation of the original message, the process is called compression [1].
Lossless data compression algorithms exploit regularities in the data to produce compressed representations from which the data can be reconstructed exactly. Lossless compression methods have traditionally been used for text compression and for compressing text encoded in binary data files such as semi-structured documents, database files, computer programs, etc.
Statistical modeling algorithms for text include:
1. Context tree weighting method (CTW)
2. Burrows-Wheeler transform (block-sorting preprocessing that makes compression more efficient)
3. LZ77 (used by DEFLATE) & LZ78
4. LZW
Lempel-Ziv algorithms: These compression techniques use a symbol dictionary to represent recurring patterns. The dictionary is dynamically updated during compression as new patterns occur. To decode the data, the receiving system needs the same dictionary: with a fixed dictionary it is passed along with the transmitted data or stored with the compressed file, whereas LZ78 and LZW decoders rebuild it from the code stream itself [2].
Problem Formulation
A large number of techniques exist for detecting outliers on the web, but almost none of the existing algorithms compress the given documents before detection.
Hence, in this paper we propose applying a compression algorithm before the N-gram technique. The compression step compresses the data dictionary and the preprocessed data; the N-gram technique then analyzes the contents of the metadata of web pages of a related category and identifies the web pages whose content differs significantly from that of the other pages.
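As an illustration of the detection step, the following minimal sketch ranks pages by how dissimilar their n-gram profiles are from those of the other pages and returns the top k. It is an assumption-laden sketch, not the authors' algorithm: the function names, the use of one minus the Jaccard overlap as the dissimilarity measure, the trigram size and the sample page texts are all hypothetical.

```python
def ngram_profile(text, n=3):
    """Set of character n-grams of a page's (lower-cased) metadata text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dissimilarity(a, b):
    """One minus the Jaccard overlap of two n-gram sets (assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def top_k_outliers(pages, k, n=3):
    """Return the names of the k pages least similar to the other pages."""
    profiles = {name: ngram_profile(text, n) for name, text in pages.items()}
    scores = {}
    for name, profile in profiles.items():
        others = [p for other, p in profiles.items() if other != name]
        total = sum(dissimilarity(profile, p) for p in others)
        scores[name] = total / max(len(others), 1)
    # Highest mean dissimilarity first.
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical metadata for four pages of one category.
pages = {
    "page1": "data compression with lzw dictionary",
    "page2": "lzw dictionary based data compression",
    "page3": "compression of data using lzw",
    "page4": "cheap holiday flights and hotels",
}
print(top_k_outliers(pages, k=1))
```

In the proposed scheme the metadata would be stored in compressed form and decompressed before profiling; that step is omitted here to keep the sketch short.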
Proposed Work
Outlier detection can be done using the N-gram technique, but it needs a large storage space to hold the metadata. Here we apply the LZW algorithm to compress the data before it is used for outlier detection.
Compression using LZW technique
LZW starts with a dictionary containing the "standard" character set. It then reads the data one character at a time and encodes it as index values from the dictionary. When it comes across a new substring, it adds it to the existing dictionary; when it meets a substring it has already seen, it just reads in a new character and concatenates it with the current string to get a new substring. The next time LZW revisits a substring, that substring is encoded using a single number. Compression is achieved by repeating these steps [2].
Pseudocode:
string s;
char ch;
s = empty string;
while (there is still data to be read)
{
    ch = read a character;
    if (dictionary contains s+ch)
    {
        s = s+ch;
    }
    else
    {
        encode s to output file;
        add s+ch to dictionary;
        s = ch;
    }
}
encode s to output file;
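The pseudocode above can be turned into a short runnable sketch. The function name `lzw_encode` and the `alphabet` parameter (the initial dictionary, one entry per symbol) are assumptions for illustration; codes are returned as a list of integers rather than written to a file, and every character of the input is assumed to occur in the alphabet.

```python
def lzw_encode(data, alphabet):
    """Encode `data` as a list of dictionary indices, following the pseudocode."""
    # Initial dictionary: one entry per symbol, indexed from 0.
    dictionary = {symbol: index for index, symbol in enumerate(alphabet)}
    codes = []
    s = ""
    for ch in data:
        if ch not in dictionary:
            raise ValueError(f"character {ch!r} is not in the initial dictionary")
        if s + ch in dictionary:
            # Known substring: keep extending the current string.
            s = s + ch
        else:
            # New substring: emit the code for s and remember s+ch.
            codes.append(dictionary[s])
            dictionary[s + ch] = len(dictionary)
            s = ch
    if s:
        # Emit the code for whatever is left after the last character.
        codes.append(dictionary[s])
    return codes


print(lzw_encode("ababaabaaa", "ab"))
```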
Now, suppose we wish to compress "ababaabaaa" and that we are only using the initial dictionary:
Index  Entry
0      a
1      b
The encoding steps:
INPUT        CURRENT STRING   SEEN THIS BEFORE   ENCODED OUTPUT   NEW DICTIONARY ENTRY
a            a                yes                none             none
ab           ab               no                 0                ab/2
aba          ba               no                 0 1              ba/3
abab         ab               yes                0 1              none
ababa        aba              no                 0 1 2            aba/4
ababaa       aa               no                 0 1 2 0          aa/5
ababaab      ab               yes                0 1 2 0          none
ababaaba     aba              yes                0 1 2 0          none
ababaabaa    abaa             no                 0 1 2 0 4        abaa/6
ababaabaaa   aa               yes                0 1 2 0 4        none
After the last character has been read, the remaining string aa is encoded as 5, so the final encoded output is 0 1 2 0 4 5.
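To check that such an encoding is lossless, a decoder can rebuild the dictionary from the code stream alone. The sketch below is a minimal illustration under assumed names (`lzw_encode`, `lzw_decode`, an `alphabet` string as the initial dictionary); the encoder is repeated only so that the block is self-contained and the round trip can be run.

```python
def lzw_encode(data, alphabet):
    """LZW encoder following the pseudocode; returns a list of indices."""
    dictionary = {symbol: index for index, symbol in enumerate(alphabet)}
    codes, s = [], ""
    for ch in data:
        if s + ch in dictionary:
            s = s + ch
        else:
            codes.append(dictionary[s])
            dictionary[s + ch] = len(dictionary)
            s = ch
    if s:
        codes.append(dictionary[s])
    return codes


def lzw_decode(codes, alphabet):
    """Rebuild the original string, recreating the dictionary on the fly."""
    dictionary = {index: symbol for index, symbol in enumerate(alphabet)}
    if not codes:
        return ""
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the entry the encoder had only just created.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid code {code}")
        output.append(entry)
        # Mirror the encoder: previous string plus first symbol of this entry.
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(output)


text = "ababaabaaa"
print(lzw_decode(lzw_encode(text, "ab"), "ab") == text)
```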
Conclusion
The main advantage of the compression algorithm is that it can improve the detection function, because compressed data takes less time to match against a pattern. The N-gram technique is efficient in determining similarity between different but related words in text, and N-grams support partial matching of strings with errors. The storage space problem can be solved, but at additional computational cost, and compressed data can be somewhat more complex to handle than the original data.
Compression algorithms can cause delay, and in a real-time environment many applications may not tolerate any delay. In that case we may need to tune the compression levels and/or implement other techniques to remove those delays. Moreover, compression affects the portability of files, because the recipient system needs the same algorithm to decompress the data.
Future scope
Areas of future research include experimental evaluation of the N-gram technique in terms of response time in order to improve its efficiency, and different weighting methods for data dictionary words that will help to decrease response time. In future, a new approach could be developed for matching a full word that minimizes errors due to fixed-length matching in the N-gram technique. Future scope also includes analysis of the performance of different outlier mining algorithms.
References
1. Bratko Andrej, "Text mining using data compression models"
2. Sayood Khalid, "Introduction to data compression"
3. Cavnar W. B., Trenkle J. M., "N-Gram-Based Text Categorization", Proceedings of 3rd Annual Symposium on Document and Information Retrieval, 1994
4. Charu C. Aggarwal, Philip S. Yu, "Outlier Detection for High Dimensional Data", IBM T. J. Watson Research Center, Yorktown Heights, ACM SIGMOD 2001, May 21-24
5. Damashek M., "Gauging Similarity with N-Grams: Language Independent Categorization of Text", Science, 267 (1995), pp 843-848
6. F. Angiulli, C. Pizzuti, "Fast Outlier Detection in High Dimensional Space", In Proc. of PKDD'02, pp 25-36, 2002
7. Ted Dunning (1994), "Statistical Identification of Language", New Mexico State University, Technical Report MCCS 94-273
8. M. Agyemang, Barker K., Alhajj S. R., "Mining Web Content Outliers using Structure Oriented Weighting Techniques and N-Grams", Department of Computer Science, University of Calgary, 2500 University Drive N.W., Calgary, AB, Canada T2N 1N4
9. V. Barnett and T. Lewis, "Outliers in Statistical Data", John Wiley & Sons, 1994
10. Zengyou He, Xiaofei Xu, Shengchun Deng, "A Fast Greedy Algorithm for Outlier Mining"
