WVU, Data Mining: Fall '06
CS591O :: LCSEE :: WVU
Lectures
Lecture 3a

(These notes are extended from Witten & Frank, chapter 4.)
1R: simplicity first
Simple algorithms often work surprisingly well.
In any case, try the simplest before trying the most complicated. Why try simple? Well... comparing a supposedly more sophisticated approach against a seemingly more stupid one (the "strawman") is good science.
(Warning: sometimes, straw men don't burn.)
1R ([Holte93]) assumes that all attributes are independent and one attribute is more powerful than the rest.
It builds a 1-level decision tree. In other words, it generates a set of rules that all test on one particular attribute.
Basic version (assuming nominal attributes):
One branch for each of the attribute's values
Each branch assigns the most frequent class
Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
Choose the attribute with the lowest error rate
Pseudocode for 1R:

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is always treated as a separate attribute value.
Example:

OUTLOOK   TEMP  HUMIDITY  WINDY  PLAY
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
OneR:

ATTRIBUTE    RULES            ERRORS  TOTAL ERRORS
Outlook      Sunny => No      2/5     4/14
             Overcast => Yes  0/4
             Rainy => Yes     2/5
Temperature  Hot => No*       2/4     5/14
             Mild => Yes      2/6
             Cool => Yes      1/4
Humidity     High => No       3/7     4/14
             Normal => Yes    1/7
Windy        False => Yes     2/8     5/14
             True => No*      3/6

(A "*" marks an arbitrary choice between equally frequent classes.)
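The table above can be generated mechanically. Below is a minimal Python sketch of 1R for nominal attributes (my rendering, not Holte's original code). On the weather data it recovers Outlook with 4/14 errors; note that Humidity also scores 4/14, and this sketch breaks the tie by keeping the first attribute tried, where 1R proper may choose arbitrarily.

```python
from collections import Counter

def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the best 1-level tree."""
    best = None
    for attr in attributes:
        by_value = {}                      # value -> Counter of classes
        for row in rows:
            by_value.setdefault(row[attr], Counter())[row[target]] += 1
        # each value predicts its most frequent class...
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # ...and every non-majority instance is an error
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

names = ("Outlook", "Temp", "Humidity", "Windy", "Play")
weather = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
rows = [dict(zip(names, r)) for r in weather]
attr, rules, errors = one_r(rows, names[:-1], "Play")
print(attr, errors)   # Outlook 4
```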
Numeric attributes are discretized: the range of the attribute is divided into a set of intervals.
Instances are sorted according to the attribute's values.
Breakpoints are placed where the (majority) class changes (so that the total error is minimized).

Example: temperature from the weather data:

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes|No |Yes Yes Yes|No  No  Yes|Yes Yes|No |Yes Yes|No
The problem of overfitting

The discretization procedure is very sensitive to noise: a single instance with an incorrect class label will most likely result in a separate interval.
Simple solution: enforce a minimum number of instances in the majority class per interval.
Weather data example (with the minimum set to B=3):

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes|No  No  Yes Yes Yes|No  Yes Yes No
Result of overfitting avoidance (B=3):
http://www.csee.wvu.edu/~timm/cs591o/old/BasicMethods.html 1/7
1/5/2017 DataMining:BasicMethods
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes|No  Yes Yes No
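This procedure can be sketched in Python. The following is my reading of the method (not code from the notes): grow each interval until its majority class has B instances, keep extending it while the majority-class run (or a run of equal attribute values) continues, then merge adjacent intervals that predict the same class. On the temperature column it reproduces the partitions above and the 77.5 split used in the rule table below.

```python
def discretize(pairs, B=3):
    """pairs: (value, label) sorted by value. Returns a list of intervals."""
    intervals, current = [], []
    for i, (value, label) in enumerate(pairs):
        current.append((value, label))
        counts = {}
        for _, l in current:
            counts[l] = counts.get(l, 0) + 1
        majority = max(counts, key=counts.get)   # ties: first label seen
        close = counts[majority] >= B
        if close and i + 1 < len(pairs):
            nxt_value, nxt_label = pairs[i + 1]
            # extend through a majority-class run, and never split
            # identical attribute values across intervals
            if nxt_label == majority or nxt_value == value:
                close = False
        if close:
            intervals.append(current)
            current = []
    if current:
        intervals.append(current)
    return intervals

def merge(intervals):
    """Fuse neighbouring intervals whose majority class is the same."""
    def majority(iv):
        counts = {}
        for _, l in iv:
            counts[l] = counts.get(l, 0) + 1
        return max(counts, key=counts.get)
    merged = [intervals[0]]
    for iv in intervals[1:]:
        if majority(iv) == majority(merged[-1]):
            merged[-1] = merged[-1] + iv
        else:
            merged.append(iv)
    return merged

temps = [(64, "Yes"), (65, "No"), (68, "Yes"), (69, "Yes"), (70, "Yes"),
         (71, "No"), (72, "No"), (72, "Yes"), (75, "Yes"), (75, "Yes"),
         (80, "No"), (81, "Yes"), (83, "Yes"), (85, "No")]
final = merge(discretize(temps, B=3))
print([len(iv) for iv in final])                 # [10, 4]
print((final[0][-1][0] + final[1][0][0]) / 2)    # 77.5
```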
Repeating the 1R procedure with the discretized numeric attributes:
ATTRIBUTE    RULES                      ERRORS  TOTAL ERRORS
Outlook      Sunny => No                2/5     4/14
             Overcast => Yes            0/4
             Rainy => Yes               2/5
Temperature  <= 77.5 => Yes             3/10    5/14
             > 77.5 => No*              2/4
Humidity     <= 82.5 => Yes             1/7     3/14
             > 82.5 and <= 95.5 => No   2/6
             > 95.5 => Yes              0/1
Windy        False => Yes               2/8     5/14
             True => No*                3/6
1R was described in a paper by [Holte93]:
It contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data).
The minimum number of instances was set to 6 after some experimentation.
1R's simple rules performed not much worse than much more complex decision trees.
Simplicity first pays off!
Bayes Classifiers 101

A Bayes classifier is a simple learning scheme.

Advantages:
Tiny memory footprint
Fast training, fast learning
Simplicity
Often works surprisingly well

Assumptions:
Learning is done best via statistical modeling
Attributes are
    equally important
    statistically independent (given the class value)

This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known).
Although based on assumptions that are almost never correct, this scheme works well in practice [Domingos97].
Example

weather.symbolic.arff

outlook   temperature  humidity  windy  play
rainy     cool         normal    TRUE   no
rainy     mild         high      TRUE   no
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
sunny     mild         high      FALSE  no
overcast  cool         normal    TRUE   yes
overcast  hot          high      FALSE  yes
overcast  hot          normal    FALSE  yes
overcast  mild         high      TRUE   yes
rainy     cool         normal    FALSE  yes
rainy     mild         high      FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     cool         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
This data can be summarized as follows:

          Outlook            Temperature             Humidity
          ===================================================
          Yes  No            Yes  No                 Yes  No
Sunny      2    3    Hot      2    2      High        3    4
Overcast   4    0    Mild     4    2      Normal      6    1
Rainy      3    2    Cool     3    1
Sunny     2/9  3/5   Hot     2/9  2/5     High       3/9  4/5
Overcast  4/9  0/5   Mild    4/9  2/5     Normal     6/9  1/5
Rainy     3/9  2/5   Cool    3/9  1/5

          Windy              Play
          =========================
          Yes  No        Yes   No
False      6    2         9     5
True       3    3
False     6/9  2/5       9/14  5/14
True      3/9  3/5
So, what happens on a new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

First find the likelihood of the two classes:

For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
For "no"  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Conversion into a probability by normalization:

P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795

So, we aren't playing golf today.
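The arithmetic above, kept as exact fractions (the counts are read straight off the summary table):

```python
from fractions import Fraction as F

# P[Sunny|yes] * P[Cool|yes] * P[High|yes] * P[True|yes] * P[yes]
like_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
# P[Sunny|no] * P[Cool|no] * P[High|no] * P[True|no] * P[no]
like_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# normalize the two likelihoods into probabilities
p_yes = like_yes / (like_yes + like_no)
p_no = like_no / (like_yes + like_no)

print(float(like_yes), float(like_no))  # ~0.0053  ~0.0206
print(float(p_yes), float(p_no))        # ~0.205   ~0.795
```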
Bayes' rule

More generally, the above is just an application of Bayes' Theorem.

Probability of event H given evidence E:

              Pr[E|H] * Pr[H]
    Pr[H|E] = ---------------
                   Pr[E]

A priori probability of H = Pr[H]:
    probability of the event before evidence has been seen.
A posteriori probability of H = Pr[H|E]:
    probability of the event after evidence has been seen.

Classification learning: what's the probability of the class given an instance?
    Evidence E = instance
    Event H = class value for instance

Naive Bayes assumption: evidence can be split into independent parts (i.e., the attributes of the instance!):

              Pr[E1|H] * Pr[E2|H] * ... * Pr[En|H] * Pr[H]
    Pr[H|E] = --------------------------------------------
                                Pr[E]
We used this above. Here's our evidence:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Here's the probability for "yes":

Pr[yes|E] = Pr[Outlook=Sunny|yes] *
            Pr[Temperature=Cool|yes] *
            Pr[Humidity=High|yes] *
            Pr[Windy=True|yes] *
            Pr[yes] / Pr[E]
          = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
Numerical errors:
From multiplication of lots of small numbers
Use the standard fix: don't multiply the numbers, add the logs
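A quick illustration of why, in Python: two tiny probabilities multiply to a hard zero in floating point, while their logs add without trouble.

```python
import math

# Direct multiplication can underflow to exactly zero...
assert 1e-200 * 1e-200 == 0.0

# ...so sum logs instead, and exponentiate only at the end (if at all).
probs = [2/9, 3/9, 3/9, 3/9, 9/14]       # the "yes" factors from above
log_like = sum(math.log(p) for p in probs)
print(math.exp(log_like))                 # ~0.0053, same as before
```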
Missing values

Missing values are a problem for any learner. Naive Bayes' treatment of missing values is particularly elegant.
During training: the instance is not included in the frequency count for that attribute value/class combination.
During classification: the attribute will be omitted from the calculation.

Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of "yes" = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
Likelihood of "no"  = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
The "low frequencies problem"

What if an attribute value doesn't occur with every class value (e.g. "Humidity = High" for class "yes")? Then the probability will be zero:

    Pr[Humidity=High|yes] = 0

The a posteriori probability will also be zero: Pr[yes|E] = 0 (no matter how likely the other values are!).

So use an estimator for low-frequency attribute ranges:
    add a little "m" to the count for every attribute value/class combination
    (the Laplace estimator).
    Result: probabilities will never be zero!
And use an estimator for low-frequency classes:
    add a little "k" to class counts
    (the M-estimate).

Magic numbers: m=2, k=1.
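Plugging in the magic numbers, a small sketch of the two estimators (they match the pseudocode in the next section; the `0` count stands for a hypothetical never-seen attribute value):

```python
def class_prior(n_class, n_total, n_classes, k=1):
    # add a little "k" to the class counts: (N[H]+k) / (I + k*C)
    return (n_class + k) / (n_total + k * n_classes)

def likelihood(count, n_class, prior, m=2):
    # add a little "m*prior" to the frequency: (F + m*prior) / (N[H] + m)
    return (count + m * prior) / (n_class + m)

prior_yes = class_prior(9, 14, 2)     # (9+1)/(14+2) = 0.625
p_zero = likelihood(0, 9, prior_yes)  # a zero count no longer kills the product
print(prior_yes, p_zero)              # 0.625 0.1136...
```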
Pseudocode

Here's the pseudocode of the Naive Bayes classifier preferred by [Yang03] (p4). It uses these globals:

# "F": frequency tables
# "I": number of instances
# "C": how many classes?
# "N": instances per class
When learning from training examples, update a frequency table:

function update(class, train) {
    # OUTPUT: changes to the globals.
    # INPUT:  a "train"ing example containing
    #         attribute/value pairs in some "class"
    I++                            # update number of instances
    if (++N[class] == 1)           # update counts for each class
    then C++                       # maybe, increase number of classes
    fi
    for <attr,value> in train
    do  if (value != "?")          # skip missing values
        then F[class,attr,value]++ # increment frequency counts
        fi
    done
}
When testing, find the likelihood of each hypothetical class and return the one that is most likely.

function classify(test) {
    # OUTPUT: "class", the most likely hypothesis for the test case.
    # INPUT:  a "test" case containing attribute/value pairs.
    m = 2                          # control for Laplace estimates
    k = 1                          # control for M-estimates
    like = -100000                 # initial, impossibly low likelihood
    for (H in N)                   # check all hypotheses
    do  prior = (N[H]+k) / (I+(k*C))   # here's P[H]
        temp = log(prior)          # use logs for small values
        for <attr,value> in test   # for all items in the test case
        do  if (value != "?")      # skip missing values
            then inc = (F[H,attr,value] + (m*prior)) / (N[H]+m)  # P[Ei|H]
                 temp += log(inc)  # adding logs = multiplication
            fi
        done
        if (temp >= like)          # if we've got a better likelihood
        then like = temp
             class = H             # save this hypothesis and max likelihood
        fi
    done
    return class                   # return the class with maximum likelihood
}
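A direct Python translation of the two functions above (same globals, same m and k, logs instead of raw products), checked on the symbolic weather data:

```python
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self, m=2, k=1):
        self.m, self.k = m, k
        self.F = defaultdict(int)  # (class, attr, value) -> frequency
        self.N = defaultdict(int)  # class -> number of instances
        self.I = 0                 # total number of instances

    def update(self, klass, train):
        self.I += 1
        self.N[klass] += 1
        for attr, value in train.items():
            if value != "?":                 # skip missing values
                self.F[(klass, attr, value)] += 1

    def classify(self, test):
        best, like = None, float("-inf")     # impossibly low likelihood
        C = len(self.N)
        for h in self.N:                     # check all hypotheses
            prior = (self.N[h] + self.k) / (self.I + self.k * C)
            temp = math.log(prior)           # logs, not raw products
            for attr, value in test.items():
                if value != "?":             # skip missing values
                    inc = (self.F[(h, attr, value)] + self.m * prior) \
                          / (self.N[h] + self.m)
                    temp += math.log(inc)
            if temp >= like:
                best, like = h, temp
        return best

names = ("outlook", "temperature", "humidity", "windy")
data = [
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("rainy", "mild", "high", "TRUE", "no"),
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
]
nb = NaiveBayes()
for *attrs, play in data:
    nb.update(play, dict(zip(names, attrs)))
print(nb.classify(dict(zip(names, ("sunny", "cool", "high", "TRUE")))))  # no
```

With the m- and k-estimates switched on, the probabilities shift slightly from the hand computation earlier, but the decision is the same: don't play.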
Handling Numerics

The above code assumes that the attributes are discrete. The usual approximation is to assume a "gaussian" (i.e. a "normal" or bell-shaped curve) for the numerics.
The probability density function for the normal distribution is defined by the mean and standardDev (standard deviation).

Given:

n: the number of values
sum: the sum of the values, i.e. sum = sum + value
sumSq: the sum of the squares of the values, i.e. sumSq = sumSq + value*value

Then:

function mean(sum, n) {
    return sum/n
}

function standardDeviation(sumSq, sum, n) {
    return sqrt((sumSq - (sum*sum)/n) / (n-1))
}

function gaussianPdf(mean, standardDev, x) {
    pi = 1068966896/340262731  # good to 17 decimal places
    return 1/(standardDev*sqrt(2*pi)) *
           e^(-1 * (x-mean)^2 / (2*standardDev*standardDev))
}
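The same three functions in Python (using math.pi and math.exp directly). The checks at the end reproduce the mean/standard-deviation row of the statistics table and the density value used in the worked example below.

```python
import math

def mean(total, n):
    return total / n

def standard_deviation(sum_sq, total, n):
    # incremental form: only n, sum, and sum-of-squares are needed
    return math.sqrt((sum_sq - (total * total) / n) / (n - 1))

def gaussian_pdf(mu, sigma, x):
    return (1 / (sigma * math.sqrt(2 * math.pi))) \
           * math.exp(-((x - mu) ** 2) / (2 * sigma * sigma))

# temperature values for class "yes" in the numeric weather data
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
s = sum(temps_yes)
ss = sum(t * t for t in temps_yes)
n = len(temps_yes)
print(mean(s, n), round(standard_deviation(ss, s, n), 1))  # 73.0 6.2
print(round(gaussian_pdf(73, 6.2, 66), 4))                 # 0.034
```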
For example:

outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
This generates the following statistics:

          Outlook              Temperature              Humidity
          ======================================================
          Yes  No              Yes   No                 Yes    No
Sunny      2    3              83    85                 86     85
Overcast   4    0              70    80                 96     90
Rainy      3    2              68    65                 80     70
Sunny     2/9  3/5   mean      73    74.6    mean       79.1   86.2
Overcast  4/9  0/5   std dev   6.2   7.9     std dev    10.2   9.7
Rainy     3/9  2/5

          Windy                Play
          ==============================
          Yes  No          Yes   No
False      6    2           9     5
True       3    3
False     6/9  2/5        9/14  5/14
True      3/9  3/5
Example density value:

f(temperature=66 | yes) = gaussianPdf(73, 6.2, 66) = 0.0340

Classifying a new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?
Likelihood of "yes" = 2/9 * 0.0340 * 0.0221 * 3/9 * 9/14 = 0.000036
Likelihood of "no"  = 3/5 * 0.0279 * 0.0380 * 3/5 * 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%
Note: missing values during training are not included in the calculation of the mean and standard deviation.

BTW, an alternative to the above is to apply some discretization policy to the data, e.g. [Yang03]. Such discretization is good practice since it can dramatically improve the performance of a Naive Bayes classifier (see [Dougherty95]).
Not so "Naive" Bayes

Why does Naive Bayes work so well? [Domingos97] offers one analysis:
They offer one example with three attributes where a "Naive" and an "optimal" Bayes perform nearly the same.
They generalized that to conclude that "Naive" Bayes is only really naive in a vanishingly small number of cases.
The three-attribute example is given below. For the generalized example, see their paper.

Consider a Boolean concept, described by three attributes A, B and C.
Assume that the two classes, denoted by + and -, are equiprobable (P(+) = P(-) = 1/2).
Let A and C be independent, and let A = B (i.e., A and B are completely dependent). Therefore B should be ignored, and the optimal classification procedure for a test instance is to assign it to (i) class + if

    P(A|+)*P(C|+) - P(A|-)*P(C|-) > 0,
and (ii) to class - if the inequality has the opposite sign, and (iii) to an arbitrary class if the two sides are equal.

Note that the Bayesian classifier will take B into account as if it was independent from A, and this will be equivalent to counting A twice. Thus, the Bayesian classifier will assign the instance to class + if

    P(A|+)^2 * P(C|+) - P(A|-)^2 * P(C|-) > 0,

and to - otherwise.

Applying Bayes' theorem, P(A|+) can be re-expressed as

    P(A) * P(+|A) / P(+)

and similarly for the other probabilities. Since P(+) = P(-), after canceling like terms this leads to the equivalent expressions

    P(+|A)*P(+|C) - P(-|A)*P(-|C) > 0

for the optimal decision, and

    P(+|A)^2 * P(+|C) - P(-|A)^2 * P(-|C) > 0

for the Bayesian classifier. Let

    P(+|A) = p
    P(+|C) = q.

Then class + should be selected when

    p*q - (1-p)*(1-q) > 0

which is equivalent to

    q > 1 - p    [Optimal Bayes]

With the Bayesian classifier, it will be selected when

    p^2*q - (1-p)^2*(1-q) > 0

which is equivalent to

    q > (1-p)^2 / (p^2 + (1-p)^2)    [Simple Bayes]

The two curves are shown in the following figure. The remarkable fact is that, even though the independence assumption is decisively violated because B = A, the Bayesian classifier disagrees with the optimal procedure only in the two narrow regions that are above one of the curves and below the other; everywhere else it performs the correct classification.
Thus, for all problems where (p, q) does not fall in those two small regions, the Bayesian classifier is effectively optimal.
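The size of those disagreement regions can be estimated numerically. Sampling (p, q) on a grid suggests the two decision rules disagree on roughly a tenth of the unit square (the exact fraction depends on the grid, so treat the number as an estimate, not a result from [Domingos97]):

```python
# Count how often the Simple Bayes decision p^2*q > (1-p)^2*(1-q)
# disagrees with the Optimal Bayes decision q > 1-p on a grid.
steps = 201
disagree = 0
for i in range(steps):
    for j in range(steps):
        p, q = i / (steps - 1), j / (steps - 1)
        optimal = q > 1 - p                         # Optimal Bayes
        naive = p * p * q > (1 - p) ** 2 * (1 - q)  # Simple Bayes
        disagree += (optimal != naive)
frac = disagree / (steps * steps)
print(round(frac, 3))   # ~0.1: the two rules agree on ~90% of (p, q) pairs
```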
Review Questions

OneR

Discuss: "OneR offers an inductive generalization but Naive Bayes does not."
State the pseudocode of 1R.
What is overfitting? Discuss how the discretization method of 1R tries to avoid overfitting. Give an example.
Distinguish "supervised discretization" from "unsupervised discretization". Which kind of discretization does OneR use? Explain your answer.
Apply the 1R pseudocode to the given dataset and determine the attribute and rules which are to be used for classification.
                                           tear     recommended
Age            Spectacle     astigmatism   rate     lenses
===============================================================
young          myope         no            reduced  none
young          myope         yes           normal   soft
young          myope         yes           normal   hard
young          hypermetrope  yes           reduced  none
Prepresbyopic  myope         no            reduced  none
Prepresbyopic  myope         no            normal   soft
Prepresbyopic  hypermetrope  yes           reduced  none
presbyopic     myope         no            reduced  none
presbyopic     hypermetrope  yes           normal   none
presbyopic     hypermetrope  no            normal   soft

Show all your working. Leave fractions as fractions.
Naive Bayes

State Bayes' rule and discuss how it can infer future beliefs as a function of past beliefs plus new evidence.
What are the drawbacks of the Naive Bayes method?
How are missing values handled in Naive Bayes?
Suppose we have a table of data:

Make        Size    Convertible  Type
Mitsubishi  small   yes          coup
Mitsubishi  medium  no           suv
Toyota      small   yes          coup
Toyota      large   no           coup
Toyota      large   no           suv
Benz        small   yes          coup
Benz        large   no           suv
BMW         small   yes          coup
BMW         medium  yes          coup
Ford        small   yes          coup
Ford        large   no           suv
Honda       small   no           coup

and we see a new example:

Make  Size    Convertible  Type
Ford  medium  no           ?

Calculate the following for the new example, given the database of old examples:

1. A = Likelihood of SUV:
2. B = Likelihood of Coup:
3. C = Probability of SUV:
4. D = Probability of coup:

Show all your working. Leave fractions as fractions.