
Data Mining
Practical Machine Learning Tools and Techniques

Slides for Chapter 7 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Data transformations

Attribute selection
  Scheme-independent, scheme-specific
Attribute discretization
  Unsupervised, supervised, error- vs. entropy-based, converse of discretization
Projections
  Principal component analysis, random projections, partial least squares, text, time series
Sampling
  Reservoir sampling
Dirty data
  Data cleansing, robust regression, anomaly detection
Transforming multiple classes to binary ones
  Simple approaches, error-correcting codes, ensembles of nested dichotomies
Calibrating class probabilities

Just apply a learner? NO!

Scheme/parameter selection
  Treat the selection process as part of the learning process

Modifying the input:
  Data engineering to make learning possible or easier

Modifying the output:
  Recalibrating probability estimates


Attribute selection

Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance
  Problem: attribute selection is based on smaller and smaller amounts of data
IBL is very susceptible to irrelevant attributes
  Number of training instances required increases exponentially with the number of irrelevant attributes
Naïve Bayes doesn't have this problem
Relevant attributes can also be harmful


Scheme-independent attribute selection

Filter approach: assess attributes based on general characteristics of the data
One method: find the smallest subset of attributes that separates the data
Another method: use a different learning scheme
  e.g. use attributes selected by C4.5 and 1R, or coefficients of a linear model, possibly applied recursively (recursive feature elimination)
IBL-based attribute weighting techniques:
  can't find redundant attributes (but a fix has been suggested)
Correlation-based Feature Selection (CFS):
  correlation between attributes measured by symmetric uncertainty:

    U(A,B) = 2 [H(A) + H(B) - H(A,B)] / [H(A) + H(B)]    with U(A,B) in [0,1]

  goodness of a subset of attributes measured by (breaking ties in favor of smaller subsets):

    sum_j U(A_j, C) / sqrt( sum_i sum_j U(A_i, A_j) )
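
As an illustration (not Weka's implementation), a minimal Python sketch of these two measures for nominal attributes; the function names are made up here:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(a, b):
    """U(A,B) = 2 [H(A) + H(B) - H(A,B)] / [H(A) + H(B)], always in [0,1]."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))          # joint entropy H(A,B)
    return 0.0 if h_a + h_b == 0 else 2.0 * (h_a + h_b - h_ab) / (h_a + h_b)

def cfs_merit(attributes, cls):
    """CFS goodness of an attribute subset:
    sum_j U(A_j, C) / sqrt(sum_i sum_j U(A_i, A_j))."""
    num = sum(symmetric_uncertainty(a, cls) for a in attributes)
    den = np.sqrt(sum(symmetric_uncertainty(ai, aj)
                      for ai in attributes for aj in attributes))
    return num / den
```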

Attribute subsets for the weather data


Searching the attribute space

Number of attribute subsets is exponential in the number of attributes
Common greedy approaches (a forward-selection sketch follows below):
  forward selection
  backward elimination
More sophisticated strategies:
  Bidirectional search
  Best-first search: can find the optimum solution
  Beam search: approximation to best-first search
  Genetic algorithms
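
A minimal sketch of greedy forward selection; `evaluate(subset)` stands for any subset-quality measure (for instance the CFS merit above, or a wrapper's cross-validated accuracy) and is an assumption of this sketch, not part of the slides:

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection: start from the empty subset and repeatedly add
    the single attribute that most improves the subset score; stop when no
    addition helps."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        score, candidate = max(((evaluate(selected + [a]), a) for a in remaining),
                               key=lambda pair: pair[0])
        if score <= best_score:      # no improvement: stop the greedy search
            break
        best_score = score
        selected.append(candidate)
        remaining.remove(candidate)
    return selected
```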


Scheme-specific selection

Wrapper approach to attribute selection
  Implement a "wrapper" around the learning scheme
  Evaluation criterion: cross-validation performance
Time consuming
  greedy approach: k attributes ⇒ k² × time
  prior ranking of attributes ⇒ linear in k
Can use a significance test to stop cross-validation for a subset early if it is unlikely to "win" (race search)
  can be used with forward or backward selection, prior ranking, or special-purpose schemata search
Learning decision tables: scheme-specific attribute selection essential
Efficient for decision tables and Naïve Bayes

Attribute discretization

Avoids the normality assumption in Naïve Bayes and clustering
1R: uses a simple discretization scheme
C4.5 performs local discretization
Global discretization can be advantageous because it's based on more data
Apply the learner to
  the k-valued discretized attribute, or to
  k-1 binary attributes that code the cut points


Discretization: unsupervised

Determine intervals without knowing the class labels
  When clustering, the only possible way!
Two strategies:
  Equal-interval binning
  Equal-frequency binning (also called histogram equalization)
Normally inferior to supervised schemes in classification tasks
  But equal-frequency binning works well with naïve Bayes if the number of intervals is set to the square root of the size of the dataset (proportional k-interval discretization)
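
A rough NumPy sketch of both unsupervised strategies (illustrative code, not the book's filters); leaving k unset in the second function gives proportional k-interval discretization:

```python
import numpy as np

def equal_width_bins(values, k):
    """Equal-interval binning: k intervals of equal width over the value range."""
    values = np.asarray(values, dtype=float)
    cuts = np.linspace(values.min(), values.max(), k + 1)[1:-1]   # k-1 cut points
    return np.digitize(values, cuts), cuts

def equal_frequency_bins(values, k=None):
    """Equal-frequency binning; with k = sqrt(n) this is proportional k-interval
    discretization, the setting that works well with naive Bayes."""
    values = np.asarray(values, dtype=float)
    if k is None:
        k = max(1, int(round(np.sqrt(len(values)))))
    cuts = np.unique(np.quantile(values, np.linspace(0, 1, k + 1)[1:-1]))
    return np.digitize(values, cuts), cuts
```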


Discretization: supervised

Entropy-based method
Build a decision tree with pre-pruning on the attribute being discretized
  Use entropy as the splitting criterion
  Use the minimum description length principle as the stopping criterion
Works well: the state of the art
To apply the minimum description length principle:
  The "theory" is
    the splitting point (log2[N-1] bits)
    plus the class distribution in each subset
  Compare description lengths before/after adding a split

Example: temperature attribute

Temperature:  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play:         Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No


Formula for MDLP

N instances
Original set:   k  classes, entropy E
First subset:   k1 classes, entropy E1
Second subset:  k2 classes, entropy E2

A split is accepted only if:

  gain > log2(N-1)/N + [log2(3^k - 2) - k E + k1 E1 + k2 E2] / N

Results in no discretization intervals for the temperature attribute
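
A sketch of recursive entropy-based discretization with this MDL stopping rule, assuming numeric values and nominal class labels (illustrative code, not Weka's filter):

```python
import numpy as np
from collections import Counter

def class_entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mdl_discretize(values, labels):
    """Return sorted cut points found by recursively choosing the minimum-entropy
    split and accepting it only when the MDL criterion above is satisfied."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    cuts = []

    def split(lo, hi):
        v, y = values[lo:hi], labels[lo:hi]
        N, E, k = len(y), class_entropy(y), len(set(y))
        best = None
        for i in range(1, N):
            if v[i] == v[i - 1]:
                continue                      # only cut between distinct values
            e1, e2 = class_entropy(y[:i]), class_entropy(y[i:])
            info = (i * e1 + (N - i) * e2) / N
            if best is None or info < best[0]:
                best = (info, i, e1, e2)
        if best is None:
            return
        info, i, e1, e2 = best
        gain = E - info
        k1, k2 = len(set(y[:i])), len(set(y[i:]))
        delta = np.log2(3 ** k - 2) - (k * E - k1 * e1 - k2 * e2)
        if gain > (np.log2(N - 1) + delta) / N:   # MDL criterion accepts the split
            cuts.append((v[i - 1] + v[i]) / 2.0)
            split(lo, lo + i)
            split(lo + i, hi)

    split(0, len(values))
    return sorted(cuts)
```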


Supervised discretization: other methods

Can replace the top-down procedure by a bottom-up method
Can replace MDLP by a chi-squared test
Can use dynamic programming to find the optimum k-way split for a given additive criterion
  Requires time quadratic in the number of instances
  But can be done in linear time if error rate is used instead of entropy


Error-based vs. entropy-based

Question: could the best discretization ever have two adjacent intervals with the same class?
Wrong answer: No. For if so,
  Collapse the two
  Free up an interval
  Use it somewhere else
  (This is what error-based discretization will do)
Right answer: Surprisingly, yes.
  (and entropy-based discretization can do it)


Error-based vs. entropy-based

A 2-class, 2-attribute problem

Entropy-based discretization can detect a change of class distribution

The converse of discretization

Make nominal values into numeric ones
1. Indicator attributes (used by IB1)
   Makes no use of potential ordering information
2. Code an ordered nominal attribute into binary ones (used by M5')
   Can be used for any ordered attribute
   Better than coding the ordering into an integer (which implies a metric)
In general: code a subset of attribute values as binary
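
A small sketch of both encodings; `indicator_encode` and `ordered_encode` are illustrative names, not from the book's software:

```python
import numpy as np

def indicator_encode(values, categories):
    """Indicator (one-hot) attributes: one binary attribute per nominal value."""
    return np.array([[1 if v == c else 0 for c in categories] for v in values])

def ordered_encode(values, ordered_categories):
    """Code an ordered nominal attribute as k-1 binary attributes:
    bit j is 1 if the value comes after the j-th category in the ordering."""
    rank = {c: i for i, c in enumerate(ordered_categories)}
    k = len(ordered_categories)
    return np.array([[1 if rank[v] > j else 0 for j in range(k - 1)] for v in values])
```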


Projections

Simple transformations can often make a large difference in performance
Example transformations (not necessarily for performance improvement):
  Difference of two date attributes
  Ratio of two numeric (ratio-scale) attributes
  Concatenating the values of nominal attributes
  Encoding cluster membership
  Adding noise to data
  Removing data randomly or selectively
  Obfuscating the data


Principal component analysis

Method for identifying the important "directions" in the data
Can rotate the data into a (reduced) coordinate system that is given by those directions
Algorithm:
1. Find the direction (axis) of greatest variance
2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat
Implementation: find eigenvectors of the covariance matrix by diagonalization
  Eigenvectors (sorted by eigenvalues) are the directions
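
A compact sketch of this procedure via the eigen-decomposition of the covariance matrix (standardizing first, as the next slide notes; illustrative code only):

```python
import numpy as np

def pca(X, n_components=None):
    """PCA via eigen-decomposition of the covariance matrix.
    Returns (components, eigenvalues, transformed data)."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize (assumes no constant column)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:
        eigvecs, eigvals = eigvecs[:, :n_components], eigvals[:n_components]
    return eigvecs, eigvals, X @ eigvecs
```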


Example: 10-dimensional data

Can transform the data into the space given by the components

Data is normally standardized for PCA

Could also apply this recursively in a tree learner

Random projections

PCA is nice but expensive: cubic in the number of attributes
Alternative: use random directions (projections) instead of principal components
Surprising: random projections preserve distance relationships quite well (on average)
  Can use them to apply kD-trees to high-dimensional data
  Can improve stability by using an ensemble of models based on different projections
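
A sketch of the idea, assuming a Gaussian projection matrix (one common choice; the slides do not prescribe a particular distribution):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project d-dimensional data onto k random directions.  A Gaussian random
    matrix scaled by 1/sqrt(k) approximately preserves pairwise distances on
    average (Johnson-Lindenstrauss style)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R
```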

Partial least-squares regression

PCA is often a pre-processing step before applying a learning algorithm
  When linear regression is applied, the resulting model is known as principal components regression
  Output can be re-expressed in terms of the original attributes
Partial least squares differs from PCA in that it takes the class attribute into account
  Finds directions that have high variance and are strongly correlated with the class


Algorithm

1. Start with standardized input attributes
2. Attribute coefficients of the first PLS direction:
   Compute the dot product between each attribute vector and the class vector in turn
3. Coefficients for the next PLS direction:
   Original attribute values are first replaced by the difference (residual) between the attribute's value and the prediction from a simple univariate regression that uses the previous PLS direction as a predictor of that attribute
   Compute the dot product between each attribute's residual vector and the class vector in turn
4. Repeat from 3
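
A sketch that follows these four steps literally (names are illustrative; it does nothing beyond what the steps above describe):

```python
import numpy as np

def pls_directions(X, y, n_directions):
    """Each direction's coefficients are the dot products of the current attribute
    vectors with the class vector; after each step the attributes are replaced by
    their residuals from a univariate regression on the direction just found."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: standardize
    y = np.asarray(y, dtype=float)
    directions, scores = [], []
    for _ in range(n_directions):
        w = X.T @ y                              # steps 2/3: dot products with the class
        t = X @ w                                # the PLS direction's values (scores)
        for j in range(X.shape[1]):              # replace each attribute by its residual
            b = (t @ X[:, j]) / (t @ t)          # univariate regression of X_j on t
            X[:, j] = X[:, j] - b * t
        directions.append(w)
        scores.append(t)
    return np.array(directions), np.column_stack(scores)
```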


Text to attribute vectors

Many data mining applications involve textual data (e.g. string attributes in ARFF)
Standard transformation: convert the string into a bag of words by tokenization
Attribute values are binary, word frequencies (f_ij), log(1 + f_ij), or TF-IDF:

  f_ij × log( #documents / #documents that include word i )

Only retain alphabetic sequences?
What should be used as delimiters?
Should words be converted to lowercase?
Should stopwords be ignored?
Should hapax legomena be included? Or even just the k most frequent words?
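
A small sketch of the bag-of-words transformation with TF-IDF weights; the tokenization choices here (lowercase, alphabetic sequences only) are just one answer to the questions above:

```python
import math
import re
from collections import Counter

def tfidf_vectors(documents):
    """Bag of words with TF-IDF weights: f_ij * log(#docs / #docs containing word i)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(documents)
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, vectors
```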

Time series

In time series data, each instance represents a different time step
Some simple transformations (sketched below):
  Shift values from the past/future
  Compute the difference (delta) between instances (i.e. a "derivative")
In some datasets, samples are not regular but time is given by a timestamp attribute
  Need to normalize by the step size when transforming
Transformations need to be adapted if attributes represent different time steps
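
The two simple transformations as a sketch, assuming a one-dimensional numeric series and lag >= 1 (illustrative code only):

```python
import numpy as np

def shift(values, lag):
    """Shift a series so each instance also sees the value `lag` steps in the past
    (padded with NaN at the start; assumes lag >= 1)."""
    values = np.asarray(values, dtype=float)
    return np.concatenate([np.full(lag, np.nan), values[:-lag]])

def delta(values, timestamps=None):
    """Difference between consecutive instances; if timestamps are given,
    normalize by the step size (a discrete 'derivative')."""
    values = np.asarray(values, dtype=float)
    d = np.diff(values)
    if timestamps is not None:
        d = d / np.diff(np.asarray(timestamps, dtype=float))
    return np.concatenate([[np.nan], d])
```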

Sampling

Sampling is typically a simple procedure
What if training instances arrive one by one but we don't know the total number in advance?
  Or perhaps there are so many that it is impractical to store them all before sampling?
Is it possible to produce a uniformly random sample of a fixed size? Yes.
Reservoir sampling
  Fill the reservoir, of size r, with the first r instances to arrive
  Subsequent instances replace a randomly selected reservoir element with probability r/i, where i is the number of instances seen so far
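
The procedure as a short sketch:

```python
import random

def reservoir_sample(stream, r, seed=None):
    """Reservoir sampling: maintain a uniformly random sample of size r from a
    stream whose length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, instance in enumerate(stream, start=1):
        if i <= r:
            reservoir.append(instance)              # fill the reservoir first
        elif rng.random() < r / i:                  # keep with probability r/i
            reservoir[rng.randrange(r)] = instance  # replace a random element
    return reservoir
```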

Automatic data cleansing

To improve a decision tree:
  Remove misclassified instances, then re-learn!
Better (of course!):
  Human expert checks misclassified instances
Attribute noise vs. class noise
  Attribute noise should be left in the training set
  (don't train on a clean set and test on a dirty one)
  Systematic class noise (e.g. one class substituted for another): leave in the training set
  Unsystematic class noise: eliminate from the training set, if possible


Robust regression

"Robust" statistical method: one that addresses the problem of outliers
To make regression more robust:
  Minimize absolute error, not squared error
  Remove outliers (e.g. 10% of points farthest from the regression plane)
  Minimize the median instead of the mean of squares (copes with outliers in the x and y directions; sketched below)
    Finds the narrowest strip covering half the observations
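
Least median of squares has no closed form; a common randomized approximation, sketched here for a single input attribute (not the exact algorithm discussed in the book):

```python
import numpy as np

def least_median_of_squares(x, y, n_trials=1000, seed=0):
    """Repeatedly fit a line through two random points and keep the line whose
    squared residuals have the smallest median."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                          # skip vertical candidate lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = np.median((y - (slope * x + intercept)) ** 2)
        if best is None or med < best[0]:
            best = (med, slope, intercept)
    return best[1], best[2]                   # slope, intercept
```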


Example: least median of squares

Number of international phone calls from Belgium, 1950-1973


Detecting anomalies

Visualization can help to detect anomalies
Automatic approach: committee of different learning schemes, e.g.
  decision tree
  nearest-neighbor learner
  linear discriminant function
Conservative approach: delete instances incorrectly classified by all of them (sketched below)
Problem: might sacrifice instances of small classes
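
A sketch of this committee filter using scikit-learn; the particular learners, the 10-fold cross-validation, and the assumption that X and y are NumPy arrays are choices of this example, not of the slides:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def consensus_filter(X, y):
    """Conservative cleansing: delete only those instances that every committee
    member misclassifies (predictions taken from 10-fold cross-validation)."""
    committee = [DecisionTreeClassifier(),
                 KNeighborsClassifier(n_neighbors=1),
                 LinearDiscriminantAnalysis()]
    predictions = [cross_val_predict(m, X, y, cv=10) for m in committee]
    wrong_by_all = np.all([p != y for p in predictions], axis=0)
    return X[~wrong_by_all], y[~wrong_by_all]
```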


One-Class Learning

Usually training data is available for all classes
Some problems exhibit only a single class at training time
  Test instances may belong to this class or to a new class not present at training time
This setting is called one-class classification
  Predict either "target" or "unknown"
Some problems can be reformulated into two-class ones
Other applications truly don't have negative data
  e.g. password hardening

Outlier detection

One-class classification is often called outlier/novelty detection
Generic approach: identify outliers as instances that lie beyond a distance d from a percentage p of the training data
Alternatively, estimate the density of the target class and mark low-probability test instances as outliers
  The threshold can be adjusted to obtain a suitable rate of outliers


Generating artificial data

Another possibility is to generate artificial data for the "outlier" class
  Can then apply any off-the-shelf classifier
  Can tune the rejection-rate threshold if the classifier produces probability estimates
Generate uniformly random data?
  Too much will overwhelm the target class!
  Can be avoided if learning accurate probabilities rather than minimizing classification error
Curse of dimensionality: as the number of attributes increases it becomes infeasible to generate enough data to get good coverage of the space


Generating artificial data

Generate data that is close to the target class
  No longer uniformly distributed: must take this distribution into account when computing membership scores for the one-class model
T: target class, A: artificial class. We want Pr[X | T] for any instance X; we know Pr[X | A]
Combine some amount of A with the instances of T and use a class probability estimator to estimate Pr[T | X]; then, by Bayes' rule:

  Pr[X | T] = ( (1 - Pr[T]) Pr[T | X] ) / ( Pr[T] (1 - Pr[T | X]) ) × Pr[X | A]

For classification, choose a threshold to tune the rejection rate
How to choose Pr[X | A]? Apply a density estimator to the target class and use the resulting function to model the artificial class
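
The rearranged Bayes rule as a one-liner (function and parameter names are illustrative; p_t is the proportion Pr[T] of target instances in the combined training set):

```python
def target_score(p_t_given_x, p_t, p_x_given_a):
    """Pr[X|T] recovered from a class probability estimator trained on the
    combined target (T) + artificial (A) data, via the rearranged Bayes rule."""
    return ((1.0 - p_t) * p_t_given_x) / (p_t * (1.0 - p_t_given_x)) * p_x_given_a
```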

Transforming multiple classes to binary ones

Some learning algorithms only work with two-class problems
  Sophisticated multi-class variants exist in many cases but can be very slow or difficult to implement
A common alternative is to transform multi-class problems into multiple two-class ones
Simple methods:
  Discriminate each class against the union of the others ("one vs. rest")
  Build a classifier for every pair of classes ("pairwise classification")

Error-correcting output codes

Multiclass problem ⇒ binary problems

Simple one-vs.-rest scheme: one-per-class coding

  class   class vector
  a       1 0 0 0
  b       0 1 0 0
  c       0 0 1 0
  d       0 0 0 1

Idea: use error-correcting codes instead
Use code words that have a large Hamming distance between any pair

  class   class vector
  a       1 1 1 1 1 1 1
  b       0 0 0 0 1 1 1
  c       0 0 1 1 0 0 1
  d       0 1 0 1 0 1 0

Base classifiers predict 1011111, true class = ?? (see the decoding sketch below)
Can correct up to (d-1)/2 single-bit errors
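
Decoding simply picks the class whose code word is closest in Hamming distance to the base classifiers' output vector; a sketch using the 7-bit code above:

```python
def ecoc_decode(predicted_bits, code_words):
    """Return the class whose code word has the smallest Hamming distance
    to the vector of base-classifier outputs."""
    def hamming(u, v):
        return sum(b1 != b2 for b1, b2 in zip(u, v))
    return min(code_words, key=lambda cls: hamming(predicted_bits, code_words[cls]))

code_words = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}
print(ecoc_decode("1011111", code_words))   # -> 'a' (only one bit differs from a's word)
```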

More on ECOCs

Two criteria:
  Row separation: minimum distance between rows
  Column separation: minimum distance between columns (and columns' complements)
    Why? Because if columns are identical, base classifiers will likely make the same errors
    Error correction is weakened if errors are correlated
3 classes ⇒ only 2^3 = 8 possible columns
  (and 4 out of the 8 are complements)
  Cannot achieve row and column separation
Only works for problems with more than 3 classes

Exhaustive ECOCs

Exhaustive code for k classes:
  Columns comprise every possible k-string except for complements and the all-zero/all-one strings
  Each code word contains 2^(k-1) - 1 bits
Class 1: code word is all ones
Class 2: 2^(k-2) zeroes followed by 2^(k-2) - 1 ones
Class i: alternating runs of 2^(k-i) 0s and 1s
  last run is one short

Exhaustive code for k = 4:

  class   class vector
  a       1 1 1 1 1 1 1
  b       0 0 0 0 1 1 1
  c       0 0 1 1 0 0 1
  d       0 1 0 1 0 1 0
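
A sketch that builds the code words from this run-length description (class 1 is handled as the special all-ones case):

```python
def exhaustive_code(k):
    """Exhaustive ECOC code words for k classes, each 2^(k-1) - 1 bits long."""
    length = 2 ** (k - 1) - 1
    words = ["1" * length]                       # class 1: all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)
        bits, value = "", "0"
        while len(bits) < length:
            bits += value * run                  # alternating runs of 2^(k-i) 0s and 1s
            value = "1" if value == "0" else "0"
        words.append(bits[:length])              # truncation leaves the last run one short
    return words

print(exhaustive_code(4))   # ['1111111', '0000111', '0011001', '0101010']
```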


More on ECOCs

More classes ⇒ exhaustive codes infeasible
  Number of columns increases exponentially
Random code words have good error-correcting properties on average!
There are sophisticated methods for generating ECOCs with just a few columns
ECOCs don't work with the NN classifier
  But: they do work if different attribute subsets are used to predict each output bit


Ensembles of nested dichotomies

ECOCs produce classifications, but what if we want class probability estimates as well?
  e.g. for cost-sensitive classification via minimum expected cost
Nested dichotomies
  Decompose the multiclass problem into binary ones
  Work with two-class classifiers that can produce class probability estimates
  Recursively split the full set of classes into smaller and smaller subsets, while splitting the full dataset of instances into subsets corresponding to these subsets of classes
  Yields a binary tree of classes, called a "nested dichotomy"

Example

Full set of classes: [a, b, c, d]
Two disjoint subsets: [a, b]  [c, d]
Then: [a] [b]  and  [c] [d]

Nested dichotomy as a code matrix:

  class   class vector
  a       0 0 X
  b       0 1 X
  c       1 X 0
  d       1 X 1


Probability estimation

Suppose we want to compute Pr[a | x]?
Learn two-class models for each of the three internal nodes
From the two-class model at the root: Pr[{a,b} | x]
From the left-hand child of the root: Pr[{a} | x, {a,b}]
Using the chain rule:
  Pr[{a} | x] = Pr[{a} | {a,b}, x] × Pr[{a,b} | x]
Issues
  Estimation errors for deep hierarchies
  How to decide on the hierarchical decomposition of classes?
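
A sketch of the chain-rule computation for an arbitrary nested dichotomy; the tree representation (nested tuples) and the `models` mapping are assumptions of this example, not the book's data structures:

```python
def _classes(node):
    """Flatten a (possibly nested) tuple of class labels into a set."""
    if isinstance(node, tuple):
        return _classes(node[0]) | _classes(node[1])
    return {node}

def nested_dichotomy_probability(cls, x, tree, models):
    """Chain-rule class probability in a nested dichotomy.  `models` maps each
    internal node to a function returning Pr[left subset | x] at that node."""
    node, prob = tree, 1.0
    while isinstance(node, tuple):              # descend until a single class is left
        left, right = node
        p_left = models[node](x)                # two-class model at this internal node
        if cls in _classes(left):
            prob, node = prob * p_left, left
        else:
            prob, node = prob * (1.0 - p_left), right
    return prob

# The dichotomy from the example: [a,b] vs [c,d], then [a] vs [b] and [c] vs [d].
tree = (("a", "b"), ("c", "d"))
models = {tree: lambda x: 0.7, ("a", "b"): lambda x: 0.6, ("c", "d"): lambda x: 0.5}
print(nested_dichotomy_probability("a", None, tree, models))   # 0.7 * 0.6 = 0.42
```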


Ensembles of nested dichotomies

If there is no reason a priori to prefer any particular decomposition, then use them all
  Impractical for any non-trivial number of classes
Consider a subset, by taking a random sample of possible tree structures
  Caching of models (since a given two-class problem may occur in multiple trees)
  Average probability estimates over the trees
Experiments show that this approach yields accurate multiclass classifiers
  Can even improve the performance of methods that can already handle multiclass problems!

Calibrating class probabilities

Class probability estimation is harder than classification
  Classification error is minimized as long as the correct class is predicted with maximum probability
  Estimates that yield correct classifications may be quite poor with respect to quadratic or informational loss
Often important to have accurate class probabilities
  e.g. cost-sensitive prediction using the minimum expected cost method


Calibrating class probabilities

Consider a two-class problem. Probabilities that are correct for classification may be:
  Too optimistic: too close to either 0 or 1
  Too pessimistic: not close enough to 0 or 1

[Figure: reliability diagram showing over-optimistic probability estimation for a two-class problem]


Calibrating class probabilities

Reliability diagram generated by collecting predicted probabilities and relative frequencies from a 10-fold cross-validation
Predicted probabilities are discretized into 20 ranges via equal-frequency discretization
Correct the bias by using post-hoc calibration to map the observed curve onto the diagonal
A rough approach can use the data from the reliability diagram directly
  Discretization-based calibration is fast...
  But determining the appropriate number of discretization intervals is not easy


Calibrating class probabilities

View calibration as a function estimation problem
  One input (the estimated class probability) and one output (the calibrated probability)
Assuming the function is piecewise constant and monotonically increasing:
  Isotonic regression minimizes the squared error between the observed class "probabilities" (0/1) and the resulting calibrated class probabilities
Alternatively, use logistic regression to estimate the calibration function
  Must use the log-odds of the estimated class probabilities as input
  Multiclass logistic regression can be used for calibration in the multiclass case
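
A sketch of both calibration options using scikit-learn; the library choice is an assumption of this example, `predicted` holds the uncalibrated estimates and `observed` the 0/1 class indicators (e.g. collected via cross-validation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def isotonic_calibration(predicted, observed):
    """Fit a monotonically increasing, piecewise-constant map from predicted
    probabilities to calibrated ones."""
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(predicted, observed)
    return calibrator.predict            # callable: new estimates -> calibrated values

def logistic_calibration(predicted, observed, eps=1e-6):
    """Logistic calibration on the log-odds of the estimated probabilities."""
    p = np.clip(np.asarray(predicted, float), eps, 1 - eps)
    log_odds = np.log(p / (1 - p)).reshape(-1, 1)
    model = LogisticRegression().fit(log_odds, observed)
    def calibrate(new_p):
        new_p = np.clip(np.asarray(new_p, float), eps, 1 - eps)
        return model.predict_proba(np.log(new_p / (1 - new_p)).reshape(-1, 1))[:, 1]
    return calibrate
```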
