
Data Mining
Practical Machine Learning Tools and Techniques

Slides for Chapter 7 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Data transformations

Attribute selection
  Scheme-independent, scheme-specific
Attribute discretization
  Unsupervised, supervised, error- vs. entropy-based, converse of discretization
Projections
  Principal component analysis, random projections, partial least squares, text, time series
Sampling
  Reservoir sampling
Dirty data
  Data cleansing, robust regression, anomaly detection
Transforming multiple classes to binary ones
  Simple approaches, error-correcting codes, ensembles of nested dichotomies
Calibrating class probabilities

Just apply a learner? NO!

Scheme/parameter selection
  Treat the selection process as part of the learning process

Modifying the input:
  Data engineering to make learning possible or easier

Modifying the output:
  Recalibrating probability estimates


Attribute selection

Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance
  Problem: attribute selection is based on smaller and smaller amounts of data
IBL is very susceptible to irrelevant attributes
  Number of training instances required increases exponentially with the number of irrelevant attributes
Naïve Bayes doesn't have this problem
Relevant attributes can also be harmful


Scheme-independent attribute selection

Filter approach: assess attributes based on general characteristics of the data
One method: find the smallest subset of attributes that separates the data
Another method: use a different learning scheme
  e.g. use attributes selected by C4.5 and 1R, or coefficients of a linear model, possibly applied recursively (recursive feature elimination)
IBL-based attribute weighting techniques:
  can't find redundant attributes (but a fix has been suggested)
Correlation-based Feature Selection (CFS):
  correlation between attributes measured by symmetric uncertainty:

    U(A,B) = 2 [H(A) + H(B) - H(A,B)] / [H(A) + H(B)]    with U(A,B) in [0,1]

  goodness of a subset of attributes measured by (breaking ties in favor of smaller subsets):

    sum_j U(A_j, C) / sqrt( sum_i sum_j U(A_i, A_j) )
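
As an illustration (not Weka's implementation), a minimal Python sketch of these two measures for nominal attributes; the function names are made up here:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(a, b):
    """U(A,B) = 2 [H(A) + H(B) - H(A,B)] / [H(A) + H(B)], always in [0,1]."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))          # joint entropy H(A,B)
    return 0.0 if h_a + h_b == 0 else 2.0 * (h_a + h_b - h_ab) / (h_a + h_b)

def cfs_merit(attributes, cls):
    """CFS goodness of an attribute subset:
    sum_j U(A_j, C) / sqrt(sum_i sum_j U(A_i, A_j))."""
    num = sum(symmetric_uncertainty(a, cls) for a in attributes)
    den = np.sqrt(sum(symmetric_uncertainty(ai, aj)
                      for ai in attributes for aj in attributes))
    return num / den
```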

Attribute subsets for the weather data


Searching the attribute space

Number of attribute subsets is exponential in the number of attributes
Common greedy approaches (a forward-selection sketch follows below):
  forward selection
  backward elimination
More sophisticated strategies:
  Bidirectional search
  Best-first search: can find the optimum solution
  Beam search: approximation to best-first search
  Genetic algorithms
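
A minimal sketch of greedy forward selection; `evaluate(subset)` stands for any subset-quality measure (for instance the CFS merit above, or a wrapper's cross-validated accuracy) and is an assumption of this sketch, not part of the slides:

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection: start from the empty subset and repeatedly add
    the single attribute that most improves the subset score; stop when no
    addition helps."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        score, candidate = max(((evaluate(selected + [a]), a) for a in remaining),
                               key=lambda pair: pair[0])
        if score <= best_score:      # no improvement: stop the greedy search
            break
        best_score = score
        selected.append(candidate)
        remaining.remove(candidate)
    return selected
```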


Scheme-specific selection

Wrapper approach to attribute selection
  Implement a "wrapper" around the learning scheme
  Evaluation criterion: cross-validation performance
Time consuming
  greedy approach: k attributes ⇒ k² × time
  prior ranking of attributes ⇒ linear in k
Can use a significance test to stop cross-validation for a subset early if it is unlikely to "win" (race search)
  can be used with forward or backward selection, prior ranking, or special-purpose schemata search
Learning decision tables: scheme-specific attribute selection essential
Efficient for decision tables and Naïve Bayes

Attribute discretization

Avoids the normality assumption in Naïve Bayes and clustering
1R: uses a simple discretization scheme
C4.5 performs local discretization
Global discretization can be advantageous because it's based on more data
Apply the learner to
  the k-valued discretized attribute, or to
  k-1 binary attributes that code the cut points


Discretization: unsupervised

Determine intervals without knowing the class labels
  When clustering, the only possible way!
Two strategies:
  Equal-interval binning
  Equal-frequency binning (also called histogram equalization)
Normally inferior to supervised schemes in classification tasks
  But equal-frequency binning works well with naïve Bayes if the number of intervals is set to the square root of the size of the dataset (proportional k-interval discretization)
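
A rough NumPy sketch of both unsupervised strategies (illustrative code, not the book's filters); leaving k unset in the second function gives proportional k-interval discretization:

```python
import numpy as np

def equal_width_bins(values, k):
    """Equal-interval binning: k intervals of equal width over the value range."""
    values = np.asarray(values, dtype=float)
    cuts = np.linspace(values.min(), values.max(), k + 1)[1:-1]   # k-1 cut points
    return np.digitize(values, cuts), cuts

def equal_frequency_bins(values, k=None):
    """Equal-frequency binning; with k = sqrt(n) this is proportional k-interval
    discretization, the setting that works well with naive Bayes."""
    values = np.asarray(values, dtype=float)
    if k is None:
        k = max(1, int(round(np.sqrt(len(values)))))
    cuts = np.unique(np.quantile(values, np.linspace(0, 1, k + 1)[1:-1]))
    return np.digitize(values, cuts), cuts
```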


Discretization: supervised

Entropy-based method
Build a decision tree with pre-pruning on the attribute being discretized
  Use entropy as the splitting criterion
  Use the minimum description length principle as the stopping criterion
Works well: the state of the art
To apply the minimum description length principle:
  The "theory" is
    the splitting point (log2[N-1] bits)
    plus the class distribution in each subset
  Compare description lengths before/after adding a split

Example: temperature attribute

Temperature:  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play:         Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No


Formula for MDLP

N instances
Original set:   k  classes, entropy E
First subset:   k1 classes, entropy E1
Second subset:  k2 classes, entropy E2

A split is accepted only if:

  gain > log2(N-1)/N + [log2(3^k - 2) - k E + k1 E1 + k2 E2] / N

Results in no discretization intervals for the temperature attribute
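
A sketch of recursive entropy-based discretization with this MDL stopping rule, assuming numeric values and nominal class labels (illustrative code, not Weka's filter):

```python
import numpy as np
from collections import Counter

def class_entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mdl_discretize(values, labels):
    """Return sorted cut points found by recursively choosing the minimum-entropy
    split and accepting it only when the MDL criterion above is satisfied."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    cuts = []

    def split(lo, hi):
        v, y = values[lo:hi], labels[lo:hi]
        N, E, k = len(y), class_entropy(y), len(set(y))
        best = None
        for i in range(1, N):
            if v[i] == v[i - 1]:
                continue                      # only cut between distinct values
            e1, e2 = class_entropy(y[:i]), class_entropy(y[i:])
            info = (i * e1 + (N - i) * e2) / N
            if best is None or info < best[0]:
                best = (info, i, e1, e2)
        if best is None:
            return
        info, i, e1, e2 = best
        gain = E - info
        k1, k2 = len(set(y[:i])), len(set(y[i:]))
        delta = np.log2(3 ** k - 2) - (k * E - k1 * e1 - k2 * e2)
        if gain > (np.log2(N - 1) + delta) / N:   # MDL criterion accepts the split
            cuts.append((v[i - 1] + v[i]) / 2.0)
            split(lo, lo + i)
            split(lo + i, hi)

    split(0, len(values))
    return sorted(cuts)
```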


Supervised discretization: other methods

Can replace the top-down procedure by a bottom-up method
Can replace MDLP by a chi-squared test
Can use dynamic programming to find the optimum k-way split for a given additive criterion
  Requires time quadratic in the number of instances
  But can be done in linear time if error rate is used instead of entropy


Error-based vs. entropy-based

Question: could the best discretization ever have two adjacent intervals with the same class?
Wrong answer: No. For if so,
  Collapse the two
  Free up an interval
  Use it somewhere else
  (This is what error-based discretization will do)
Right answer: Surprisingly, yes.
  (and entropy-based discretization can do it)


Error-based vs. entropy-based

A 2-class, 2-attribute problem

Entropy-based discretization can detect a change of class distribution

The converse of discretization

Make nominal values into numeric ones
1. Indicator attributes (used by IB1)
   Makes no use of potential ordering information
2. Code an ordered nominal attribute into binary ones (used by M5')
   Can be used for any ordered attribute
   Better than coding the ordering into an integer (which implies a metric)
In general: code a subset of attribute values as binary
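
A small sketch of both encodings; `indicator_encode` and `ordered_encode` are illustrative names, not from the book's software:

```python
import numpy as np

def indicator_encode(values, categories):
    """Indicator (one-hot) attributes: one binary attribute per nominal value."""
    return np.array([[1 if v == c else 0 for c in categories] for v in values])

def ordered_encode(values, ordered_categories):
    """Code an ordered nominal attribute as k-1 binary attributes:
    bit j is 1 if the value comes after the j-th category in the ordering."""
    rank = {c: i for i, c in enumerate(ordered_categories)}
    k = len(ordered_categories)
    return np.array([[1 if rank[v] > j else 0 for j in range(k - 1)] for v in values])
```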


Projections

Simple transformations can often make a large difference in performance
Example transformations (not necessarily for performance improvement):
  Difference of two date attributes
  Ratio of two numeric (ratio-scale) attributes
  Concatenating the values of nominal attributes
  Encoding cluster membership
  Adding noise to data
  Removing data randomly or selectively
  Obfuscating the data


Principal component analysis

Method for identifying the important "directions" in the data
Can rotate the data into a (reduced) coordinate system that is given by those directions
Algorithm:
1. Find the direction (axis) of greatest variance
2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat
Implementation: find eigenvectors of the covariance matrix by diagonalization
  Eigenvectors (sorted by eigenvalues) are the directions
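
A compact sketch of this procedure via the eigen-decomposition of the covariance matrix (standardizing first, as the next slide notes; illustrative code only):

```python
import numpy as np

def pca(X, n_components=None):
    """PCA via eigen-decomposition of the covariance matrix.
    Returns (components, eigenvalues, transformed data)."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize (assumes no constant column)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:
        eigvecs, eigvals = eigvecs[:, :n_components], eigvals[:n_components]
    return eigvecs, eigvals, X @ eigvecs
```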


Example: 10-dimensional data

Can transform the data into the space given by the components

Data is normally standardized for PCA

Could also apply this recursively in a tree learner

Random projections

PCA is nice but expensive: cubic in the number of attributes
Alternative: use random directions (projections) instead of principal components
Surprising: random projections preserve distance relationships quite well (on average)
  Can use them to apply kD-trees to high-dimensional data
  Can improve stability by using an ensemble of models based on different projections
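
A sketch of the idea, assuming a Gaussian projection matrix (one common choice; the slides do not prescribe a particular distribution):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project d-dimensional data onto k random directions.  A Gaussian random
    matrix scaled by 1/sqrt(k) approximately preserves pairwise distances on
    average (Johnson-Lindenstrauss style)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R
```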

Partial least-squares regression

PCA is often a pre-processing step before applying a learning algorithm
  When linear regression is applied, the resulting model is known as principal components regression
  Output can be re-expressed in terms of the original attributes
Partial least squares differs from PCA in that it takes the class attribute into account
  Finds directions that have high variance and are strongly correlated with the class


Algorithm

1. Start with standardized input attributes
2. Attribute coefficients of the first PLS direction:
   Compute the dot product between each attribute vector and the class vector in turn
3. Coefficients for the next PLS direction:
   Original attribute values are first replaced by the difference (residual) between the attribute's value and the prediction from a simple univariate regression that uses the previous PLS direction as a predictor of that attribute
   Compute the dot product between each attribute's residual vector and the class vector in turn
4. Repeat from 3
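
A sketch that follows these four steps literally (names are illustrative; it does nothing beyond what the steps above describe):

```python
import numpy as np

def pls_directions(X, y, n_directions):
    """Each direction's coefficients are the dot products of the current attribute
    vectors with the class vector; after each step the attributes are replaced by
    their residuals from a univariate regression on the direction just found."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: standardize
    y = np.asarray(y, dtype=float)
    directions, scores = [], []
    for _ in range(n_directions):
        w = X.T @ y                              # steps 2/3: dot products with the class
        t = X @ w                                # the PLS direction's values (scores)
        for j in range(X.shape[1]):              # replace each attribute by its residual
            b = (t @ X[:, j]) / (t @ t)          # univariate regression of X_j on t
            X[:, j] = X[:, j] - b * t
        directions.append(w)
        scores.append(t)
    return np.array(directions), np.column_stack(scores)
```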


Text to attribute vectors

Many data mining applications involve textual data (e.g. string attributes in ARFF)
Standard transformation: convert the string into a bag of words by tokenization
Attribute values are binary, word frequencies (f_ij), log(1 + f_ij), or TF-IDF:

  f_ij × log( #documents / #documents that include word i )

Only retain alphabetic sequences?
What should be used as delimiters?
Should words be converted to lowercase?
Should stopwords be ignored?
Should hapax legomena be included? Or even just the k most frequent words?
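
A small sketch of the bag-of-words transformation with TF-IDF weights; the tokenization choices here (lowercase, alphabetic sequences only) are just one answer to the questions above:

```python
import math
import re
from collections import Counter

def tfidf_vectors(documents):
    """Bag of words with TF-IDF weights: f_ij * log(#docs / #docs containing word i)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(documents)
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, vectors
```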

Time series

In time series data, each instance represents a different time step
Some simple transformations (sketched below):
  Shift values from the past/future
  Compute the difference (delta) between instances (i.e. a "derivative")
In some datasets, samples are not regular but time is given by a timestamp attribute
  Need to normalize by the step size when transforming
Transformations need to be adapted if attributes represent different time steps
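
The two simple transformations as a sketch, assuming a one-dimensional numeric series and lag >= 1 (illustrative code only):

```python
import numpy as np

def shift(values, lag):
    """Shift a series so each instance also sees the value `lag` steps in the past
    (padded with NaN at the start; assumes lag >= 1)."""
    values = np.asarray(values, dtype=float)
    return np.concatenate([np.full(lag, np.nan), values[:-lag]])

def delta(values, timestamps=None):
    """Difference between consecutive instances; if timestamps are given,
    normalize by the step size (a discrete 'derivative')."""
    values = np.asarray(values, dtype=float)
    d = np.diff(values)
    if timestamps is not None:
        d = d / np.diff(np.asarray(timestamps, dtype=float))
    return np.concatenate([[np.nan], d])
```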

Sampling

Sampling is typically a simple procedure
What if training instances arrive one by one but we don't know the total number in advance?
  Or perhaps there are so many that it is impractical to store them all before sampling?
Is it possible to produce a uniformly random sample of a fixed size? Yes.
Reservoir sampling
  Fill the reservoir, of size r, with the first r instances to arrive
  Subsequent instances replace a randomly selected reservoir element with probability r/i, where i is the number of instances seen so far
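
The procedure as a short sketch:

```python
import random

def reservoir_sample(stream, r, seed=None):
    """Reservoir sampling: maintain a uniformly random sample of size r from a
    stream whose length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, instance in enumerate(stream, start=1):
        if i <= r:
            reservoir.append(instance)              # fill the reservoir first
        elif rng.random() < r / i:                  # keep with probability r/i
            reservoir[rng.randrange(r)] = instance  # replace a random element
    return reservoir
```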

Automatic data cleansing

To improve a decision tree:
  Remove misclassified instances, then re-learn!
Better (of course!):
  Human expert checks misclassified instances
Attribute noise vs. class noise
  Attribute noise should be left in the training set
  (don't train on a clean set and test on a dirty one)
  Systematic class noise (e.g. one class substituted for another): leave in the training set
  Unsystematic class noise: eliminate from the training set, if possible


Robust regression

"Robust" statistical method: one that addresses the problem of outliers
To make regression more robust:
  Minimize absolute error, not squared error
  Remove outliers (e.g. 10% of points farthest from the regression plane)
  Minimize the median instead of the mean of squares (copes with outliers in the x and y directions; sketched below)
    Finds the narrowest strip covering half the observations
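
Least median of squares has no closed form; a common randomized approximation, sketched here for a single input attribute (not the exact algorithm discussed in the book):

```python
import numpy as np

def least_median_of_squares(x, y, n_trials=1000, seed=0):
    """Repeatedly fit a line through two random points and keep the line whose
    squared residuals have the smallest median."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                          # skip vertical candidate lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = np.median((y - (slope * x + intercept)) ** 2)
        if best is None or med < best[0]:
            best = (med, slope, intercept)
    return best[1], best[2]                   # slope, intercept
```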


Example: least median of squares

Number of international phone calls from Belgium, 1950-1973


Detecting anomalies

Visualization can help to detect anomalies
Automatic approach: committee of different learning schemes, e.g.
  decision tree
  nearest-neighbor learner
  linear discriminant function
Conservative approach: delete instances incorrectly classified by all of them (sketched below)
Problem: might sacrifice instances of small classes
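
A sketch of this committee filter using scikit-learn; the particular learners, the 10-fold cross-validation, and the assumption that X and y are NumPy arrays are choices of this example, not of the slides:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def consensus_filter(X, y):
    """Conservative cleansing: delete only those instances that every committee
    member misclassifies (predictions taken from 10-fold cross-validation)."""
    committee = [DecisionTreeClassifier(),
                 KNeighborsClassifier(n_neighbors=1),
                 LinearDiscriminantAnalysis()]
    predictions = [cross_val_predict(m, X, y, cv=10) for m in committee]
    wrong_by_all = np.all([p != y for p in predictions], axis=0)
    return X[~wrong_by_all], y[~wrong_by_all]
```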


One-Class Learning

Usually training data is available for all classes
Some problems exhibit only a single class at training time
  Test instances may belong to this class or to a new class not present at training time
This setting is called one-class classification
  Predict either "target" or "unknown"
Some problems can be reformulated into two-class ones
Other applications truly don't have negative data
  e.g. password hardening

Outlier detection

One-class classification is often called outlier/novelty detection
Generic approach: identify outliers as instances that lie beyond a distance d from a percentage p of the training data
Alternatively, estimate the density of the target class and mark low-probability test instances as outliers
  The threshold can be adjusted to obtain a suitable rate of outliers


Generating artificial data

Another possibility is to generate artificial data for the "outlier" class
  Can then apply any off-the-shelf classifier
  Can tune the rejection-rate threshold if the classifier produces probability estimates
Generate uniformly random data?
  Too much will overwhelm the target class!
  Can be avoided if learning accurate probabilities rather than minimizing classification error
Curse of dimensionality: as the number of attributes increases it becomes infeasible to generate enough data to get good coverage of the space


Generating artificial data

Generate data that is close to the target class
  No longer uniformly distributed: must take this distribution into account when computing membership scores for the one-class model
T: target class, A: artificial class. We want Pr[X | T] for any instance X; we know Pr[X | A]
Combine some amount of A with the instances of T and use a class probability estimator to estimate Pr[T | X]; then, by Bayes' rule:

  Pr[X | T] = ( (1 - Pr[T]) Pr[T | X] ) / ( Pr[T] (1 - Pr[T | X]) ) × Pr[X | A]

For classification, choose a threshold to tune the rejection rate
How to choose Pr[X | A]? Apply a density estimator to the target class and use the resulting function to model the artificial class
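
The rearranged Bayes rule as a one-liner (function and parameter names are illustrative; p_t is the proportion Pr[T] of target instances in the combined training set):

```python
def target_score(p_t_given_x, p_t, p_x_given_a):
    """Pr[X|T] recovered from a class probability estimator trained on the
    combined target (T) + artificial (A) data, via the rearranged Bayes rule."""
    return ((1.0 - p_t) * p_t_given_x) / (p_t * (1.0 - p_t_given_x)) * p_x_given_a
```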

Transforming multiple classes to binary ones

Some learning algorithms only work with two-class problems
  Sophisticated multi-class variants exist in many cases but can be very slow or difficult to implement
A common alternative is to transform multi-class problems into multiple two-class ones
Simple methods:
  Discriminate each class against the union of the others ("one vs. rest")
  Build a classifier for every pair of classes ("pairwise classification")

Error-correcting output codes

Multiclass problem ⇒ binary problems

Simple one-vs.-rest scheme: one-per-class coding

  class   class vector
  a       1 0 0 0
  b       0 1 0 0
  c       0 0 1 0
  d       0 0 0 1

Idea: use error-correcting codes instead
Use code words that have a large Hamming distance between any pair

  class   class vector
  a       1 1 1 1 1 1 1
  b       0 0 0 0 1 1 1
  c       0 0 1 1 0 0 1
  d       0 1 0 1 0 1 0

Base classifiers predict 1011111, true class = ?? (see the decoding sketch below)
Can correct up to (d-1)/2 single-bit errors
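
Decoding simply picks the class whose code word is closest in Hamming distance to the base classifiers' output vector; a sketch using the 7-bit code above:

```python
def ecoc_decode(predicted_bits, code_words):
    """Return the class whose code word has the smallest Hamming distance
    to the vector of base-classifier outputs."""
    def hamming(u, v):
        return sum(b1 != b2 for b1, b2 in zip(u, v))
    return min(code_words, key=lambda cls: hamming(predicted_bits, code_words[cls]))

code_words = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}
print(ecoc_decode("1011111", code_words))   # -> 'a' (only one bit differs from a's word)
```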

More on ECOCs

Two criteria:
  Row separation: minimum distance between rows
  Column separation: minimum distance between columns (and columns' complements)
    Why? Because if columns are identical, base classifiers will likely make the same errors
    Error correction is weakened if errors are correlated
3 classes ⇒ only 2^3 = 8 possible columns
  (and 4 out of the 8 are complements)
  Cannot achieve row and column separation
Only works for problems with more than 3 classes

Exhaustive ECOCs

Exhaustive code for k classes:
  Columns comprise every possible k-string except for complements and the all-zero/all-one strings
  Each code word contains 2^(k-1) - 1 bits
Class 1: code word is all ones
Class 2: 2^(k-2) zeroes followed by 2^(k-2) - 1 ones
Class i: alternating runs of 2^(k-i) 0s and 1s
  last run is one short

Exhaustive code for k = 4:

  class   class vector
  a       1 1 1 1 1 1 1
  b       0 0 0 0 1 1 1
  c       0 0 1 1 0 0 1
  d       0 1 0 1 0 1 0
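
A sketch that builds the code words from this run-length description (class 1 is handled as the special all-ones case):

```python
def exhaustive_code(k):
    """Exhaustive ECOC code words for k classes, each 2^(k-1) - 1 bits long."""
    length = 2 ** (k - 1) - 1
    words = ["1" * length]                       # class 1: all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)
        bits, value = "", "0"
        while len(bits) < length:
            bits += value * run                  # alternating runs of 2^(k-i) 0s and 1s
            value = "1" if value == "0" else "0"
        words.append(bits[:length])              # truncation leaves the last run one short
    return words

print(exhaustive_code(4))   # ['1111111', '0000111', '0011001', '0101010']
```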


More on ECOCs

More classes ⇒ exhaustive codes infeasible
  Number of columns increases exponentially
Random code words have good error-correcting properties on average!
There are sophisticated methods for generating ECOCs with just a few columns
ECOCs don't work with the NN classifier
  But: they do work if different attribute subsets are used to predict each output bit


Ensembles of nested dichotomies

ECOCs produce classifications, but what if we want class probability estimates as well?
  e.g. for cost-sensitive classification via minimum expected cost
Nested dichotomies
  Decompose the multiclass problem into binary ones
  Work with two-class classifiers that can produce class probability estimates
  Recursively split the full set of classes into smaller and smaller subsets, while splitting the full dataset of instances into subsets corresponding to these subsets of classes
  Yields a binary tree of classes, called a "nested dichotomy"

Example

Full set of classes: [a, b, c, d]
Two disjoint subsets: [a, b]  [c, d]
Then: [a] [b]  and  [c] [d]

Nested dichotomy as a code matrix:

  class   class vector
  a       0 0 X
  b       0 1 X
  c       1 X 0
  d       1 X 1


Probability estimation

Suppose we want to compute Pr[a | x]?
Learn two-class models for each of the three internal nodes
From the two-class model at the root: Pr[{a,b} | x]
From the left-hand child of the root: Pr[{a} | x, {a,b}]
Using the chain rule:
  Pr[{a} | x] = Pr[{a} | {a,b}, x] × Pr[{a,b} | x]
Issues
  Estimation errors for deep hierarchies
  How to decide on the hierarchical decomposition of classes?
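
A sketch of the chain-rule computation for an arbitrary nested dichotomy; the tree representation (nested tuples) and the `models` mapping are assumptions of this example, not the book's data structures:

```python
def _classes(node):
    """Flatten a (possibly nested) tuple of class labels into a set."""
    if isinstance(node, tuple):
        return _classes(node[0]) | _classes(node[1])
    return {node}

def nested_dichotomy_probability(cls, x, tree, models):
    """Chain-rule class probability in a nested dichotomy.  `models` maps each
    internal node to a function returning Pr[left subset | x] at that node."""
    node, prob = tree, 1.0
    while isinstance(node, tuple):              # descend until a single class is left
        left, right = node
        p_left = models[node](x)                # two-class model at this internal node
        if cls in _classes(left):
            prob, node = prob * p_left, left
        else:
            prob, node = prob * (1.0 - p_left), right
    return prob

# The dichotomy from the example: [a,b] vs [c,d], then [a] vs [b] and [c] vs [d].
tree = (("a", "b"), ("c", "d"))
models = {tree: lambda x: 0.7, ("a", "b"): lambda x: 0.6, ("c", "d"): lambda x: 0.5}
print(nested_dichotomy_probability("a", None, tree, models))   # 0.7 * 0.6 = 0.42
```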


Ensembles of nested dichotomies

If there is no reason a priori to prefer any particular decomposition, then use them all
  Impractical for any non-trivial number of classes
Consider a subset, by taking a random sample of possible tree structures
  Caching of models (since a given two-class problem may occur in multiple trees)
  Average probability estimates over the trees
Experiments show that this approach yields accurate multiclass classifiers
  Can even improve the performance of methods that can already handle multiclass problems!

Calibrating class probabilities

Class probability estimation is harder than classification
  Classification error is minimized as long as the correct class is predicted with maximum probability
  Estimates that yield correct classifications may be quite poor with respect to quadratic or informational loss
Often important to have accurate class probabilities
  e.g. cost-sensitive prediction using the minimum expected cost method


Calibrating class probabilities

Consider a two-class problem. Probabilities that are correct for classification may be:
  Too optimistic: too close to either 0 or 1
  Too pessimistic: not close enough to 0 or 1

[Figure: reliability diagram showing over-optimistic probability estimation for a two-class problem]


Calibrating class probabilities

Reliability diagram generated by collecting predicted probabilities and relative frequencies from a 10-fold cross-validation
Predicted probabilities are discretized into 20 ranges via equal-frequency discretization
Correct the bias by using post-hoc calibration to map the observed curve onto the diagonal
A rough approach can use the data from the reliability diagram directly
  Discretization-based calibration is fast...
  But determining the appropriate number of discretization intervals is not easy


Calibrating class probabilities

View calibration as a function estimation problem
  One input (the estimated class probability) and one output (the calibrated probability)
Assuming the function is piecewise constant and monotonically increasing:
  Isotonic regression minimizes the squared error between the observed class "probabilities" (0/1) and the resulting calibrated class probabilities
Alternatively, use logistic regression to estimate the calibration function
  Must use the log-odds of the estimated class probabilities as input
  Multiclass logistic regression can be used for calibration in the multiclass case
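
A sketch of both calibration options using scikit-learn; the library choice is an assumption of this example, `predicted` holds the uncalibrated estimates and `observed` the 0/1 class indicators (e.g. collected via cross-validation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def isotonic_calibration(predicted, observed):
    """Fit a monotonically increasing, piecewise-constant map from predicted
    probabilities to calibrated ones."""
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(predicted, observed)
    return calibrator.predict            # callable: new estimates -> calibrated values

def logistic_calibration(predicted, observed, eps=1e-6):
    """Logistic calibration on the log-odds of the estimated probabilities."""
    p = np.clip(np.asarray(predicted, float), eps, 1 - eps)
    log_odds = np.log(p / (1 - p)).reshape(-1, 1)
    model = LogisticRegression().fit(log_odds, observed)
    def calibrate(new_p):
        new_p = np.clip(np.asarray(new_p, float), eps, 1 - eps)
        return model.predict_proba(np.log(new_p / (1 - new_p)).reshape(-1, 1))[:, 1]
    return calibrate
```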
