UNIT I
An Introduction on Data Mining and Preprocessing
December26, DataMining:Conceptsand 2
2012 h
Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Top-10 most popular data mining algorithms
Major issues in data mining
Overview of the course
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
[Figure: the KDD process; surviving labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers from top to bottom, with the associated role in parentheses:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data
A microarray may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: kinds of data to be mined
Knowledge view: kinds of knowledge to be discovered
Method view: kinds of techniques utilized
Application view: kinds of applications adapted
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Functionalities
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper → Beer [0.5%, 75%] (correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown or missing numerical values
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
Outlier: data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining and invisible data mining
Protection of data security, integrity, and privacy
Why Data Mining Query Language?
Automated vs. query-driven?
Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
Data mining should be an interactive process
User directs what to be mined
Users must be provided with a set of primitives to be used to communicate with the data mining system
Incorporating these primitives in a data mining query language
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice
Primitives that Define a Data Mining Task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
DMQL: A Data Mining Query Language
Motivation
A DMQL can provide the ability to support ad-hoc and interactive data mining
By providing a standardized language like SQL
Hope to achieve an effect similar to the one SQL has had on relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer, commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier
An Example Query in DMQL
Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, data warehouse systems coupling
No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining: integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
Coupling Data Mining with DB/DW Systems
No coupling: flat file processing, not recommended
Loose coupling
Fetching data from DB/DW
Semi-tight coupling: enhanced DM performance
Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
DM is smoothly integrated into a DB/DW system; a mining query is optimized based on mining query, indexing, and query processing methods, etc.
Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; with a Knowledge Base alongside]
Chapter: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = "" (empty)
noisy: containing errors or outliers
e.g., Salary = "-10"
inconsistent: containing discrepancies in codes or names
e.g., Age = "42", Birthday = "03/07/1997"
e.g., was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume that produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: a holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
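The measures of central tendency listed on this slide can be sketched with Python's standard `statistics` module. The salary values and the helper names `weighted_mean` and `trimmed_mean` are illustrative assumptions, not part of the slides.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical salaries (in thousands), used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

sample_mean = mean(salaries)           # arithmetic mean
sample_median = median(salaries)       # average of the two middle values (even count)
modes = multimode(salaries)            # two modes here, so the data is bimodal
trimmed = trimmed_mean(salaries, 0.1)  # drops the single lowest and highest value
```

The trimmed mean is less sensitive to the extreme value 110 than the plain mean, which is the motivation for chopping extreme values.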
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: $s$, population: $\sigma$)
Variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
Standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
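A minimal sketch of the five-number summary, the 1.5 x IQR outlier rule, and the one-pass ("scalable") variance formulas. The data values are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`; other quartile conventions give slightly different Q1 and Q3.

```python
from statistics import quantiles, variance, pvariance

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

def sample_variance(values):
    """s^2 from the running sums of x and x^2 (a single pass over the data)."""
    n = len(values)
    total, total_sq = sum(values), sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)

def population_variance(values):
    """sigma^2 = (1/N) * sum(x^2) - mu^2."""
    n = len(values)
    mu = sum(values) / n
    return sum(v * v for v in values) / n - mu * mu

# Hypothetical values, used only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
```

Because both variance helpers need only the count, the sum, and the sum of squares, partial results from separate data partitions can be merged, which is what makes the measure algebraic.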
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
history or changes of the data not being registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
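As an illustration of the "attribute mean for all samples belonging to the same class" strategy, here is a minimal sketch. The record layout (a list of dicts with `income` and `risk` keys) and the function name are assumptions made for this example, and it assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

def fill_with_class_mean(rows, attr, label):
    """Replace None in `attr` with the mean of `attr` over rows sharing the same class label."""
    observed = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            observed[row[label]].append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in observed.items()}
    return [
        {**row, attr: class_means[row[label]]} if row[attr] is None else dict(row)
        for row in rows
    ]

# Hypothetical customer records, for illustration only.
customers = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
```

Using the global attribute mean instead would assign the same value to both missing entries, ignoring the class information.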
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
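A short sketch of equal-width partitioning using W = (B - A)/N. The function name and the sample values are illustrative assumptions; the sketch assumes the attribute has at least two distinct values so that the width is non-zero.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((v - lowest) / width), n_bins - 1)
        bins[index].append(v)
    return bins

# Hypothetical price data, for illustration only.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
```

With these values the bins hold 3, 3 and 6 items, which shows how equal-width intervals can be unevenly populated when the data is skewed.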
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
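The worked example above can be reproduced with a few lines of Python. This is a sketch under two assumptions that the slide does not state: the number of values is divisible by the number of bins, and a value equally close to both boundaries is assigned to the lower one. Bin means are rounded to whole dollars, as in the slide.

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```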
Regression
[Figure: data points with the fitted regression line y = x + 1; axes x and y, with X1 and Y1 marked]
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated
Object identification: the same attribute or object may have different names in different databases
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
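Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient)
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (AB)$ is the sum of the AB cross-product.
If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
$r_{A,B} = 0$: no linear correlation (this alone does not establish independence); $r_{A,B} < 0$: negatively correlated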
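The second form of the correlation coefficient (sum of cross-products minus n times the product of the means) can be sketched as follows. The function name and the sample data are assumptions for illustration; the sample standard deviation is used so that it matches the (n - 1) factor in the denominator.

```python
from statistics import mean, stdev

def pearson_r(a, b):
    """Pearson correlation: (sum(AB) - n * mean(A) * mean(B)) / ((n - 1) * s_A * s_B)."""
    n = len(a)
    cross_product = sum(x * y for x, y in zip(a, b))
    return (cross_product - n * mean(a) * mean(b)) / ((n - 1) * stdev(a) * stdev(b))

# Hypothetical attribute values, for illustration only.
attr_a = [1, 2, 3, 4, 5]
attr_b = [2, 4, 6, 8, 10]            # perfectly increasing with attr_a
attr_c = [10, 8, 6, 4, 2]            # perfectly decreasing with attr_a

r_positive = pearson_r(attr_a, attr_b)
r_negative = pearson_r(attr_a, attr_c)
```

A coefficient close to +1 or -1 between two attributes suggests that one of them may be redundant and a candidate for removal during integration.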
Correlation Analysis (Categorical Data)
$\chi^2$ (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the $\chi^2$ value, the more likely the variables are related
The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
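A minimal sketch of the chi-square statistic for a contingency table, where the expected count of each cell is (row total x column total) / grand total. The 2 x 2 counts below are hypothetical and the function name is an assumption of this example.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Hypothetical survey counts: rows are one categorical attribute, columns another.
table = [
    [250, 200],
    [50, 1000],
]
chi2 = chi_square(table)
```

For this table the expected counts are 90, 360, 210 and 840, so every cell deviates by 160 from its expected count and the statistic is large, suggesting that the two attributes are related.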
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to $[new\_min_A, new\_max_A]$
$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
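The three normalization methods can be sketched directly from their formulas. The function names are assumptions, and the mean and standard deviation passed to `z_score` below are hypothetical values chosen for illustration; only the min-max example comes from the slide.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_scaled = min_max(73_600, 12_000, 98_000)   # the slide's example
income_z = z_score(73_600, 54_000, 16_000)        # hypothetical mean and std
scaled, exponent = decimal_scaling([-986, 917])   # hypothetical values
```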
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction, e.g., remove unimportant attributes
Data compression
Numerosity reduction, e.g., fit data into models
Discretization and concept hierarchy generation
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible
Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
Reduces the number of attributes appearing in the discovered patterns, making them easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
Example of Decision Tree Induction
[Figure: decision tree with root test A4? and child tests A1? and A6?]
Heuristic Feature Selection Methods
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and varies slowly with time
Data Compression
[Figure: surviving labels: Original Data, Approximated]
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data
Steps
Normalize input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance" or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
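To make the steps concrete, here is a small standard-library sketch that finds only the first (strongest) principal component by power iteration on the covariance matrix. Power iteration is a simplification chosen for this example; a full PCA would compute all components, typically with a linear algebra library. The data points and function names are hypothetical, and the sketch assumes the starting vector is not orthogonal to the dominant component.

```python
from math import sqrt

def center(rows):
    """Subtract the per-attribute mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Unit vector along the direction of largest variance (power iteration)."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical two-dimensional data, for illustration only.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
centered = center(points)
cov = covariance(centered)
pc1 = first_component(cov)
# Reduced representation: one coordinate per point instead of two.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

Keeping only `scores` and `pc1` is the data reduction step: each point is approximated by its coordinate along the strongest component.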
Principal Component Analysis
[Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2]
Data Reduction Method (1): Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
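A minimal sketch of estimating the two regression coefficients w and b with the least-squares criterion. The closed-form estimates used here are the standard ones for simple linear regression; the function name and the data points are assumptions of this example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b in Y = wX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical points lying exactly on the line y = x + 1.
xs = [1, 2, 3, 4]
ys = [2, 3, 4, 5]
w, b = fit_line(xs, ys)
```

For data reduction, only the two parameters `w` and `b` need to be stored instead of the original points, which is why regression is classed as a parametric numerosity-reduction method.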
Data Reduction Method (2): Histograms
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents)
MaxDiff: set bucket boundaries between each pair of adjacent values for the pairs having the β-1 largest differences, where β is the number of buckets
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is "smeared"
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
Data Reduction Method (4): Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Note: sampling may not reduce database I/Os (page at a time)
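The sampling schemes can be sketched with the standard `random` module: simple random sampling without replacement, with replacement, and stratified sampling. The function names, the `(age_group, id)` record layout, and the rule of keeping at least one record per stratum are assumptions made for this example.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, s, rng):
    """Simple random sample of s distinct records."""
    return rng.sample(data, s)

def sample_with_replacement(data, s, rng):
    """Simple random sample of s records; the same record may be drawn twice."""
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from each stratum (at least one record per stratum)."""
    strata = defaultdict(list)
    for record in data:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical skewed data set: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

srswor = sample_without_replacement(customers, 5, rng)
srswr = sample_with_replacement(customers, 5, rng)
stratified = stratified_sample(customers, key=lambda r: r[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 records could easily contain no senior customer at all, while the stratified sample always preserves the 9:1 ratio.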
Sampling: with or without Replacement
[Figure: samples drawn from the Raw Data]
Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: real numbers, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods: all the methods can be applied recursively
Binning (covered above)
Top-down split, unsupervised
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by $\chi^2$ analysis: supervised (it uses class information), bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Example of 3-4-5 Rule
[Figure: worked example of the 3-4-5 rule on a count distribution; surviving labels: Step 3: (-$1,000 - $2,000); Step 4: (-$400 - $5,000)]
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
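The distinct-value heuristic can be sketched in a few lines: count the distinct values of each attribute and order the attributes from most to fewest. The address records and the function name are hypothetical, and the heuristic inherits the exceptions noted above (for example, weekday has fewer distinct values than month but does not sit above it).

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from the lowest level (most distinct values) to the highest."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)

# Hypothetical address records, for illustration only.
addresses = [
    {"street": "Green St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "Neil St", "city": "Champaign", "state": "Illinois", "country": "USA"},
    {"street": "Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "State St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "King St", "city": "Toronto", "state": "Ontario", "country": "Canada"},
    {"street": "Queen St", "city": "Toronto", "state": "Ontario", "country": "Canada"},
    {"street": "Bay St", "city": "Ottawa", "state": "Ontario", "country": "Canada"},
    {"street": "Pike St", "city": "Seattle", "state": "Washington", "country": "USA"},
]
hierarchy = generate_hierarchy(addresses, ["country", "state", "street", "city"])
```

Reading the result from left to right gives the ordering street < city < state < country.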
Summary
Data preparation or preprocessing is a big issue for both data warehousing and data mining
Descriptive data summarization is needed for quality data preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but data preprocessing is still an active area of research
Review Questions
How is a data warehouse different from a database? How are they similar?
List the five primitives for specifying a data mining task.
State the data mining functionalities.
List the classifications of data mining systems.
Write a note on data mining query languages.
Describe the steps involved in data mining when viewed as a process of knowledge discovery.
State the various kinds of frequent patterns.
Give an example of a multidimensional association rule.
State the need for outlier analysis.
Are all of the patterns interesting? Justify.
What are the possible schemes for integrating a data mining system with a database or data warehouse system?
Bibliography
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
UNIT II
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
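Closed Patterns and Max-Patterns
Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}
Min_sup = 1.
What is the set of closed itemsets?
<a1, ..., a100>: 1
<a1, ..., a50>: 2
What is the set of max-patterns?
<a1, ..., a100>: 1
What is the set of all patterns?
Every non-empty subset of {a1, ..., a100}: $2^{100} - 1$ patterns, far too many to list!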
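The exercise can be checked on a scaled-down analogue, since enumerating 2^100 - 1 itemsets is not feasible. The sketch below uses a hypothetical database with items a to d in place of a1 to a100, and items a and b in place of a1 to a50; it enumerates all frequent itemsets by brute force, which only works for very small item sets.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Brute-force enumeration of every frequent itemset with its support."""
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            support = sum(1 for t in db if itemset <= t)
            if support >= min_sup:
                frequent[itemset] = support
    return frequent

def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of the same support."""
    return {x: s for x, s in frequent.items()
            if not any(x < y and s == sy for y, sy in frequent.items())}

def max_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {x: s for x, s in frequent.items()
            if not any(x < y for y in frequent)}

# Scaled-down analogue of the exercise: <a, b, c, d> and <a, b>, Min_sup = 1.
db = [frozenset("abcd"), frozenset("ab")]
all_patterns = frequent_itemsets(db, min_sup=1)
closed = closed_itemsets(all_patterns)
maximal = max_itemsets(all_patterns)
```

Here there are 2^4 - 1 = 15 frequent patterns, but only two closed itemsets and one max-pattern, mirroring the answers on the slide.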
Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @ VLDB'94)
Freq. pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @ VLDB'94, Mannila, et al. @ KDD'94)
Method:
Initially, scan DB once to get frequent 1-itemsets
Generate length (k+1) candidate itemsets from length k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example
Sup_min = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan, C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
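The pseudo-code can be turned into a short runnable sketch. This is one possible rendering rather than the authors' implementation: the function name `apriori` is hypothetical, candidates are formed by a simple union of frequent k-itemsets followed by the subset check, and supports are counted by a plain scan instead of a hash-tree. It is run on the four-transaction TDB from the example (10: A, C, D; 20: B, C, E; 30: A, B, C, E; 40: B, E) with a minimum support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support} for all frequent itemsets (absolute min_sup)."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items in one scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets of size k+1,
        # kept only if every k-subset is frequent (Apriori pruning)
        candidates = set()
        for a, b in combinations(list(level), 2):
            u = a | b
            if len(u) == k + 1 and all(
                frozenset(s) in level for s in combinations(u, k)
            ):
                candidates.add(u)
        # One scan of the database to count the surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_sup=2)
```

On this input the sketch finds the four frequent 1-itemsets and four frequent 2-itemsets listed in the example, plus the single frequent 3-itemset {B, C, E} with support 2.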
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of candidate generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
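The two steps can be sketched directly on ordered tuples. The function name `generate_candidates` and the tuple representation are assumptions made for illustration; the input is the L3 = {abc, abd, acd, ace, bcd} example used earlier in this section.

```python
def generate_candidates(prev_level):
    """prev_level: set of sorted tuples, all of length k-1. Returns C_k."""
    prev = sorted(prev_level)
    candidates = set()
    # Step 1: self-join on the first k-2 items, requiring p.last < q.last
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune any candidate that has an infrequent (k-1)-subset
    pruned = set()
    for c in candidates:
        subsets = (c[:i] + c[i + 1:] for i in range(len(c)))
        if all(s in prev_level for s in subsets):
            pruned.add(c)
    return pruned

L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
C4 = generate_candidates(L3)
```

The join produces abcd and acde; acde is then dropped because not all of its 3-subsets (for example ade) are in L3, leaving C4 = {abcd}.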
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very large
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
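Example: Counting Supports of Candidates
[Figure: hash-tree of candidate 3-itemsets. The subset function hashes items into three branches (1,4,7 / 2,5,8 / 3,6,9) and is applied to the transaction 1 2 3 5 6, which is split as 1+2356, 12+356 and 13+56 while descending the tree. Leaf itemsets shown: 234, 567, 145, 345, 356, 367, 136, 368, 357, 689, 124, 457, 125, 159, 458.]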
Efficient Implementation of Apriori in SQL
Hard to get good performance out of pure SQL (SQL-92) based approaches alone
Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
Get orders of magnitude improvement
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce the number of transaction database scans
Shrink the number of candidates
Facilitate support counting of candidates
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
Sampling for Frequent Patterns
Select a sample of the original database, mine frequent patterns within the sample using Apriori
Scan the database once to verify the frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan the database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules. In VLDB'96
Bottleneck of Frequent-Pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 ... i100
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
Bottleneck: candidate generation and test
Can we avoid candidate generation?
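The candidate count for a frequent 100-itemset can be verified with exact integer arithmetic. The snippet below is a quick check only; it uses `math.comb` from the Python standard library (3.8+).

```python
from math import comb

# Sum of C(100, k) for k = 1..100: every nonempty subset is a candidate
total = sum(comb(100, k) for k in range(1, 101))

# Binomial theorem: the sum over k = 0..100 is 2^100, so drop the empty set
assert total == 2**100 - 1

print(f"{total:.3e}")  # prints 1.268e+30
```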
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items
"abc" is a frequent pattern
Get all transactions having "abc": DB|abc
"d" is a local frequent item in DB|abc → abcd is a frequent pattern
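The pattern-growth idea can be illustrated without building an FP-tree. The sketch below is a simplification under stated assumptions: it uses plain projected transaction lists (DB|pattern) rather than the compressed tree, the function name `pattern_growth` is hypothetical, and items are extended in alphabetical order so that each pattern is produced once. The data is the four-transaction TDB from the Apriori example.

```python
def pattern_growth(db, min_sup, prefix=()):
    """Yield (pattern, support) by recursively mining projected databases."""
    # Count the local frequency of each item in the current (projected) DB
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue  # not a local frequent item: cannot extend the prefix
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # DB|pattern: transactions containing the item, restricted to the
        # items that come after it in the chosen order
        projected = [[x for x in t if x > item] for t in db if item in t]
        yield from pattern_growth(projected, min_sup, pattern)

tdb = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
patterns = dict(pattern_growth(tdb, min_sup=2))
```

No candidate set is generated or tested here: a pattern is only ever extended by an item that is already known to be frequent in its projected database.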
Construct FP-tree from a Transaction Database
Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant information: infrequent items are gone
Items are in frequency-descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count field)
For the Connect-4 DB, the compression ratio could be over 100
Find Patterns Having P From P-conditional Database
Starting at the frequent-item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Header table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

Conditional pattern bases
Item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1

[Figure: FP-tree rooted at {}, with node-links from the header table; node labels visible in the figure include f:4, c:3, a:3, m:2, b:1, p:1 and a second branch starting at c:1.]
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multilevel mining (Agrawal & Srikant @ VLDB'95, Han & Fu @ VLDB'95)
[Figure: uniform support vs. reduced support. Level 1: Milk [support = 10%], with min_sup = 5% at level 1 under both settings.]
Some rules may be redundant due to "ancestor" relationships between items.
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values; data cube approach
Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @ SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @ SIGMOD'97)
One-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell @ KDD'99)
Sex = female => Wage: mean = $7/hr (overall mean = $9)
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
Flexible support constraints (Wang et al. @ VLDB'02)
Some items (e.g., diamond) may occur rarely but are valuable
Customized sup_min specification and application
Top-k closed frequent patterns (Han, et al. @ ICDM'02)
It is hard to specify sup_min, but top-k with a minimum length (length_min) is more desirable
Dynamically raise sup_min in FP-tree construction and mining, and select the most promising path to mine
Interestingness Measure: Correlations (Lift)
Constraint-based (Query-Directed) Mining
Finding all the patterns in a database autonomously? Unrealistic!
The patterns could be too many but not focused!
Data mining should be an interactive process
The user directs what is to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
User flexibility: provides constraints on what is to be mined
System optimization: explores such constraints for efficient mining (constraint-based mining)
Knowledge type constraint:
classification, association, etc.
Data constraint, using SQL-like queries
find product pairs sold together in stores in Chicago in Dec.'02
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained mining vs. constraint-based search/reasoning
Both are aimed at reducing the search space
Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
Constraint-pushing vs. heuristic search
How to integrate them is an interesting research problem
Constrained mining vs. query processing in DBMS
Database query processing requires finding all answers
Constrained pattern mining shares a similar philosophy with pushing selections deeply in query processing
Frequent Pattern Mining: Summary
Frequent pattern mining: an important task in data mining
Scalable frequent pattern mining methods
Apriori (candidate generation & test)
Projection-based (FP-growth, CLOSET+, ...)
Vertical format approach (CHARM, ...)
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
A good clustering method will produce high-quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically metric: d(i, j)
There is a separate "quality" function that measures the "goodness" of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
Dissimilarity matrix (one mode)
$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
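Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $m_f$ is the mean of variable $f$
Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using the mean absolute deviation is more robust than using the standard deviation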
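The standardization can be sketched for a single variable f. The function name `standardize` and the sample values are assumptions chosen for illustration; the mean m_f is taken to be the ordinary arithmetic mean.

```python
def standardize(values):
    """z-scores using the mean absolute deviation s_f instead of the std. dev."""
    n = len(values)
    m = sum(values) / n                       # mean m_f
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation s_f
    # Note: s is 0 for a constant variable, which would need special handling
    return [(x - m) / s for x in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
```

For these values m_f = 5 and s_f = 2, so the z-scores are -1.5, -0.5, 0.5 and 1.5. Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation.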
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:
$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
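A minimal sketch of the Minkowski distance follows. The function name `minkowski` and the two sample points are hypothetical; q = 1 gives the Manhattan distance named above, and q = 2 gives the familiar Euclidean distance.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1, 2), (4, 6)          # assumed sample objects
manhattan = minkowski(i, j, 1)  # |1-4| + |2-6| = 7
euclidean = minkowski(i, j, 2)  # sqrt(3^2 + 4^2) = 5
```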
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
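The three dissimilarities can be reproduced with a short sketch. It assumes the usual contingency counts for asymmetric binary variables (q = both 1, r = first only, s = second only, with 0-0 matches ignored); the function name is hypothetical and the vectors encode fever, cough and Test-1 to Test-4 from the table, leaving gender out.

```python
def asymmetric_binary_dissimilarity(x, y):
    """d = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# fever, cough, test-1 .. test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
```

Rounded to two decimals the results are 0.33, 0.67 and 0.75, matching the worked values.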
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
apply a logarithmic transformation
$y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their rank as interval-scaled
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
compute ranks $r_{if}$ and
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
and treat $z_{if}$ as interval-scaled
Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient
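The formulas for the two measures are not shown in the text. The sketch below assumes the standard definitions, cosine s(x, y) = x·y / (|x| |y|) and Tanimoto s(x, y) = x·y / (x·x + y·y - x·y); the function names and the two keyword-indicator vectors are hypothetical.

```python
from math import sqrt

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))

def tanimoto(x, y):
    """Tanimoto coefficient: shared weight over total distinct weight."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Two documents over four keywords (assumed toy data)
x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
cos_xy = cosine(x, y)      # one shared keyword out of two each: about 0.5
tani_xy = tanimoto(x, y)   # 1 / (2 + 2 - 1): about 0.333
```

For binary vectors such as these, the Tanimoto coefficient reduces to the number of shared attributes divided by the number of attributes present in either object.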
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Partitioning Algorithms: Basic Concept
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments
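The loop can be sketched in plain Python. This is an illustrative variant rather than a reference implementation: instead of starting from an initial partition, it starts from k given seed points, and the function name `kmeans`, the squared Euclidean distance and the sample points are all assumptions.

```python
def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means on tuples of floats; returns (centers, assignment)."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point
        new_assignment = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:   # no more new assignments: stop
            break
        assignment = new_assignment
        # Recompute each seed point as the centroid (mean point) of its cluster
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
```

With two well-separated groups the assignments stabilise after one update, giving three points per cluster. The result generally depends on the choice of initial seed points.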
Example
[Figure: k-means with K = 2 on a 10 x 10 scatter plot. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
A few variants of k-means, which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang'98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
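Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging the objects a, b, c, d, e into ab, de, cde and finally abcde; divisive clustering (DIANA) runs the same steps in the opposite direction, from Step 0 (abcde) to Step 4 (single objects).]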
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
$N_{Eps}(p) = \{q \in D \mid dist(p,q) \le Eps\}$
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to $N_{Eps}(q)$
core point condition: $|N_{Eps}(q)| \ge MinPts$
[Figure: point p inside the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.]
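The definitions translate almost directly into code. The sketch below is illustrative: the function names and the sample points are assumptions, Euclidean distance is assumed via `math.dist` (Python 3.8+), and a point is counted as a member of its own neighbourhood.

```python
from math import dist

def eps_neighborhood(D, p, eps):
    """N_Eps(p): all points of D within distance eps of p (p included)."""
    return [q for q in D if dist(p, q) <= eps]

def is_core(D, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(D, q, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is core."""
    return dist(p, q) <= eps and is_core(D, q, eps, min_pts)

q = (0.0, 0.0)
p = (0.9, 0.0)
D = [q, p, (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (5.0, 5.0)]

p_from_q = directly_density_reachable(D, p, q, eps=1.0, min_pts=5)
q_from_p = directly_density_reachable(D, q, p, eps=1.0, min_pts=5)
```

Here q has six points in its neighbourhood and is a core point, so p is directly density-reachable from q. The reverse does not hold because p, a border point, has only three points within Eps: the relation is not symmetric.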
Grid-Based Clustering Method
Using a multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
A multi-resolution clustering approach using the wavelet method
CLIQUE: Agrawal, et al. (SIGMOD'98)
On high-dimensional data (thus put in the section on clustering high-dimensional data)
Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption: data are generated by a mixture of underlying probability distributions
Typical methods
Statistical approach
EM (Expectation maximization), AutoClass
Machine learning approach
COBWEB, CLASSIT
Neural network approach
SOM (Self-Organizing Feature Map)
Self-Organizing Feature Map (SOM)
SOMs, also called topological ordered maps, or Kohonen Self-Organizing Feature Maps (KSOMs)
It maps all the points in a high-dimensional source space into a 2- to 3-D target space, s.t. the distance and proximity relationships (i.e., topology) are preserved as much as possible
Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
Clustering is performed by having several units competing for the current object
The unit whose weight vector is closest to the current object wins
The winner and its neighbors learn by having their weights adjusted
SOMs are believed to resemble processing that can occur in the brain
Useful for visualizing high-dimensional data in 2- or 3-D space
Clustering High-Dimensional Data
Clustering high-dimensional data
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measure becomes meaningless due to equi-distance
Clusters may exist only in some subspaces
Methods
Feature transformation: only effective if most dimensions are relevant
PCA & SVD useful only when features are highly correlated/redundant
Feature selection: wrapper or filter approaches
useful to find a subspace where the data have nice clusters
Subspace clustering: find clusters in all the possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering
CLIQUE (Clustering In QUEst)
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifying subspaces of a high-dimensional data space that allow better clustering than the original space
CLIQUE can be considered as both density-based and grid-based
It partitions each dimension into the same number of equal-length intervals
It partitions an m-dimensional data space into non-overlapping rectangular units
A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
A cluster is a maximal set of connected dense units within a subspace
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters
Determine dense units in all subspaces of interest
Determine connected dense units in all subspaces of interest.
Generate a minimal description for the clusters
Determine maximal regions that cover a cluster of connected dense units for each cluster
Determine the minimal cover for each cluster
[Figure: dense units in the (age, salary) and (age, vacation (week)) planes for ages 20 to 60, with threshold = 3; the dense regions of the two planes overlap for ages 30 to 50.]
Strength
automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
insensitive to the order of records in the input and does not presume some canonical data distribution
scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Need user feedback: users know their applications the best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacle & desired clusters
What Is Outlier Discovery?
What are outliers?
The set of objects that are considerably dissimilar from the remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem: define and find outliers in large data sets
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)
Use discordancy tests depending on
data distribution
distribution parameters (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for a single attribute
in many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by statistical methods
We need multi-dimensional analysis without knowing the data distribution
Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
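The DB(p, D) definition can be checked with a naive double loop. The sketch below is a direct reading of the definition and not one of the optimized algorithms listed above; the function name `db_outliers` and the sample data are assumptions, Euclidean distance is assumed, and the fraction is taken over all objects in T, including O itself.

```python
from math import dist

def db_outliers(T, p, D):
    """Return the DB(p, D)-outliers of T using an O(n^2) double loop."""
    outliers = []
    for o in T:
        # Number of objects lying farther than D from o
        far = sum(1 for x in T if dist(o, x) > D)
        if far >= p * len(T):
            outliers.append(o)
    return outliers

T = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = db_outliers(T, p=0.75, D=3.0)
```

Four of the five objects lie farther than 3.0 from (10, 10), which meets the 75% fraction, so it is reported as the only outlier; each of the other points has just one distant object.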
Summary
State the need for market basket analysis.
What are the two conditions that make an association rule interesting?
State the two-step process of association rule mining.
Define the Apriori property.
List the techniques to improve the efficiency of Apriori.
What is clustering analysis?
Give the typical requirements of clustering in data mining.
What is the difference between symmetric and asymmetric binary variables?
State the types of data in cluster analysis.
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
Classification and prediction
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
[Figure: classification process. Labels shown: Training Data, Classification Algorithms, Classifier, Testing Data, Unseen Data (Jeff, Professor, 4), Tenured?]
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Accuracy
classifier accuracy: predicting class label
predictor accuracy: guessing value of predicted attributes
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
[Figure: decision tree with root test "age?" and branches <=30, 31..40 and >40; leaf labels shown: no, yes, yes.]
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
There are no samples left
Classification: a classical problem extensively studied by statisticians and machine learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification methods)
convertible to simple and easy-to-understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
Integration of generalization with decision-tree induction (Kamber et al.'97)
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy classification trees
Semantic interpretation problems
Cube-based multi-level classification
Relevance analysis at multi-levels
Information-gain analysis with dimension + level
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naive Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy a computer, regardless of age, income, ...
P(X): probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
$$ P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} $$
Informally, this can be written as
posterior = likelihood × prior / evidence
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities and significant computational cost
Towards the Naive Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
Suppose there are m classes C1, C2, ..., Cm
Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:
$$ P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} $$
Since P(X) is constant for all classes, only the numerator
$$ P(X \mid C_i)\,P(C_i) $$
needs to be maximized
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028
P(X | buys_computer = no) × P(buys_computer = no) = 0.007
Therefore, X belongs to class (buys_computer = yes)
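The arithmetic of this worked example can be reproduced with a short sketch. It assumes class-conditional independence (the naive Bayes assumption), so P(X|Ci) is the product of the per-attribute probabilities; the dictionary layout and names are illustrative only.

```python
from fractions import Fraction

# Class priors and per-attribute conditional probabilities, copied from the
# counts in the worked example (14 training tuples: 9 "yes", 5 "no").
priors = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}
conditionals = {
    "yes": {"age<=30": Fraction(2, 9), "income=medium": Fraction(4, 9),
            "student=yes": Fraction(6, 9), "credit_rating=fair": Fraction(6, 9)},
    "no": {"age<=30": Fraction(3, 5), "income=medium": Fraction(2, 5),
           "student=yes": Fraction(1, 5), "credit_rating=fair": Fraction(2, 5)},
}

def score(cls, observed):
    """P(X|Ci) * P(Ci), assuming the attributes are independent given the class."""
    likelihood = Fraction(1)
    for attribute_value in observed:
        likelihood *= conditionals[cls][attribute_value]
    return likelihood * priors[cls]

x = ["age<=30", "income=medium", "student=yes", "credit_rating=fair"]
scores = {cls: score(cls, x) for cls in priors}
predicted = max(scores, key=scores.get)
print({cls: round(float(s), 3) for cls, s in scores.items()}, predicted)
```

Exact fractions are used so that the rounded results (0.028 and 0.007) do not depend on intermediate rounding.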
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the naive Bayesian classifier
How to deal with these dependencies?
Bayesian belief networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of the joint probability distribution
Nodes: random variables
Links: dependency
[Figure: directed graph over nodes X, Y, Z, P with edges X → Z, Y → Z, and Y → P]
X and Y are the parents of Z, and Y is the parent of P
No dependency between Z and P
Has no loops or cycles
Bayesian Belief Network: An Example
The conditional probability table (CPT) of a variable shows the conditional probability for each possible combination of values of its parents
Several scenarios for training:
Given both the network structure and all variables observable: learn only the CPTs
Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Unknown structure, all hidden variables: no good algorithms known for this purpose
Ref.: D. Heckerman, Bayesian networks for data mining
Sequential covering algorithm: extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes
Steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rule are removed
The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Compare with decision tree induction, which learns a set of rules simultaneously
Classification:
predicts categorical class labels
E.g., personal homepage classification
x_i = (x1, x2, x3, ...), y_i = +1 or -1
x1: number of occurrences of the word "homepage"
x2: number of occurrences of the word "welcome"
Mathematically:
x ∈ X = ℝ^n, y ∈ Y = {+1, -1}
We want a function f: X → Y
Binary classification problem
The data above the red line belongs to class "x"
The data below the red line belongs to class "o"
Examples: SVM, perceptron, probabilistic classifiers
[Figure: scatter plot of points labeled "x" and "o" separated by a red line]
Advantages
Prediction accuracy is generally high (as compared to Bayesian methods in general)
Robust: works when training examples contain errors
Fast evaluation of the learned target function (Bayesian networks are normally slow)
Criticism
Long training time
Difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery
Not easy to incorporate domain knowledge, which is easy with Bayesian methods in the form of priors on the data or distributions
Classification by Backpropagation
Backpropagation: a neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: a set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier
Weakness
Long training time
Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"
Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained neural networks
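A Neuron (= a perceptron)
[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) are combined into a weighted sum together with a bias term μk, and an activation function f produces the output y]
For example:
$$ y = \operatorname{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right) $$
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping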
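The mapping from input vector to output can be written as a tiny sketch. The function name is hypothetical, and mapping a weighted sum of exactly zero to +1 is an assumption made here, since the slide does not say how sign(0) is treated.

```python
def neuron_output(x, w, bias):
    """Weighted sum of the inputs plus a bias, passed through a sign activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if weighted_sum >= 0 else -1  # assumption: sign(0) is treated as +1

print(neuron_output([1.0, 2.0], [0.5, -1.0], 0.2))  # weighted sum is -1.3
print(neuron_output([1.0, 1.0], [1.0, 1.0], -1.0))  # weighted sum is 1.0
```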
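[Figure: a multilayer feed-forward neural network. The input vector X enters the input layer, passes through a hidden layer to the output layer, which emits the output vector; w_ij is the weight on the connection from unit i to unit j]
Net input to unit j:
$$ I_j = \sum_i w_{ij} O_i + \theta_j $$
Output of unit j:
$$ O_j = \frac{1}{1 + e^{-I_j}} $$
Error of an output-layer unit j (T_j is the target value):
$$ Err_j = O_j (1 - O_j)(T_j - O_j) $$
Error of a hidden-layer unit j (k ranges over the units in the next layer):
$$ Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} $$
Weight and bias updates (l is the learning rate):
$$ w_{ij} = w_{ij} + (l)\, Err_j\, O_i $$
$$ \theta_j = \theta_j + (l)\, Err_j $$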
How Does a Multi-Layer Neural Network Work?
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function
First decide the network topology: number of units in the input layer, number of hidden layers (if > 1), number of units in each hidden layer, and number of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
One input unit per domain value, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer down to the first hidden layer, hence "backpropagation"
Steps
Initialize weights (to small random numbers) and biases in the network
Propagate the inputs forward (by applying the activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
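The forward and backward passes can be sketched for a small network with one hidden layer and a single sigmoid output unit. This is a minimal illustration rather than a complete trainer: the function name, the list-based weight layout, the starting weights, and the learning rate of 0.5 are all assumptions, and the updates follow the usual rules Err_j = O_j(1 - O_j)(T_j - O_j) for the output unit and Err_j = O_j(1 - O_j) Σ_k Err_k w_jk for hidden units.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, w_hidden, b_hidden, w_out, b_out, lr):
    """One backpropagation update for a single training tuple.

    w_hidden[i][j] is the weight from input i to hidden unit j and
    w_out[j] is the weight from hidden unit j to the output unit.
    The lists are updated in place. Returns the prediction made before
    the update together with the new output bias.
    """
    n_in, n_hidden = len(x), len(b_hidden)
    # Propagate the inputs forward.
    hidden = [sigmoid(sum(x[i] * w_hidden[i][j] for i in range(n_in)) + b_hidden[j])
              for j in range(n_hidden)]
    out = sigmoid(sum(hidden[j] * w_out[j] for j in range(n_hidden)) + b_out)
    # Backpropagate the error (hidden errors use the weights from before the update).
    err_out = out * (1 - out) * (target - out)
    err_hidden = [hidden[j] * (1 - hidden[j]) * err_out * w_out[j]
                  for j in range(n_hidden)]
    # Update weights and biases.
    for j in range(n_hidden):
        w_out[j] += lr * err_out * hidden[j]
        b_hidden[j] += lr * err_hidden[j]
        for i in range(n_in):
            w_hidden[i][j] += lr * err_hidden[j] * x[i]
    return out, b_out + lr * err_out

# Hypothetical starting weights; repeated updates on one tuple move the output toward the target.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
b_hidden = [0.0, 0.0]
w_out = [0.2, -0.1]
b_out = 0.0
first, b_out = train_step([1.0, 0.0], 1.0, w_hidden, b_hidden, w_out, b_out, 0.5)
last = first
for _ in range(200):
    last, b_out = train_step([1.0, 0.0], 1.0, w_hidden, b_hidden, w_out, b_out, 0.5)
print(first, last)
```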
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
Classification: based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ ... ∧ pl → "A_class = C" (conf, sup)
Why effective?
It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision tree induction, which considers only one attribute at a time
In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5
CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
Mine possible association rules in the form of
cond-set (a set of attribute-value pairs) → class label
Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
Generation of predictive rules (FOIL-like analysis)
High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al., SIGMOD'05)
Explore high-dimensional classification, using top-k rule groups
Achieve high classification accuracy and high run-time efficiency
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-dimensional space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: a query point xq among training examples labeled + and -, together with the Voronoi diagram induced by 1-NN]
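For the discrete-valued case, the majority vote among the k nearest neighbors can be sketched as follows. The training points, the labels, and the function name are made up for illustration; ties between classes are broken arbitrarily here.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

def knn_classify(train, xq, k):
    """Return the most common label among the k training points nearest to xq."""
    neighbors = sorted(train, key=lambda item: dist(item[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training examples: (point, class label)
train = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((5.0, 5.0), "-"), ((6.0, 5.5), "-"), ((5.5, 6.5), "-"),
]
print(knn_classify(train, (1.2, 1.1), k=3))
print(knn_classify(train, (5.4, 5.6), k=3))
```

Sorting all training points costs O(N log N) per query; the sketch favors clarity over the index structures a real lazy learner would use.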
(Numerical) prediction is similar to classification
construct a model
use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Nonlinear regression
Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
$$ w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x} $$
Multiple linear regression: involves more than one predictor variable
Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
Solvable by an extension of the least squares method or by using software such as SAS or S-Plus
Many nonlinear functions can be transformed into the above
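The least-squares estimates for the single-predictor case translate directly into code. The sketch below assumes the data arrive as two equal-length lists; the function name and the sample points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Points lying exactly on y = 1 + 2x, so the fit recovers w0 = 1 and w1 = 2.
w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w0, w1)
```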
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x^2 + w3 x^3
is convertible to linear form with the new variables x2 = x^2, x3 = x^3:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., a sum of exponential terms)
It is still possible to obtain least squares estimates through extensive calculation on more complex formulae
Other Regression-Based Models
Generalized linear model:
Foundation on which linear regression can be applied to modeling categorical response variables
Variance of y is a function of the mean value of y, not a constant
Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
Poisson regression: models data that exhibit a Poisson distribution
Log-linear models (for categorical data):
Approximate discrete multidimensional probability distributions
Also useful for data compression and smoothing
Regression trees and model trees
Trees to predict continuous values rather than class labels
Regression tree: proposed in the CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model: a multivariate linear equation for the predicted attribute
A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Analogy: consult several doctors and combine their diagnoses, weighting each diagnosis by that doctor's previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Significance tests and ROC curves are useful for model selection
There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic
No single method has been found to be superior over all others for all data sets
Issues such as accuracy, training time, robustness, interpretability, and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method
How does classification work?
How is prediction different from classification?
Define data cleaning.
List the criteria involved in comparing and evaluating classification and prediction methods.
What is a Bayesian classifier?
State Bayes' theorem.
Define backpropagation. How does it work?
What is rule pruning?
What if we would like to predict a continuous value, rather than a categorical label?
Explain linear regression.
Explain polynomial regression.
Give a note on the bootstrap method.
What is boosting? State why it may improve the accuracy of decision tree induction.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
Mining Stream, Time-Series, and Sequence Data
Mining data streams
Mining time-series data
Mining sequence patterns in transactional databases
Mining sequence patterns in biological data
What is stream data? Why stream data systems?
Stream data management systems: issues and solutions
Stream data cube and multidimensional OLAP analysis
Stream frequent pattern analysis
Stream classification
Stream cluster analysis
Research issues
Data Streams
Data streams: continuous, ordered, changing, fast, huge amount
Traditional DBMS: data stored in finite, persistent data sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing and requires fast, real-time response
Data streams nicely capture today's data processing needs
Random access is expensive: single-scan algorithms (can only have one look)
Store only the summary of the data seen thus far
Most stream data are at a pretty low level or multidimensional in nature, and need multi-level and multidimensional processing
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access is too expensive)
Traditional DBMS | Data stream management system (DSMS)
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
"Unbounded" disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Ack.: from Motwani's PODS tutorial slides
[Figure: architecture of a stream query processor. Multiple streams flow into the stream query processor, which uses scratch space (main memory and/or disk) and returns results for a continuous query]
Challenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multidimensional processing and data mining
Most stream data are at a low level or multidimensional in nature
Query types
One-time query vs. continuous query (being evaluated continuously as the stream continues to arrive)
Predefined query vs. ad hoc query (issued online)
Unbounded memory requirements
For real-time response, main memory algorithms should be used
Memory requirement is unbounded if one will join future tuples
Approximate query answering
With bounded memory, it is not always possible to produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets, etc.
Methodologies for Stream Data Processing
Major challenges
Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
Synopses (trade-off between accuracy and storage)
Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)
Compute an approximate answer within a small error range (a factor ε of the actual answer)
Major methods
Random sampling
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
Stream Data Mining vs. Stream Querying
Stream mining: a more challenging task in many cases
It shares most of the difficulties with stream querying
But often requires less "precision", e.g., no join, grouping, sorting
Patterns are hidden and more general than querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multidimensional online analysis of streams
Mining outliers and unusual patterns in stream data
Clustering data streams
Classification of stream data
Most stream data are at a pretty low level or multidimensional in nature: they need multi-level/multidimensional (ML/MD) processing
Analysis requirements
Multidimensional trends and unusual patterns
Capturing important changes at multiple dimensions/levels
Fast, real-time detection and response
Comparing with data cube: similarity and differences
Stream (data) cube or stream OLAP: is this feasible?
Can we implement it efficiently?
A tilted time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User: watches at the o-layer and occasionally needs to drill down to the m-layer
Partial materialization of stream cubes
Full materialization: too space- and time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by "partial"?
Frequent pattern mining is valuable in stream applications
e.g., network intrusion mining (Dokas et al. 02)
Mining precise frequent patterns in stream data: unrealistic, even if they are stored in a compressed form such as an FP-tree
How to mine frequent patterns with good approximation?
Approximate frequent patterns (Manku & Motwani, VLDB'02)
Keep only current frequent patterns? No changes can be detected
Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
Use tilted time window frame
Mining evolution and dramatic changes of frequent patterns
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
Mining precise frequent patterns in stream data: unrealistic, even if they are stored in a compressed form such as an FP-tree
Approximate answers are often sufficient (e.g., trend/pattern analysis)
Example: a router is interested in all flows
whose frequency is at least 1% (σ) of the entire traffic stream seen so far
and feels that an error of 1/10 of σ (ε = 0.1%) is comfortable
How to mine frequent patterns with good approximation?
Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
Major idea: do not trace an item until it becomes frequent
Adv.: guaranteed error bound
Disadv.: keeps a large set of traces
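A compact sketch of Lossy Counting for single items follows. It is written from the published description of the algorithm (buckets of width ⌈1/ε⌉, pruning at every bucket boundary) rather than from the slides, and the function names and the toy stream are assumptions. Each reported count underestimates the true count by at most εN.

```python
from math import ceil

def lossy_count(stream, epsilon):
    """One pass over the stream; returns ({item: (count, max_undercount)}, N)."""
    width = ceil(1 / epsilon)                 # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = (n + width - 1) // width     # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            # A new item may have been seen (and pruned) at most bucket - 1 times before.
            entries[item] = (1, bucket - 1)
        if n % width == 0:                    # bucket boundary: prune rare items
            entries = {e: (c, d) for e, (c, d) in entries.items() if c + d > bucket}
    return entries, n

def frequent_items(entries, n, support, epsilon):
    """Items whose stored count is at least (support - epsilon) * N."""
    return {e for e, (c, _) in entries.items() if c >= (support - epsilon) * n}

# Toy stream: "a" is 50% of the items, "b" 25%, the rest occur once each.
stream = ["a" if i % 2 == 0 else "b" if i % 4 == 1 else f"rare{i}" for i in range(100)]
entries, n = lossy_count(stream, epsilon=0.1)
print(frequent_items(entries, n, support=0.2, epsilon=0.1))
```

Reporting against the lowered threshold (support - ε)N is what ensures no truly frequent item is missed, at the cost of possibly reporting some items whose true frequency lies between (support - ε)N and support·N.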
With high probability, a Hoeffding tree classifies tuples the same way as a conventional (batch) decision tree learner would
Only uses a small sample
Based on the Hoeffding bound principle
Hoeffding bound (additive Chernoff bound)
r: random variable
R: range of r
n: number of independent observations
The mean of r is at least r_avg - ε, with probability 1 - δ, where
$$ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} $$
Hoeffding Tree Algorithm
Hoeffding tree input
S: sequence of examples
X: attributes
G(): evaluation function
δ: desired accuracy
Hoeffding tree algorithm
for each example in S
    retrieve G(Xa) and G(Xb)    // two highest G(Xi)
    if (G(Xa) - G(Xb) > ε)
        split on Xa
        recurse to next node
        break
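The split test in the algorithm can be sketched numerically. The code assumes the standard form of the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)); the function names and the example values of G are hypothetical.

```python
from math import log, sqrt

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the sample mean
    with probability 1 - delta after n independent observations."""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """Split when the observed gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

print(hoeffding_bound(1.0, 0.05, 1000))
print(should_split(0.30, 0.20, 1.0, 0.05, 1000))  # gap of 0.10 after 1000 examples
print(should_split(0.30, 0.20, 1.0, 0.05, 10))    # same gap after only 10 examples
```

Because ε shrinks as n grows, a node that cannot yet decide simply waits for more examples, which is why only a small sample is needed per split.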
[Figure: a decision tree grown incrementally from a data stream, with node tests Packets > 10, Bytes > 60K, and Protocol = http on yes/no branches]
Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Make class predictions in parallel
New examples are added as they come
Weakness
Could spend a lot of time with ties
Memory used with tree expansion
Number of candidate attributes
H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", KDD'03.
Method (derived from the ensemble idea in classification)
train K classifiers from K chunks
for each subsequent chunk
    train a new classifier
    test other classifiers against the chunk
    assign weight to each classifier
    select top K classifiers
[Figure: hierarchical clustering tree with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top]
Hierarchical Tree and Drawbacks
Method:
maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1) medians of weight equal to the sum of the weights of the intermediate medians assigned to them
Drawbacks:
Low quality for evolving data streams (registers only k centers)
Limited functionality in discovering and exploring clusters over different portions of the stream over time
Stream data mining: a rich and ongoing research field
Current research focus in the database community:
DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual patterns
Effectiveness, efficiency and scalability: lots of open problems
Our philosophy on stream data analysis and mining
A multidimensional stream analysis framework
Time is a special dimension: tilted time frame
What to compute and what to save? Critical layers
Partial materialization and precomputation
Mining dynamics of stream data
Mining time-series data
Regression and trend analysis: a statistical approach
Similarity search in time-series analysis
Sequential pattern mining
Markov chain
Hidden Markov model
Time-series database
Consists of sequences of values or events changing with time
Data is recorded at regular intervals
Characteristic time-series components
Trend, cycle, seasonal, irregular
Applications
Financial: stock price, inflation
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation
Categories of Time-Series Movements
Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time
Cyclic movements or cycle variations: long-term oscillations about a trend line or curve
e.g., business cycles, may or may not be periodic
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
Irregular or random movements
Time-series analysis: decomposition of a time series into these four basic movements
Additive model: TS = T + C + S + I
Multiplicative model: TS = T × C × S × I
Estimation of Trend Curve
The freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining
The least-squares method
Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points
The moving-average method
Seasonal index
Set of numbers showing the relative values of a variable during the months of the year
E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months
Deseasonalized data
Data adjusted for seasonal variations for better trend and cyclic analysis
Divide the original monthly data by the seasonal index numbers for the corresponding months
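Both definitions can be sketched in a few lines. The example assumes the index is computed from a single period of data (in practice it would typically be averaged over several years), and the function names and sales figures are made up.

```python
def seasonal_indexes(values):
    """Each period's value as a percentage of the average over all periods."""
    average = sum(values) / len(values)
    return [100 * v / average for v in values]

def deseasonalize(values, indexes):
    """Divide each value by its seasonal index (expressed as a percentage)."""
    return [v * 100 / idx for v, idx in zip(values, indexes)]

# Hypothetical sales for four periods; the average is 100.
sales = [80, 120, 140, 60]
indexes = seasonal_indexes(sales)
print(indexes)
print(deseasonalize(sales, indexes))
```

Deseasonalizing a series with indexes computed from that same series flattens it to its average, which is why real analyses estimate the indexes from earlier years before adjusting new data.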
Estimation of cyclic variations
If (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality
Similarity search in time-series analysis
A normal database query finds exact matches
Similarity search finds data sequences that differ only slightly from the given query sequence
Two categories of similarity queries
Whole matching: find a sequence that is similar to the query sequence
Subsequence matching: find all pairs of similar sequences
Typical applications
Financial market
Market basket data analysis
Scientific databases
Medical diagnosis
Data Transformation
Many techniques for signal analysis require the data to be in the frequency domain
Usually data-independent transformations are used
The transformation matrix is determined a priori
discrete Fourier transform (DFT)
discrete wavelet transform (DWT)
The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
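The distance-preservation property can be checked numerically with a small sketch. It assumes the unitary DFT (scaled by 1/sqrt(n)), for which Parseval's theorem gives equal Euclidean distances in both domains; the naive O(n²) transform and the sample signals are for illustration only.

```python
import cmath
from math import sqrt

def dft(signal):
    """Naive unitary discrete Fourier transform (scaled by 1/sqrt(n))."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / sqrt(n)
            for f in range(n)]

def euclidean(a, b):
    """Euclidean distance; works for real or complex sequences."""
    return sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 4, 4, 5, 7, 7, 9]
time_distance = euclidean(x, y)
freq_distance = euclidean(dft(x), dft(y))
# Keeping only the first few coefficients can never overestimate the true distance.
truncated_distance = euclidean(dft(x)[:3], dft(y)[:3])
print(time_distance, freq_distance, truncated_distance)
```

The last line hints at why the transform is useful for indexing: a distance computed on a few leading coefficients is a lower bound, so it can prune candidates without false dismissals.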
Mining sequence patterns in transactional databases
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy a computer, then a CD-ROM, and then a digital camera, within 3 months
Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
A sequence database:
SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
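The subsequence relation and the support count can be checked with a short sketch. It assumes single-character items and the angle-bracket notation used in the example; the helper names are hypothetical.

```python
def parse(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of item sets."""
    body = text.strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(set(body[i + 1:j]))
            i = j + 1
        else:
            elements.append({body[i]})
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a distinct sequence element, in order."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern_text, database):
    """Number of sequences in the database that contain the pattern."""
    pattern = parse(pattern_text)
    return sum(is_subsequence(pattern, parse(s)) for s in database)

database = ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]
print(is_subsequence(parse("<a(bc)dc>"), parse(database[0])))
print(support("<(ab)c>", database))
```

Matching each pattern element greedily against the earliest possible sequence element is sufficient here, because an earlier match never rules out a later one.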
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns, ICDE'95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
A basic property: Apriori (Agrawal & Srikant 94)
If a sequence S is not frequent, then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>
SPADE(SequentialPAtternDiscoveryusingEquivalentClass)
developedbyZaki2001
Averticalformatsequentialpatternminingmethod
Asequencedatabaseismappedtoalargesetof
Item:<SID,EID>
Sequentialpatternminingisperformedby
growingthesubsequences(patterns)oneitemat
atimebyAprioricandidategeneration
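A minimal sketch of the vertical format follows, using the five-sequence database listed above. It only illustrates the Item: <SID, EID> mapping and one temporal join; it is not Zaki's full algorithm, and the names `to_vertical`, `temporal_join` and `support` are illustrative assumptions.

```python
from collections import defaultdict

# Each sequence is a list of elements; each element is a string of single-character items.
DATABASE = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}


def to_vertical(database):
    """Map each item to its id-list of (SID, EID) pairs; EIDs are numbered from 1."""
    id_lists = defaultdict(list)
    for sid, elements in sorted(database.items()):
        for eid, element in enumerate(elements, start=1):
            for item in sorted(element):
                id_lists[item].append((sid, eid))
    return dict(id_lists)


def temporal_join(first, second):
    """Id-list of the 2-sequence <x y>: occurrences of y that follow some x in the same sequence."""
    return sorted({
        (sid2, eid2)
        for sid1, eid1 in first
        for sid2, eid2 in second
        if sid1 == sid2 and eid1 < eid2
    })


def support(id_list):
    """Support is the number of distinct sequence ids in an id-list."""
    return len({sid for sid, _ in id_list})


VERTICAL = to_vertical(DATABASE)
```

Joining the id-lists of h and b gives support 1, which matches the statement that <hb> is infrequent at min_sup = 2, while <bc> is supported by four sequences.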
Mining data streams
Mining time-series data
Mining sequence patterns in transactional databases
Mining sequence patterns in biological data
Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
DNA: helix-shaped molecule whose constituents are two parallel strands of nucleotides
Nucleotides (bases): Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
DNA is usually represented by sequences of these four nucleotides
This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's and vice versa
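The pairing rule can be written down directly. The sketch below assumes upper-case strings over A, C, G, T; the function names are illustrative. The reversed variant is included because the two strands run in opposite directions, so the second strand is conventionally written reversed.

```python
# Base pairing as stated above: A with T, C with G, and vice versa.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]
```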
Gene: contiguous subparts of single-strand DNA that are templates for producing proteins. Genes can appear in either of the DNA strands.
Chromosomes: compact chains of coiled DNA
Genome: the set of all genes in a given organism.
Noncoding part: the function of DNA material between genes is largely unknown. Certain intergenic regions of DNA are known to play a major role in cell regulation (which controls the production of proteins and their possible interactions with DNA).
Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
Biology Fundamentals (3): Transcription
Proteins: produced from DNA using 3 operations or transformations: transcription, splicing and translation
In eukaryotes (cells with nucleus): genes are only a minute part of the total DNA
In prokaryotes (cells without nucleus): the phase of splicing does not occur (no pre-RNA generated)
DNA is capable of replicating itself (DNA polymerase)
Central dogma: the capability of DNA for replication and undergoing the three (or two) transformations
Genes are transcribed into pre-RNA by a complex ensemble of molecules (RNA polymerase). During transcription T is substituted by the letter U (for uracil).
Pre-RNA can be represented by alternations of sequence segments called exons and introns. The exons represent the parts of pre-RNA that will be expressed, i.e., translated into proteins.
[Figure: Gene (DNA) -> transcription -> RNA -> translation -> Protein -> protein folding; associated fields: genomics, molecular biology, structural biology, biophysics]
Since there are 64 different codons and 20 amino acids, the table lookup for translating each codon into an amino acid is redundant: multiple codons can produce the same amino acid
The table used by nature to perform translation is called the genetic code
Due to the redundancy of the genetic code, certain nucleotide changes in DNA may not alter the resulting protein
Once a protein is produced, it folds into a unique structure in 3D space, with 3 types of components: helices, sheets and coils.
The secondary structure of a protein is its sequence of amino acids, annotated to distinguish the boundary of each component
The tertiary structure is its 3D representation
The vast majority of data are sequences of symbols (nucleotides: genomic data, but also a good amount on amino acids).
Next in volume: microarray experiments and also protein array data
Comparably small: 3D structure of proteins (PDB)
NCBI (National Center for Biotechnology Information) server:
Total 26B bp: 3B bp human genome, then several bacteria (e.g., E. coli), higher organisms: yeast, worm, fruit fly, mouse, and plants
The largest known gene has ~20 million bp and the largest protein consists of ~34k amino acids
PDB has a catalogue of only 45k proteins, specified by their 3D structure (i.e., need to infer protein shape from sequence data)
Computational management and analysis of biological information
Interdisciplinary field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology)
Bioinformatics vs. computational biology (more on algorithm correctness, complexity and other themes central to theoretical CS)
[Figure: Bioinformatics and its sub-areas: Functional Genomics, Genomics, Proteomics, Structural Bioinformatics]
Data Mining & Bioinformatics: Why?
Many biological processes are not well understood
Biological knowledge is highly complex, imprecise, descriptive, and experimental
Biological data is abundant and information-rich
Genomics & proteomics data (sequences), microarray and protein arrays, protein database (PDB), bio-testing data
Huge data banks, rich literature, openly accessible
Largest and richest scientific data sets in the world
Mining: gain biological insight (data/information -> knowledge)
Mining for correlations, linkages between disease and gene sequences, protein networks, classification, clustering, outliers, ...
Find correlations among linkages in literature and heterogeneous databases
Data integration: handling heterogeneous, distributed bio-data
Build Web-based, interchangeable, integrated, multi-dimensional genome databases
Data cleaning and data integration methods become crucial
Mining correlated information across multiple databases itself becomes a data mining task
Typical studies: mining database structures, information extraction from data, reference reconciliation, document classification, clustering and correlation discovery algorithms, ...
Mastery and exploration of existing data mining tools
Genomics, proteomics, and functional genomics (functional networks of genes and proteins)
What are the current bioinformatics tools aiming for?
Inferring a protein's shape and function from a given sequence of amino acids
Finding all the genes and proteins in a given genome
Determining sites in the protein structure where drug molecules can be attached
Comparing sequences: comparing large numbers of long sequences, allowing insertion/deletion/mutation of symbols
Constructing evolutionary (phylogenetic) trees: comparing sequences of different organisms and building trees based on their degree of similarity (evolution)
Detecting patterns in sequences
Search for genes in DNA or subcomponents of a sequence of amino acids
Determining 3D structures from sequences
E.g., infer RNA shape from sequence and protein shape from amino acid sequence
Inferring cell regulation:
Cell modeling from experimental (say, microarray) data
Determining protein function and metabolic pathways: interpret human annotations for protein function and develop a graph database that can be queried
Assembling DNA fragments (provided by sequencing machines)
Using script languages: script on the Web to analyze data and applications
Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
All living organisms are related through evolution
Alignment: lining up sequences to achieve the maximal level of identity
Two sequences are homologous if they share a common ancestor
Sequences to be compared: either nucleotides (DNA/RNA) or amino acids (proteins)
Nucleotides: identical
Amino acids: identical, or if one can be derived from the other by substitutions that are likely to occur in nature
Local vs. global alignments: local, only portions of the sequences are aligned; global, align over the entire length of the sequences
Use a gap ("-") to indicate that it is preferable not to align two symbols
Percent identity: ratio between the number of columns containing identical symbols vs. the number of symbols in the longest sequence
Score of alignment: summing up the matches and counting gaps as negative
Goal:
Given two or more input sequences
Identify similar sequences with long conserved subsequences
Method:
Use substitution matrices (probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions)
Optimal alignment problem: NP-hard (in the general multiple-sequence case)
Heuristic method to find good alignments
Example: align HEAGAWGHEE and PAWHEAE
Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
Which one is better? Scoring alignments
To compare two sequence alignments, calculate a score
PAM (Percent Accepted Mutation) or BLOSUM (Blocks Substitution Matrix) (substitution) matrices: calculate matches and mismatches, considering amino acid substitution
Gap penalty: initiating a gap
Gap extension penalty: extending a gap
Substitution scores (gap penalty: 8, gap extension: 8):
     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15
HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 9
Note: this sum has ten terms, whereas the alignment has eleven columns; the gap opposite the first G (column 4) is not counted. Charging that gap as well gives 9 - 8 = 1.
Exercise: calculate for
HEAGAWGHE-E
P-A--W-HEAE
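Column-by-column scoring can be sketched as follows. The substitution scores are the ones tabulated on this slide; every gap column is charged 8, since the gap and gap-extension penalties are equal here. The names `SCORES`, `substitution` and `alignment_score` are illustrative. Because the sketch scores all eleven columns of an alignment, for the second alignment it returns the ten-term sum of 9 minus one further gap.

```python
# Substitution scores from the slide: outer key is the row symbol, inner key the column symbol.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # gap opening and gap extension both cost 8, so each gap column costs 8


def substitution(a, b):
    """Look up a symmetric substitution score from the partial table."""
    if a in SCORES and b in SCORES[a]:
        return SCORES[a][b]
    if b in SCORES and a in SCORES[b]:
        return SCORES[b][a]
    raise KeyError(f"no score tabulated for {a}/{b}")


def alignment_score(top, bottom):
    """Sum the score of every column of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        total += GAP if "-" in (a, b) else substitution(a, b)
    return total
```

Calling `alignment_score("HEAGAWGHE-E", "--P-AW-HEAE")` scores five gap columns and six aligned pairs; the same function can be used to check the exercise.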
Motivation: complexity of alignment algorithms: O(nm)
Current protein DB: 100 million base pairs
Matching each sequence with a 1,000 base pair query takes about 3 hours!
Heuristic algorithms aim at speeding up at the price of possibly missing the best scoring alignment
Two well-known programs
BLAST: Basic Local Alignment Search Tool
FASTA: Fast Alignment Tool
Both find high-scoring local alignments between a query sequence and a target database
Basic idea: first locate high-scoring short stretches and then extend them
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
There are many cases in which we would like to represent the statistical regularities of some class of sequences
genes
various regulatory sites in DNA (e.g., where RNA polymerase and transcription factors bind)
proteins in a given family
Markov models are well suited to this type of task
Transition probabilities
$\Pr(x_i = a \mid x_{i-1} = g) = 0.16$
$\Pr(x_i = c \mid x_{i-1} = g) = 0.34$
$\Pr(x_i = g \mid x_{i-1} = g) = 0.38$
$\Pr(x_i = t \mid x_{i-1} = g) = 0.12$
$\sum_{x_i} \Pr(x_i \mid x_{i-1} = g) = 1$
A Markov chain model is defined by
a set of states
some states emit symbols
other states (e.g., the begin state) are silent
a set of transitions with associated probabilities
the transitions emanating from a given state define a distribution over the possible next states
Given some sequence x of length L, we can ask how probable the sequence is given our model
For any probabilistic model of sequences, we can write this probability as
$$\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1) = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\, \Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$$
Key property of a (1st-order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$
$$\Pr(x) = \Pr(x_L \mid x_{L-1})\, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\, \Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$
$\Pr(cggt) = \Pr(c)\, \Pr(g \mid c)\, \Pr(g \mid g)\, \Pr(t \mid g)$
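The product formula can be evaluated with a few lines of code. In the sketch below only the transition row for g (0.16, 0.34, 0.38, 0.12) comes from the slide; the uniform initial distribution and the uniform rows for a, c and t are placeholder assumptions so that the example runs.

```python
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

# Assumption: uniform initial distribution Pr(x_1).
INITIAL = dict(UNIFORM)

# TRANSITION[prev][cur] = Pr(x_i = cur | x_{i-1} = prev).
TRANSITION = {
    "a": dict(UNIFORM),                                   # assumption
    "c": dict(UNIFORM),                                   # assumption
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},    # values from the slide
    "t": dict(UNIFORM),                                   # assumption
}


def sequence_probability(x):
    """Pr(x) = Pr(x_1) * product over i >= 2 of Pr(x_i | x_{i-1})."""
    prob = INITIAL[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= TRANSITION[prev][cur]
    return prob
```

Under these assumptions `sequence_probability("cggt")` multiplies Pr(c), Pr(g|c), Pr(g|g) and Pr(t|g), exactly the four factors written above.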
Learning
correct path known for each training sequence -> simple maximum likelihood or Bayesian estimation
correct path not known -> Forward-Backward algorithm + ML or Bayesian estimation
Classification
simple Markov model -> calculate probability of sequence along single path for each model
hidden Markov model -> Forward algorithm to calculate probability of sequence along all paths for each model
Segmentation
hidden Markov model -> Viterbi algorithm to find most probable path for sequence
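The difference between "all paths" (Forward) and "most probable path" (Viterbi) is easiest to see side by side. The two-state model below is entirely hypothetical: the state names, start, transition and emission probabilities are made-up values chosen only so that the sketch runs.

```python
# Hypothetical two-state HMM over the DNA alphabet (all numbers are assumptions).
STATES = ("s1", "s2")
START = {"s1": 0.5, "s2": 0.5}
TRANS = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}}
EMIT = {
    "s1": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4},
    "s2": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
}


def forward(x):
    """Probability of x summed over all state paths."""
    f = {s: START[s] * EMIT[s][x[0]] for s in STATES}
    for sym in x[1:]:
        f = {s: EMIT[s][sym] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(x):
    """Most probable state path for x and the probability of that single path."""
    v = {s: START[s] * EMIT[s][x[0]] for s in STATES}
    back = []
    for sym in x[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            pointers[s] = best
            nxt[s] = v[best] * TRANS[best][s] * EMIT[s][sym]
        back.append(pointers)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):      # follow back-pointers to recover the path
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]
```

The Viterbi probability is never larger than the Forward probability, because the former keeps one path where the latter sums over all of them. For long sequences a practical implementation would work with log-probabilities to avoid underflow.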
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences (sequences of nucleotides or amino acids)
Biosequence analysis can be partitioned into two essential tasks: pairwise sequence alignment and multiple sequence alignment
Dynamic programming approaches, and faster heuristic tools (notably, BLAST), have been popularly used for sequence alignments
Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state
Given a sequence of symbols, x, the forward algorithm finds the probability of obtaining x in the model
The Viterbi algorithm finds the most probable path (corresponding to x) through the model
The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences.
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Graph Indexing
Similarity Search
Classification and Clustering
Summary
Why Graph Mining?
Graphs are ubiquitous
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2D/3D)
Complexity of algorithms: many problems are of high complexity
[Figure: example graphs: aspirin molecule; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); Internet; co-author network]
Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph Mining Algorithms
Incomplete beam search: Greedy (Subdue)
Inductive logic programming (WARMR)
Graph-theory-based approaches
Apriori-based approach
Pattern-growth approach
Start with single vertices
Expand best substructures with a new edge
Limit the number of best substructures
Substructures are evaluated based on their ability to compress input graphs
Using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G\S)
Terminate when no new substructure is discovered
Properties of Graph Mining Algorithms
Search order
breadth vs. depth
Generation of candidate subgraphs
apriori vs. pattern growth
Elimination of duplicate subgraphs
passive vs. active
Support calculation
embedding store or not
Discover order of patterns
path -> tree -> graph
Apriori-Based Approach
[Figure: k-edge frequent graphs G are joined (JOIN) to generate (k+1)-edge candidates G1, G2, ..., Gn]
AGM (Inokuchi, et al. PKDD'00)
generates new graphs with one more node
FSG (Kuramochi and Karypis ICDM'01)
generates new graphs with one more edge
If a graph is frequent, all of its subgraphs are frequent: the Apriori property
An n-edge frequent graph may have 2^n subgraphs
Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Graph Indexing
Similarity Search
Classification and Clustering
Summary
Constrained Patterns
Density
Diameter
Connectivity
Degree
Min, Max, Avg
Highly connected subgraphs in a large graph usually are not artifacts (group, functionality)
Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering
Graph Indexing
Similarity Search
Summary
Graph Clustering
Graph similarity measure
Feature-based similarity measure
Each graph is represented as a feature vector
The similarity is defined by the distance of their corresponding vectors
Frequent subgraphs can be used as features
Structure-based similarity measure
Maximal common subgraph
Graph edit distance: insertion, deletion, and relabel
Graph alignment distance
Subgraph patterns from domain knowledge
Molecular descriptors
Subgraph patterns from data mining
General idea
Each graph is represented as a feature vector x = {x1, x2, ..., xn}, where xi is the frequency of the i-th pattern in that graph
Each vector is associated with a class label
Classify these vectors in a vector space
Graph Mining
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering
Graph Indexing
Similarity Search
Summary
Graph Search
Querying graph databases:
Given a graph database and a query graph, find all the graphs containing this query graph
Sequential scan
Disk I/Os
Subgraph isomorphism testing
An indexing mechanism is needed
DayLight: Daylight.com (commercial)
GraphGrep: Dennis Shasha, et al. PODS'02
Grace: Srinath Srinivasa, et al. ICDE'03
Graph mining has wide applications
Frequent and closed subgraph mining methods
gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques
Frequent and discriminative subgraphs are high-quality indexing features
Similarity search in graph databases
Indexing and feature-based matching
Further development and application exploration
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological System
Mining on Social Network
Summary
Complex systems
Made of
many non-identical elements
connected by diverse interactions.
NETWORK
Natural Networks and Universality
Consider many kinds of networks:
social, technological, business, economic, content, ...
These networks tend to share certain informal properties:
large scale; continual growth
distributed, organic growth: vertices decide who to link to
interaction restricted to links
mixture of local and long-distance connections
abstract notions of distance: geographical, content, social, ...
Do natural networks share more quantitative universals?
What would these universals be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis
All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
size of network generated
average degree of a vertex
fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
with high probability, diameter is small
on average, degree distribution has heavy tail
Thus, we're going to need some basic statistics and probability theory
Social Network Analysis
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological System
Mining on Social Network
Summary
World Wide Web
[Figure: link structure of the World Wide Web. Source: R. Albert, H. Jeong, A.-L. Barabasi, Nature 401, 130 (1999)]
World Wide Web
Expected result: $\langle k \rangle \sim 6$; $P(k = 500) \sim 10^{-99}$; $N_{WWW} \sim 10^{9}$, so $N(k = 500) \sim 10^{-90}$
Real result: $P_{out}(k) \sim k^{-\gamma_{out}}$, $P_{in}(k) \sim k^{-\gamma_{in}}$; $P(k = 500) \sim 10^{-6}$; $N_{WWW} \sim 10^{9}$, so $N(k = 500) \sim 10^{3}$
J. Kleinberg, et al., Proceedings of the ICCC (1999)
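World Wide Web
[Figure: example graph with nodes 1 to 7 and shortest path lengths $l_{15} = 2$ (path 1, 2, 5) and $l_{17} = 4$ (path 1, 3, 4, 6, 7); average path length $\langle l \rangle$ = ??]
Finite size scaling: create a network with N nodes with $P_{in}(k)$ and $P_{out}(k)$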
The number of nodes (N) is not fixed
Networks continuously expand by the addition of new nodes
WWW: addition of new nodes
Citation: publication of new papers
The attachment is not uniform
A node is linked with higher probability to a node that already has a large number of links
WWW: new documents link to well-known sites (CNN, Yahoo, Google)
Citation: well-cited papers are more likely to be cited again
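Growth plus non-uniform attachment can be simulated in a few lines. The sketch below is one common way to realize the two ingredients, assuming each new node brings `m` links and that the process starts from a small complete graph; the function name and parameters are illustrative, not taken from the slides.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumption: start from a complete graph on m + 1 nodes so every node has degree m.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears once per incident edge, so a uniform draw from this
    # list picks a node with probability proportional to its degree.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in chosen:
            edges.append((new, old))
            endpoints.extend((new, old))
    return edges


def degrees(edges):
    """Count the degree of every node in an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

Running `degrees(preferential_attachment(1000, 2))` and inspecting the largest values typically shows a few early nodes with far more links than the rest, which is the "rich get richer" effect described above.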
Case 1: Internet Backbone
Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological System
Mining on Social Network
Summary
Information on the Social Network
Heterogeneous, multi-relational data represented as a graph or network
Nodes are objects
May have different kinds of objects
Objects have attributes
Objects may have labels or classes
Edges are links
May have different kinds of links
Links may have attributes
Links may be directed, are not required to be binary
Links represent relationships and interactions between objects: rich content for mining
Object-Related Tasks
Link-based object ranking
Link-based object classification
Object clustering (group detection)
Object identification (entity resolution)
Link-Related Tasks
Link prediction
Graph-Related Tasks
Subgraph discovery
Graph classification
Generative model for graphs
Link: relationship among data
Two kinds of linked networks
homogeneous vs. heterogeneous
Homogeneous networks
Single object type and single link type
Single-mode social networks (e.g., friends)
WWW: a collection of linked Web pages
Heterogeneous networks
Multiple object and link types
Medical network: patients, doctors, disease, contacts, treatments
Bibliographic network: publications, authors, venues
Intuitions
Links are like citations in literature
A page that is cited often can be expected to be more useful in general
PageRank is essentially citation counting, but improves over simple counting
Consider indirect citations (being cited by a highly cited paper counts a lot)
Smoothing of citations (every page is assumed to have a nonzero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)
Random surfing model:
At any page,
With prob. $\alpha$, randomly jump to a page
With prob. $(1 - \alpha)$, randomly pick a link to follow
[Figure: example graph with four pages d1, d2, d3, d4]
Transition matrix:
$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$
$$p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p_t(d_k)$$
The second term is the same as $\alpha / N$ (why?)
Stationary (stable) distribution, so we ignore time:
$$p(d_i) = \sum_{k=1}^{N} \left[ \frac{\alpha}{N} + (1 - \alpha)\, m_{ki} \right] p(d_k)$$
$$\vec{p} = (\alpha I + (1 - \alpha) M)^{T}\, \vec{p}, \qquad I_{ij} = 1/N$$
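The stationary distribution can be approximated by iterating the update rule until it stops changing. The sketch below uses the four-page transition matrix from this slide; the jump probability 0.15, the tolerance and the function name are assumptions made for illustration. It uses the fact that the scores sum to 1, so the random-jump term contributes the same constant to every page.

```python
# Transition matrix from the slide: M[j][i] is the probability of following
# a link from page d(j+1) to page d(i+1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def pagerank(matrix, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for p = (alpha * I + (1 - alpha) * M)^T p with I_ij = 1/N."""
    n = len(matrix)
    p = [1.0 / n] * n                       # start from the uniform distribution
    for _ in range(max_iter):
        # Random jump contributes alpha / n because the entries of p sum to 1.
        new = [
            alpha / n + (1 - alpha) * sum(matrix[j][i] * p[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p


SCORES = pagerank(M)
```

For this graph d1 receives the highest score, followed by d2, while d3 and d4 tie because each is reached only from d1 with probability 1/2.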
Predict whether a link exists between two entities, based on attributes and other observed links
Applications
Web: predict if there will be a link between two pages
Citation: predicting if a paper will cite another paper
Epidemics: predicting who a patient's contacts are
Methods
Often viewed as a binary classification problem
Local conditional probability model, based on structural and attribute features
Difficulty: sparseness of existing links
Collective prediction, e.g., Markov random field model
Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary
Starting with PageRank and HITS
CrossMine: Classification of multi-relations by link analysis
CrossClus: Clustering over multi-relations by user guidance
More recent work and conclusions
Work on single "flat" relations
[Figure: Doctor, Patient and Contact relations flattened into a single table]
Lose information of linkages and relationships
Cannot utilize information of database structures or schemas
e-Commerce: discovering patterns involving customers, products, manufacturers, ...
Bioinformatics/Medical databases: discovering patterns involving genes, patients, diseases, ...
Networking security: discovering patterns involving hosts, connections, services, ...
Many other relational data sources
Example: Evidence Extraction and Link Discovery (EELD): a DARPA-funded project that emphasizes multi-relational and multi-database linkage analysis
Inductive Logic Programming (ILP)
Find models that are coherent with background knowledge
Multi-relational Clustering Analysis
Clustering objects with multi-relational information
Probabilistic Relational Models
Model cross-relational probabilistic distributions
Efficient Multi-Relational Classification
The CrossMine Approach [Yin et al., 2004]
Find a hypothesis that is consistent with background knowledge (training data)
FOIL, Golem, Progol, TILDE, ...
Background knowledge
Relations (predicates), Tuples (ground facts)
Training examples:
Daughter(mary, ann)  +
Daughter(eve, tom)   +
Daughter(tom, ann)   -
Daughter(eve, ann)   -
Background knowledge:
Parent(ann, mary)  Female(ann)
Parent(ann, tom)   Female(mary)
Parent(tom, eve)   Female(eve)
Parent(tom, ian)
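What "consistent with the training data" means can be checked mechanically. The sketch below encodes the facts listed above and tests one candidate rule, Daughter(X, Y) :- Female(X), Parent(Y, X). That rule is an illustrative hypothesis supplied here, not something stated on the slide, and the two examples without a "+" mark are treated as negative examples.

```python
# Background knowledge (ground facts from the slide).
PARENT = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
FEMALE = {"ann", "mary", "eve"}

# Training examples for Daughter(X, Y): True = positive, False = negative.
# Assumption: the two examples shown without "+" are negatives.
EXAMPLES = {
    ("mary", "ann"): True,
    ("eve", "tom"): True,
    ("tom", "ann"): False,
    ("eve", "ann"): False,
}


def daughter(x, y):
    """Candidate hypothesis: Daughter(X, Y) :- Female(X), Parent(Y, X)."""
    return x in FEMALE and (y, x) in PARENT


def is_consistent(hypothesis, examples):
    """A hypothesis is consistent if it covers every positive and no negative example."""
    return all(hypothesis(x, y) == label for (x, y), label in examples.items())
```

The candidate rule is consistent with all four examples, whereas a rule using only Female(X) would wrongly cover Daughter(eve, ann). An ILP system searches the space of such rules automatically.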
Why not convert multiple relational data into a single table by joins?
Relational databases are designed by domain experts via semantic modeling (e.g., ER modeling)
Indiscriminate joins may lose some essential information
One universal relation may not be appealing in terms of efficiency, scalability and semantics preservation
Our approach to multi-relational classification:
Automatically classifying objects using multiple relations
[Figure: a customer applies for a loan; approve or not?]
How to make decisions on loan applications?
Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study
Search for good predicates across multiple relations
[Figure: loan applications (Applicant #2, #3, #4) linked to the Accounts relation and to other relations such as Orders and Districts]
Accounts
Account ID  Frequency  Open date  District ID
128         monthly    02/27/96   61820
108         weekly     09/23/95   61820
45          monthly    12/09/94   61801
67          weekly     01/01/95   61822
Previous Approaches
Inductive Logic Programming (ILP)
To build a rule
Repeatedly find the best predicate
To evaluate a predicate on relation R, first join the target relation with R
Not scalable because
Huge search space (numerous candidate predicates)
Not efficient to evaluate each predicate
To evaluate a predicate
Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, monthly, ?)
first join the loan relation with the account relation
CrossMine is more scalable and more than one hundred times faster on datasets with reasonable sizes
[Figure: rules such as "A3=1 && A1=2" and "A3=1 && A1=2 && A8=5" covering positive examples and excluding negative examples]
Start from the target relation
Only the target relation is active
Repeat
Search in all active relations
Search in all relations joinable to active relations
Add the best predicate to the current rule
Set the involved relation to active
Until
The best predicate does not have enough gain
Current rule is too long
Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of relationship
[Figure: target relation joined to a relation of relationship that offers no good predicate]
Solution of CrossMine:
When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity.
Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary
Multi-Relational and Multi-DB Mining
Classification over multiple relations in databases
Clustering over multi-relations by User Guidance
Mining across multi-relational databases
Mining across multiple heterogeneous data and information repositories
Summary
Motivation 1: Multi-Relational Clustering
[Figure: database schema with relations Group (name, area), Advise (professor, student, degree), Publish (author, title), Publication (title, year, conf), Register (student, course, semester, unit, grade) and Student (name, office, position); Student is the target of clustering]
Traditional clustering works on a single table
Most data is semantically linked with multiple relations
Thus we need information in multiple relations
Motivation 2: User-Guided Clustering
[Figure: database schema with relations Group (name, area), Advise (professor, student, degree), Publish (author, title), Publication (title, year, conf), Register (student, course, semester, unit, grade) and Student (name, office, position); Student is the target of clustering and a user hint marks one attribute]
A user usually has a goal of clustering, e.g., clustering students by research area
The user specifies this clustering goal to CrossClus
User hint
A user-specified feature (in the form of an attribute) is used as a hint, not as class labels
The attribute may contain too many or too few distinct values
E.g., a user may want to cluster students into 20 clusters instead of 3
Additional features need to be included in cluster analysis
Semi-supervised clustering [Wagstaff, et al. 01; Xing, et al. 02]
The user provides a training set consisting of similar and dissimilar pairs of objects
User-guided clustering
The user specifies an attribute as a hint, and more relevant features are found for clustering
Much information (in multiple relations) is needed to judge whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: tuples to be compared and the user hint]
Searching for Pertinent Features
Different features convey different aspects of information
Features conveying the same aspect of information usually cluster objects in more similar ways
research group areas vs. conferences of publications
Given a user-specified feature
Find pertinent features by computing feature similarity
For clustering, we use CLARANS, a scalable k-medoids [Ng & Han 94] algorithm
Single-link (highest similarity between points in two clusters)?
No, because references to different objects can be connected.
Complete-link (minimum similarity between them)?
No, because references to the same object may be weakly connected.
Average-link (average similarity between points in two clusters)?
A better measure
Procedure
Initialization: use each reference as a cluster
Keep finding and merging the most similar pair of clusters
Until no pair of clusters is similar enough
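The merge procedure with average-link similarity can be sketched as follows. The similarity function and the stopping threshold `min_sim` are supplied by the caller and are assumptions of this sketch; the quadratic search for the best pair is kept deliberately simple.

```python
def average_link(c1, c2, sim):
    """Average pairwise similarity between the members of two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(items, sim, min_sim):
    """Merge the most similar pair of clusters until no pair reaches min_sim."""
    clusters = [[x] for x in items]          # initialization: one cluster per item
    while len(clusters) > 1:
        best, (i, j) = max(
            ((average_link(clusters[i], clusters[j], sim), (i, j))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda scored: scored[0],
        )
        if best < min_sim:                   # no pair is similar enough
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

As a usage example, `agglomerate([1.0, 1.1, 1.2, 5.0, 5.1], lambda a, b: 1 / (1 + abs(a - b)), 0.5)` groups the values around 1 and the values around 5 into two clusters.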
Random walk probability
Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary
Summary
Knowledge is power, but knowledge is hidden in massive links
More stories than Web page rank and search
CrossMine: Classification of multi-relations by link analysis
CrossClus: Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Much more to be explored!
State the importance of the sliding window model for analyzing stream data.
Write a note on data stream management systems (DSMS).
State the difference between a one-time query and a continuous query.
How does the lossy counting algorithm find frequent items?
Give a note on stream query processing.
What is a time-series database?
Define sequential pattern mining.
What is periodicity analysis?
Distinguish between a full periodic pattern and a partial periodic pattern.
State the Markov chain model.
State the importance of synopses in the context of stream data.
State the need for biological sequence analysis.
Discuss constraint-based mining.
What is a social network?
Briefly describe multi-relational data mining.
Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
DataMining:Principlesand
12/26/2012 Algorithms 399
Mining Object, Spatial and Multimedia Data
Mining object data sets
Mining spatial databases and data warehouses
Spatial DBMS
Spatial Data Warehousing
Spatial Data Mining
Spatiotemporal Data Mining
Mining multimedia data
Summary
Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
E.g., hobby = {tennis, hockey, chess, violin, PC_games} generalizes to {sports, music, e_games}
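Generalizing a set-valued attribute amounts to mapping each value through a concept hierarchy and collecting the results. In the sketch below the hierarchy itself is an assumption: the slide gives only the input set and the generalized set, so the individual mappings (in particular chess to sports) are one plausible reading.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy (value -> higher-level concept).
HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",      # assumption: chess is grouped with sports
    "violin": "music",
    "PC_games": "e_games",
}


def generalize(values, hierarchy):
    """Replace every value by its higher-level concept and drop duplicates."""
    return {hierarchy[v] for v in values}


def summarize(values, hierarchy):
    """General behavior of the set: how many values fall under each concept."""
    return Counter(hierarchy[v] for v in values)


HOBBY = {"tennis", "hockey", "chess", "violin", "PC_games"}
```

`generalize(HOBBY, HIERARCHY)` yields the three concepts named above, and `summarize` additionally keeps the number of elements per concept.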
List-valued or sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization
Spatial data:
Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
Require the merge of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in the segment
Summarize its style: based on its tone, tempo, or the major musical instruments played
Object identifier
generalize to the lowest level of class in the class/subclass hierarchies
Class composition hierarchies
generalize only those closely related in semantics to the current one
Construction and mining of object cubes
Extend the attribute-oriented induction method
Apply a sequence of class-based generalization operators on different attributes
Continue until getting a small number of generalized objects that can be summarized concisely in high-level terms
Implementation
Examine each attribute, generalize it to simple-valued data
Construct a multidimensional data cube (object cube)
Problem: it is not always desirable to generalize a set of values to single-valued data
Ex.: Plan Mining by Divide and Conquer
Plan: a sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat>
Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)
E.g., discover travel patterns in an air flight database, or find significant patterns from the sequences of actions in the repair of automobiles
Method
Attribute-oriented induction on sequence data
A generalized travel plan: <small-big*-small>
Divide & conquer: mine characteristics for each subsequence
E.g., big*: same airline, small-big: nearby region
A Travel Database for Plan Mining
Example: Mining a travel planbase
Travel plan table
plan#  action#  departure  depart_time  arrival  arrival_time  airline
1      1        ALB        800          JFK      900           TWA
1      2        JFK        1000         ORD      1230          UA
1      3        ORD        1300         LAX      1600          UA
1      4        LAX        1710         SAN      1800          DAL
2      1        SPI        900          ORD      950           AA
...
Airport info table
airport_code  city  state  region  airport_size
...
MultidimensionalAnalysis
AmultiDmodelfortheplanbase
Strategy
Generalizethe
planbaseindifferent
directions
Lookforsequential
patternsinthe
generalizedplans
Derivehighlevel
plans
Miningobjectdatasets
Miningspatialdatabasesanddatawarehouses
SpatialDBMS
SpatialDataWarehousing
SpatialDataMining
SpatiotemporalDataMining
Miningmultimediadata
Summary
Geometric, geographic or spatial data: space-related data
Example: Geographic space (2-D abstraction of earth surface), VLSI
design, model of human brain, 3-D space representing the
arrangement of chains of protein molecule.
Spatial database system vs. image database systems.
Image database system: handling digital raster image (e.g., satellite
sensing, computer tomography), may also contain techniques for
object analysis and extraction from images and some spatial database
functionality.
Spatial (geometric, geographic) database system: handling objects in
space that have identity and well-defined extents, locations, and
relationships.
GIS (Geographic Information System)
What needs to be represented?
Two important alternative views
Single objects: distinct entities arranged in space, each of
which has its own geometric description
modeling cities, forests, rivers
Spatially related collection of objects: describe space itself
(about every point in space)
modeling land use, partition of a country into districts
Point: location only but not extent
Line (or a curve usually represented by a polyline, a sequence of
line segments):
moving through space, or connections in space (roads, rivers,
cables, etc.)
Region:
Something having extent in 2D space (country, lake, park). It
may have a hole or consist of several disjoint pieces.
Modeling spatially related collection of objects: plane partitions and networks.
A partition: a set of region objects that are required to be disjoint (e.g., a
thematic map). There often exist pairs of objects with a common boundary
(adjacency relationship).
A network: a graph embedded into the plane, consisting of a set of point
objects, forming its nodes, and a set of line objects describing the
geometry of the edges, e.g., highways, rivers, power supply lines.
Other interesting spatially related collections of objects: nested partitions,
or a digital terrain (elevation) model.
[Figure: (a) map of forest stands; (b) object viewpoint of forest stands; (c) field viewpoint of forest stands]
(b) Object viewpoint: objects (point, line, polygon) with attributes
Area-ID  Dominant Tree Species  Area/Boundary
FS1      Pine                   [(0,2),(4,2),(4,4),(0,4)]
FS2      Fir                    [(0,0),(2,0),(2,2),(0,2)]
(c) Field viewpoint: a function f(x,y) over space
f(x,y) = "Pine," 2 <= x <= 4; 2 <= y <= 4
f(x,y) = "Fir," 0 <= x <= 2; 0 <= y <= 2
f(x,y) = "Oak," 2 <= x <= 4; 0 <= y <= 2
Spatial Query Language
Spatial query language
Spatial data types, e.g. point, line segment, polygon, ...
Spatial operations, e.g. overlap, distance, nearest
neighbor, ...
Callable from a query language (e.g. SQL3) of
underlying DBMS
    SELECT S.name
    FROM Senator S
    WHERE S.district.Area() > 300
Standards
SQL3 (a.k.a. SQL 1999) is a standard for query
languages
OGIS is a standard for spatial data types and operators
Both standards enjoy wide support in industry
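To make the Area() operation in the query above concrete, here is a rough Python analogue. It is a sketch under assumptions: districts are taken to be simple polygons given as vertex lists, the senator records are made-up sample data, and the area is computed with the shoelace formula rather than any particular spatial DBMS function.

```python
def polygon_area(vertices):
    """Area of a simple polygon (list of (x, y) vertices) via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical sample data standing in for the Senator table.
senators = [
    {"name": "A", "district": [(0, 0), (20, 0), (20, 20), (0, 20)]},  # area 400
    {"name": "B", "district": [(0, 0), (10, 0), (10, 10), (0, 10)]},  # area 100
]

# Same selection as: SELECT S.name FROM Senator S WHERE S.district.Area() > 300
names = [s["name"] for s in senators if polygon_area(s["district"]) > 300]
print(names)  # ['A']
```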
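[Figure: two-step spatial query processing. FILTER: the MBRs of data objects A, B, C, D are tested against the query region, leaving candidates C and D. REFINE: the exact geometry of the candidates is tested, leaving data object C.]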
Join Query Processing
[Figure: plane-sweep join. (a) rectangles R1-R4 and S1-S3 in the x-y plane with a sweep line; (b) a rectangle T with lower corner (T.xl, T.yl) and upper corner (T.xu, T.yu); (c) sweep order R4, S2, S1, R1, S3, R2, R3]
File Organization and Indices
Spatial Indexing
B-tree works on spatial data with space-filling curve
R-tree: height-balanced extension of B+ tree
Objects are represented as MBR
provides better performance
[Figure: R-tree example with root entries A, B, C and leaf entries d, e, f, g, h, i, j]
Spatial Query Optimization
Spatial data warehouse: Integrated, subject-oriented, time-variant, and
nonvolatile spatial data repository
Spatial data integration: a big issue
Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
Geo-specific formats (geographic vs. equal area projection, etc.)
Spatial data cube: multidimensional spatial database
Both dimensions and measures may contain spatial components
Dimensions
nonspatial
e.g. "25-30 degrees" generalizes to "hot" (both are strings)
spatial-to-nonspatial
e.g. Seattle generalizes to description "Pacific Northwest" (as a string)
spatial-to-spatial
e.g. Seattle generalizes to Pacific Northwest (as a spatial region)
Measures
numerical (e.g. monthly revenue of a region)
distributive (e.g. count, sum)
algebraic (e.g. average)
holistic (e.g. median, rank)
spatial
collection of spatial pointers (e.g. pointers to all regions with
temperature of 25-30 degrees in July)
Spatial association rule: A => B [s%, c%]
A and B are sets of spatial or nonspatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
s% is the support and c% is the confidence of the rule
Examples
1) is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water)
[7%, 85%]
2) What kinds of objects are typically located close to golf courses?
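The support and confidence of a spatial association rule can be computed once the predicates have been evaluated for each object. The sketch below assumes that step has already happened and that each object is a dictionary of Boolean predicate values; the ten sample objects and the helper name rule_stats are hypothetical.

```python
def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n = len(objects)
    both = sum(1 for o in objects if antecedent(o) and consequent(o))
    ante = sum(1 for o in objects if antecedent(o))
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical pre-computed predicates for 10 objects.
objects = (
    [{"large_town": True, "highway": True, "water": True}] * 3
    + [{"large_town": True, "highway": True, "water": False}]
    + [{"large_town": False, "highway": True, "water": True}] * 6
)

# is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water)
support, confidence = rule_stats(
    objects,
    antecedent=lambda o: o["large_town"] and o["highway"],
    consequent=lambda o: o["water"],
)
print(support, confidence)  # 0.3 0.75
```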
Spatial data tends to be highly self-correlated
Example: Neighborhood, Temperature
Items in traditional data are independent of each other,
whereas properties of locations in a map are often
auto-correlated.
First law of geography:
"Everything is related to everything, but nearby things are
more related than distant things."
Methods in classification
Decision-tree classification, Naive Bayesian classifier +
boosting, neural network, logistic regression, etc.
Association-based multidimensional classification
Example: classifying house value based on proximity to
lakes, highways, mountains, etc.
Assuming learning samples are independent of each other
Spatial auto-correlation violates this assumption!
Popular spatial classification methods
Spatial auto-regression (SAR)
Markov random field (MRF)
Spatial AutoRegression
Linear Regression
Y = X\beta + \epsilon
Spatial autoregressive regression (SAR)
Y = \rho W Y + X\beta + \epsilon
W: neighborhood matrix.
\rho: models strength of spatial dependencies
\epsilon: error vector
The estimates of \rho and \beta can be derived using maximum likelihood
theory or Bayesian statistics
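As a short worked step, the SAR model can be solved for Y, which shows how each observation depends on its neighbors through W. The only assumption added here is that the matrix I - \rho W is invertible.

```latex
\begin{aligned}
Y &= \rho W Y + X\beta + \epsilon \\
(I - \rho W)\,Y &= X\beta + \epsilon \\
Y &= (I - \rho W)^{-1}\,(X\beta + \epsilon)
\end{aligned}
```

Setting \rho = 0 removes the spatial term and recovers the ordinary linear regression Y = X\beta + \epsilon.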
Bayesian classifiers
MRF
A set of random variables whose interdependency relationship is
represented by an undirected graph (i.e., a symmetric neighborhood
matrix) is called a Markov Random Field.
\Pr(C_i \mid X, L_i) = \frac{\Pr(X \mid C_i, L_i)\,\Pr(C_i \mid L_i)}{\Pr(X)}
L_i denotes the set of labels in the neighborhood of s_i, excluding labels at s_i
Pr(C_i | L_i) can be estimated from training data by examining the ratios of
the frequencies of class labels to the total number of locations
Pr(X | C_i, L_i) can be estimated using kernel functions from the observed
values in the training dataset
Function
Detect changes and trends along a spatial dimension
Study the trend of nonspatial or spatial data changing
with space
Application examples
Observe the trend of changes of the climate or vegetation
with increasing distance from an ocean
Crime rate or unemployment rate change with regard to
city geo-distribution
Mining clusters: k-means, k-medoids,
hierarchical, density-based, etc.
Analysis of distinct features of the clusters
Constraints on individual objects
Simple selection of relevant objects before clustering
Clustering parameters as constraints
K-means, density-based: radius, min # of points
Constraints specified on clusters using SQL aggregates
Sum of the profits in each cluster > $1 million
Constraints imposed by physical obstacles
Clustering with obstructed distance
[Figure: clusters C1, C2, C3, C4 separated by obstacles (a river and a mountain)]
Outlier
Global outliers: Observations which are inconsistent with the
rest of the data
Spatial outliers: A local instability of nonspatial attributes
Spatial outlier detection
Graphical tests
Variogram clouds
Moran scatterplots
Quantitative tests
Scatterplots
Spatial Statistic Z(S(x))
Quantitative tests are more accurate than graphical tests
Mining object data sets
Mining spatial databases and data warehouses
Spatial DBMS
Spatial Data Warehousing
Spatial Data Mining
Spatiotemporal Data Mining
Mining multimedia data
Summary
Description-based retrieval systems
Build indices and perform object retrieval based on image
descriptions, such as keywords, captions, size, and time of
creation
Labor-intensive if performed manually
Results are typically of poor quality if automated
Content-based retrieval systems
Support retrieval based on the image content, such as
color histogram, texture, shape, objects, and wavelet
transforms
Image sample-based queries
Find all of the images that are similar to the given image
sample
Compare the feature vector (signature) extracted from the
sample with the feature vectors of images that have
already been extracted and indexed in the image database
Image feature specification queries
Specify or sketch image features like color, texture, or
shape, which are translated into a feature vector
Match the feature vector with the feature vectors of the
images in the database
Color histogram-based signature
The signature includes color histograms based on color
composition of an image regardless of its scale or orientation
No information about shape, location, or texture
Two images with similar color composition may contain very
different shapes or textures, and thus could be completely
unrelated in semantics
Multifeature composed signature
Define different distance functions for color, shape, location,
and texture, and subsequently combine them to derive the
overall result
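A color histogram-based signature and one possible comparison between two signatures can be sketched as follows. This is a minimal illustration under assumptions: images are plain lists of (r, g, b) pixels, colors are quantized into a hypothetical 4 bins per channel, and histogram intersection is used as the similarity measure (the source does not name a specific one).

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    """Normalized histogram over quantized RGB colors (bins per channel)."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {color: c / total for color, c in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of per-bin minima of two normalized histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

red, blue = (255, 0, 0), (0, 0, 255)
img_a = [red, red, red, blue]
img_b = [blue, red, red, red]      # same colors, different pixel positions
img_c = [blue, blue, blue, blue]

sim_ab = histogram_intersection(color_histogram(img_a), color_histogram(img_b))
sim_ac = histogram_intersection(color_histogram(img_a), color_histogram(img_c))
print(sim_ab, sim_ac)  # 1.0 0.25
```

Images img_a and img_b get a perfect score even though their pixels are arranged differently, which is the limitation noted above: the signature carries no shape or location information.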
Wavelet-based signature
Use the dominant wavelet coefficients of an image as its
signature
Wavelets capture shape, texture, and location information
in a single unified framework
Improved efficiency and reduced the need for providing
multiple search primitives
May fail to identify images containing similar objects that
are in different locations.
WALRUS: [NRS99] by Natsev, Rastogi, and Shim
Similar images may contain similar regions, but a region in one
image could be a translation or scaling of a matching region in
the other
Wavelet-based signature with region-based granularity
Define regions by clustering signatures of windows of
varying sizes within the image
Signature of a region is the centroid of the cluster
Similarity is defined in terms of the fraction of the area of
the two images covered by matching pairs of regions from
two images
Multidimensional Analysis of Multimedia Data
Multimedia data cube
Design and construction similar to that of traditional data
cubes from relational data
Contain additional dimensions and measures for multimedia
information, such as color, texture, and shape
The database does not store images but their descriptors
Feature descriptor: a set of vectors for each visual
characteristic
Color vector: contains the color histogram
MFC (Most Frequent Color) vector: five color centroids
MFO (Most Frequent Orientation) vector: five edge orientation
centroids
Layout descriptor: contains a color layout vector and an edge
layout vector
[Figure: multimedia data cube with cross-tabs and group-bys over colour (RED, WHITE, BLUE), format (JPEG, GIF), and size. Dimensions listed: format of image, duration, colors, textures, keywords, size, width, height, Internet domain of image, Internet domain of parent pages, image popularity]
Mining Multimedia Databases in
Special features:
Need # of occurrences besides Boolean existence, e.g.,
"Two red squares and one blue circle" implies theme "air
show"
Need spatial relationships
Blue on top of white squared object is associated with
brown bottom
Need multi-resolution and progressive refinement mining
It is expensive to explore detailed associations among
objects at high resolution
It is crucial to ensure the completeness of search at
multi-resolution space
Difficult to implement a data cube efficiently given a large
number of dimensions, especially serious in the case of
multimedia data cubes
Many of these attributes are set-oriented instead of
single-valued
Restricting number of dimensions may lead to the modeling of
an image at a rather rough, limited, and imprecise scale
More research is needed to strike a balance between efficiency
and power of representation
Mining object data needs feature/attribute-based
generalization methods
Spatial, spatiotemporal and multimedia data mining is one of
the important research frontiers in data mining with broad
applications
Spatial data warehousing, OLAP and mining facilitate
multidimensional spatial analysis and finding spatial
associations, classifications and trends
Multimedia data mining needs content-based retrieval and
similarity search integrated with mining methods
Text mining, natural language processing and
information extraction: An Introduction
Text categorization methods
Mining Web linkage structures
Summary
[Figure: levels of natural language analysis.
Syntactic analysis (parsing): sentence, verb phrase, prep phrase.
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_). yields Scared(b1).
Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back.]
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)
General NLP: Too Difficult!
Word-level ambiguity
"design" can be a noun or a verb (Ambiguous POS)
"root" has multiple meanings (Ambiguous sense)
Syntactic ambiguity
"natural language processing" (Modification)
"A man saw a boy with a telescope." (PP Attachment)
Anaphora resolution
"John persuaded Bill to buy a TV for himself."
(himself = John or Bill?)
Presupposition
"He has quit smoking." implies that he smoked before.
Part-of-Speech Tagging
Training data (Annotated text)
This sentence serves as an example of annotated text
Det  N        V1     P  Det N       P  V2        N
Pick the most likely tag sequence.
p(w_1, \ldots, w_k, t_1, \ldots, t_k) =
\begin{cases}
p(t_1 \mid w_1) \cdots p(t_k \mid w_k)\, p(w_1) \cdots p(w_k) & \text{independent assignment (most common tag)} \\
\prod_{i=1}^{k} p(w_i \mid t_i)\, p(t_i \mid t_{i-1}) & \text{partial dependency (HMM)}
\end{cases}
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)
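Picking the most likely tag sequence under the HMM factorization, the product over i of p(w_i | t_i) p(t_i | t_{i-1}), is commonly done with the Viterbi algorithm. The sketch below is one small way to write it; the three tags and all probability values are made-up toy numbers, and start_p stands in for the transition from a sentence-start state.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence under p(w_i | t_i) * p(t_i | t_{i-1})."""
    # best[i][t] = (probability, previous tag) of the best path ending in tag t at word i
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            row[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(w, 0.0), p) for p in tags
            )
        best.append(row)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy model (all numbers are illustrative assumptions).
tags = ["Det", "N", "V"]
start_p = {"Det": 0.8, "N": 0.1, "V": 0.1}
trans_p = {
    "Det": {"Det": 0.0, "N": 0.9, "V": 0.1},
    "N": {"Det": 0.1, "N": 0.2, "V": 0.7},
    "V": {"Det": 0.5, "N": 0.4, "V": 0.1},
}
emit_p = {
    "Det": {"the": 1.0},
    "N": {"design": 0.6, "works": 0.4},
    "V": {"design": 0.5, "works": 0.5},
}
print(viterbi(["the", "design", "works"], tags, start_p, trans_p, emit_p))
# ['Det', 'N', 'V']
```

Here "design" is tagged as a noun because the transition Det to N is far more likely than Det to V, which is the kind of dependency the independent-assignment baseline ignores.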
Word Sense Disambiguation
The difficulties of computational linguistics are rooted in ambiguity.
(tags around the target word "rooted": N Aux V P N)
Supervised Learning
Features:
Neighboring POS tags (N Aux V P N)
Neighboring words (linguistics are rooted in ambiguity)
Stemmed form (root)
Dictionary/Thesaurus entries of neighboring words
High co-occurrence words (plant, tree, origin, ...)
Other senses of word within discourse
Algorithms:
Rule-based Learning (e.g. IG guided)
Statistical Learning (e.g. Naive Bayes)
Unsupervised Learning (e.g. Nearest Neighbor)
Parsing
Choose most likely parse tree
Probabilistic CFG
[Figure: parse tree with S -> NP VP; probability of this tree = 0.000015]
Ambiguity
A man saw a boy with a telescope.
Computational Intensity
Imposes a context horizon.
Text mining, natural language processing and
information extraction: An Introduction
Text information system and information
retrieval
Text categorization methods
Mining Web linkage structures
Summary
Text databases (document databases)
Large collections of documents from various sources: news
articles, research papers, books, digital libraries, e-mail
messages, and Web pages, library database, etc.
Data stored is usually semi-structured
Traditional information retrieval techniques become
inadequate for the increasingly vast amounts of text data
Information retrieval
A field developed in parallel with database systems
Information is organized into (a large number of) documents
Information retrieval problem: locating relevant documents
based on user input, such as keywords or example
documents
Typical IR systems
Online library catalogs
Online document management systems
Information retrieval vs. database systems
Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
Some IR problems are not addressed well in DBMS, e.g.,
unstructured documents, approximate search using
keywords and relevance
Basic Measures for Text Retrieval
[Figure: Venn diagram of the Relevant and Retrieved sets within All Documents; their overlap is Relevant & Retrieved]
Precision: the percentage of retrieved documents that are in fact relevant to the
query (i.e., "correct" responses)
precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}
Recall: the percentage of documents that are relevant to the query and were, in
fact, retrieved
recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}
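Both measures follow directly from set operations. A minimal sketch, assuming documents are identified by ids; the function name and the sample ids are illustrative only:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Documents 3 and 4 are both relevant and retrieved.
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(p, r)  # precision is 2/3, recall is 0.5
```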
Information Retrieval Techniques
Basic Concepts
A document can be described by a set of representative
keywords called index terms.
Different index terms have varying relevance when used to
describe document contents.
This effect is captured through the assignment of numerical
weights to each index term of a document (e.g.: frequency,
tf-idf)
DBMS Analogy
Index Terms -> Attributes
Weights -> Attribute Values
Index Terms (Attribute) Selection:
Stop list
Word stem
Index terms weighting methods
Terms x Documents Frequency Matrices
Information Retrieval Models:
Boolean Model
Vector Model
Probabilistic Model
Boolean Model
Consider that index terms are either present or absent in a
document
As a result, the index term weights are assumed to be all
binaries
A query is composed of index terms linked by three
connectives: not, and, and or
e.g.: car and repair, plane or airplane
The Boolean model predicts that each document is either
relevant or non-relevant based on the match of a
document to the query
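The Boolean model can be illustrated with a small evaluator. In this sketch a query is written as a nested tuple such as ("and", "car", "repair"); that encoding, the function names, and the three sample documents are assumptions made for the example.

```python
def matches(query, terms):
    """Evaluate a Boolean query against a document's set of index terms."""
    if isinstance(query, str):
        return query in terms          # binary weight: term present or absent
    op, *args = query
    if op == "not":
        return not matches(args[0], terms)
    if op == "and":
        return all(matches(a, terms) for a in args)
    if op == "or":
        return any(matches(a, terms) for a in args)
    raise ValueError(f"unknown connective: {op}")

def retrieve(query, docs):
    """Ids of the documents the model predicts to be relevant."""
    return sorted(d for d, terms in docs.items() if matches(query, terms))

docs = {
    1: {"car", "repair", "shop"},
    2: {"plane", "ticket"},
    3: {"car", "rental"},
}
print(retrieve(("and", "car", "repair"), docs))           # [1]
print(retrieve(("or", "plane", "airplane"), docs))        # [2]
print(retrieve(("and", "car", ("not", "repair")), docs))  # [3]
```

Every document is either returned or not; there is no ranking, which is the main difference from the vector and probabilistic models.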
A document is represented by a string, which can be identified
by a set of keywords
Queries may use expressions of keywords
E.g., car and repair shop, tea or coffee, DBMS but not Oracle
Queries and retrieval should consider synonyms, e.g., repair
and maintenance
Major difficulties of the model
Synonymy: A keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining
Polysemy: The same keyword may mean different things in
different contexts, e.g., mining
Finds similar documents based on a set of common keywords
Answer should be based on the degree of relevance based on
the nearness of the keywords, relative frequency of the
keywords, etc.
Basic techniques
Stop list
Set of words that are deemed "irrelevant", even though
they may appear frequently
E.g., a, the, of, for, to, with, etc.
Stop lists may vary when document set varies
Word stem
Several words are small syntactic variants of each other
since they share a common word stem
E.g., drug, drugs, drugged
A term frequency table
Each entry frequent_table(i, j) = # of occurrences of the
word t_i in document d_j
Usually, the ratio instead of the absolute number of
occurrences is used
Similarity metrics: measure the closeness of a document to a
query (a set of keywords)
Relative term occurrences
Cosine distance: sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}
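The pieces above (stop list, relative term frequencies, cosine measure) fit together as in the following sketch. It assumes whitespace tokenization and skips stemming for brevity; the stop list is the one given in the slide, while the function names and sample texts are illustrative.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "of", "for", "to", "with"}

def term_vector(text):
    """Relative term frequencies after stop-word removal (no stemming here)."""
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

query = term_vector("data mining")
doc_a = term_vector("mining of the data with data cubes")
doc_b = term_vector("travel to the airport")
print(round(cosine(query, doc_a), 4))  # 0.866
print(cosine(query, doc_b))            # 0.0
```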
Indexing Techniques
Inverted index
Maintains two hash- or B+-tree indexed tables:
document_table: a set of document records <doc_id, postings_list>
term_table: a set of term records <term, postings_list>
Answer query: Find all docs associated with one or a set of terms
+ easy to implement
- does not handle synonymy and polysemy well, and posting lists could be
too long (storage could be very large)
Signature file
Associate a signature with each document
A signature is a representation of an ordered list of terms that describe the
document
Order is obtained by frequency analysis, stemming and stop lists
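The term_table side of an inverted index can be sketched with a dictionary from terms to postings lists. This is an in-memory illustration only (a real system would use the hash or B+-tree indexed tables mentioned above); the function names and sample documents are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def lookup(index, terms):
    """Docs associated with all of the given terms (intersection of postings lists)."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "spatial data mining", 2: "text mining", 3: "spatial index"}
index = build_inverted_index(docs)
print(index["mining"])                       # [1, 2]
print(lookup(index, ["spatial", "mining"]))  # [1]
```

Because lookup matches terms literally, a query for "maintenance" would miss a document that only says "repair", which is the synonymy weakness listed above.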
Documents and user queries are represented as m-dimensional vectors,
where m is the total number of index terms in the document collection.
The degree of similarity of the document d with regard to the query q is
calculated as the correlation between the vectors that represent them,
using measures such as the Euclidean distance or the cosine of the angle
between these two vectors.
Basic assumption: Given a user query, there is a set of
documents which contains exactly the relevant documents and
no other (ideal answer set)
Querying process as a process of specifying the properties of
an ideal answer set. Since these properties are not known at
query time, an initial guess is made
This initial guess allows the generation of a preliminary
probabilistic description of the ideal answer set which is used
to retrieve the first set of documents
An interaction with the user is then initiated with the purpose
of improving the probabilistic description of the answer set
Keyword-based association analysis
Automatic document classification
Similarity detection
Cluster documents by a common author
Cluster documents containing information from a common
source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual
patterns
Hypertext analysis
Patterns in anchors/links
Anchor text correlations with linked objects
Motivation
Collect sets of keywords or terms that occur frequently together and then
find the association or correlation relationships among them
Association Analysis Process
Preprocess the text data by parsing, stemming, removing stop words, etc.
Invoke association mining algorithms
Consider each document as a transaction
View a set of keywords in the document as a set of items in the transaction
Term-level association mining
No need for human effort in tagging documents
The number of meaningless results and the execution time are greatly reduced
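Treating each document as a transaction of keywords, the simplest association step is counting keyword pairs that co-occur in enough documents. The sketch below only finds frequent pairs by brute force; it is not a full association mining algorithm such as Apriori, and the sample keyword sets and threshold are made up.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Keyword pairs whose fraction of containing documents is >= min_support."""
    counts = Counter()
    for keywords in transactions:
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Each document is reduced to its set of keywords (after preprocessing).
documents = [
    {"data", "mining", "text"},
    {"data", "mining"},
    {"text", "retrieval"},
    {"data", "mining", "retrieval"},
]
print(frequent_pairs(documents, min_support=0.5))  # {('data', 'mining'): 0.75}
```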
Motivation
Automatic classification for the large number of on-line text documents
(Web pages, e-mails, corporate intranets, etc.)
Classification Process
Data preprocessing
Definition of training set and test sets
Creation of the classification model using the selected classification
algorithm
Classification model validation
Classification of new/unknown text documents
Text document classification differs from the classification of relational
data
Document databases are not structured according to attribute-value
pairs
Classification Algorithms:
Support Vector Machines
K-Nearest Neighbors
Naive Bayes
Neural Networks
Decision Trees
Association rule-based
Boosting
Motivation
Automatically group related documents based on their
contents
No predetermined training sets or taxonomies
Generate a taxonomy at runtime
Clustering Process
Data preprocessing: remove stop words, stem, feature
extraction, lexical analysis, etc.
Hierarchical clustering: compute similarities applying
clustering algorithms.
Model-based clustering (Neural Network Approach): clusters
are represented by "exemplars" (e.g.: SOM)
Text Categorization
Pre-given categories and labeled document
examples (Categories may form hierarchy)
Classify new documents
A standard classification (supervised learning)
problem
[Figure: a categorization system, trained on labeled documents, assigning new documents to categories such as Sports, Business, Education, Science]
Applications
News article classification
Automatic email filtering
Webpage classification
Word sense disambiguation
Given two documents
Similarity definition
dot product
normalized dot product (or cosine)
[Figure: weighted term vectors of doc1, doc2, doc3, and newdoc over terms such as text, mining, search, engine, travel, map]
To whom is newdoc more similar?
Sim(newdoc, doc1) = 4.8*2.4 + 4.5*4.5
Sim(newdoc, doc2) = 2.4*2.4
Sim(newdoc, doc3) = 0
Wide application domain
Comparable effectiveness to professionals
Manual TC is not 100% and unlikely to improve
substantially.
A.T.C. is growing at a steady pace
Prospects and extensions
Very noisy text, such as text from O.C.R.
Speech transcripts
Google: what is the next step?
How to find the pages that match approximately the
sophisticated documents, with incorporation of user profiles
or preferences?
Look back at Google: inverted indices
Construction of indices for the sophisticated documents,
with incorporation of user profiles or preferences
Similarity search of such pages using such indices
Text mining, natural language processing and
information extraction: An Introduction
Text categorization methods
Mining Web linkage structures
Based on the slides by Deng Cai
Summary
Background on Web Search
VIPS (VIsion-based Page Segmentation)
Block-based Web Search
Block-based Link Analysis
Web Image Search & Clustering
Search Engine: Two Rank Functions
[Figure: search engine architecture. Web pages feed an indexer (inverted index), an anchor text generator (backward link, anchor text), and a web graph constructor (web topology graph). The search component combines two rank functions: relevance ranking (similarity based on content or text) and importance ranking (link analysis: ranking based on link structure analysis)]
Relevance Ranking
Inverted index
A data structure for supporting text queries
like index in a book
[Figure: indexing turns disks with documents into an inverted index]
aalborg    3452, 11437, ...
...
arm        4, 19, 29, 98, 143, ...
armada     145, 457, 789, ...
armadillo  678, 2134, 3970, ...
armani     90, 256, 372, 511, ...
...
zz         602, 1189, 3209, ...
The PageRank Algorithm
Basic idea
significance of a page is determined by
the significance of the pages linking to it
[Figure: link graph with pages labeled Importance = Low, Med, High]
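The basic idea, that a page is significant if significant pages link to it, can be turned into a short power-iteration sketch. The damping factor of 0.85, the iteration count, the handling of pages without out-links, and the four-page graph are all assumptions of this example; the slides state only the basic idea.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Page without out-links: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: A, B, and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C
```

Page A ends up ranked above B and D although each has the same number of in-links or fewer, because its single in-link comes from the highly ranked page C.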
Problems of treating a web page as an atomic unit
Web page usually contains not only pure content
Noise: navigation, decoration, interaction, ...
Multiple topics
Different parts of a page are not equally important
Web page has internal structure
Two-dimension logical structure & Visual layout
presentation
> Free text document
< Structured document
Layout: the 3rd dimension of Web page
1st dimension: content
2nd dimension: hyperlink
Is DOM a Good Representation of Page Structure?
Page segmentation using DOM
Extract structural tags such as P, TABLE, UL,
TITLE, H1~H6, etc.
DOM is more related to content display, does not
necessarily reflect semantic structure
How about XML?
A long way to go to replace the HTML
A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for
each block
Shows the intra-coherence of the block
DoC of a child block must be no less than
its parent's
The Permitted Degree of Coherence (PDoC)
can be pre-defined to achieve different
granularities for the content structure
The segmentation will stop only when all
the blocks' DoC is no less than PDoC
The smaller the PDoC, the coarser the
content structure would be
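The stopping rule based on DoC and PDoC can be sketched as a recursive walk over a block tree. This illustrates the stopping rule only, not the VIPS algorithm itself: the Block class, the numeric DoC values, and the block names are hypothetical, and the visual cues VIPS uses to find blocks and separators are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    doc: float                                  # Degree of Coherence of this block
    children: list = field(default_factory=list)

def segment(block, pdoc):
    """Split blocks until every returned block has DoC >= PDoC (or has no children)."""
    if block.doc >= pdoc or not block.children:
        return [block.name]
    result = []
    for child in block.children:
        result.extend(segment(child, pdoc))
    return result

# Hypothetical page; a child's DoC is never less than its parent's.
page = Block("page", 1, [
    Block("nav", 8),
    Block("body", 4, [Block("article", 9), Block("ads", 7)]),
])
print(segment(page, pdoc=5))  # ['nav', 'article', 'ads']
print(segment(page, pdoc=3))  # ['nav', 'body']
```

The second call shows the last point above: a smaller PDoC stops the segmentation earlier and gives a coarser content structure.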
Index block instead of whole page
Block retrieval
Combining DocRank and BlockRank
Block query expansion
Select expansion term from relevant blocks
Dataset
26.5 million web pages
11.6 million images
Query set
45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100
results returned by the system (iFind)
Ranking method
Image search accuracy using ImageRank and
PageRank. Both of them achieved their best
results at =0.25.
Example on Image Clustering & Embedding
1710 JPG images in 1287 pages are crawled within the website
http://www.yahooligans.com/content/animals/
Six Categories
Fish
Reptile
Mammal
Figure 1. Top 8 returns of query "pluto" in Google's image search engine (a)
and AltaVista's image search engine (b)
Two different topics in the search result
A possible solution:
Cluster search results into different semantic groups
Visual Feature Based Representation
Traditional CBIR
Textual Feature Based Representation
Surrounding text in image block
Link Graph Based Representation
Image graph embedding
Clustering based on three representations
Visual feature
Hard to reflect the semantic meaning
Textual feature
Semantic
Sometimes the surrounding text is too little
Link graph:
Semantic
Many disconnected sub-graphs (too many clusters)
Two Steps:
Using texts and link information to get semantic clusters
For each cluster, using visual feature to re-organize the images
to facilitate users' browsing
Our System
Dataset
26.5 million web pages
http://dir.yahoo.com/Arts/Visual_Arts/Photography/Museums_and_Galleries/
11.6 million images
Filter out images whose ratio between width and height is greater than
5 or smaller than 1/5
Remove images whose width and height are both smaller than 60
pixels
Analyze pages and index images
VIPS: Pages -> Blocks
Surrounding texts used to index images
An illustrative example
Query "Pluto"
Top 500 results
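The two image filters described for the dataset amount to a simple predicate. A minimal sketch, using the thresholds stated above and assuming positive pixel dimensions; the function name keep_image is made up for the example.

```python
def keep_image(width, height):
    """Apply the two dataset filters: aspect ratio and minimum size (in pixels)."""
    ratio = width / height
    if ratio > 5 or ratio < 1 / 5:
        return False   # width/height ratio outside [1/5, 5]
    if width < 60 and height < 60:
        return False   # both dimensions smaller than 60 pixels
    return True

print(keep_image(800, 600))  # True
print(keep_image(600, 100))  # False (ratio 6)
print(keep_image(50, 50))    # False (both sides under 60 pixels)
print(keep_image(50, 200))   # True (only one side under 60, ratio 1/4)
```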
Clustering Using Visual Feature
Figure 5. Five clusters of search results of query "pluto" using low-level visual
feature. Each row is a cluster.
From the perspectives of color and texture, the clustering
results are quite good. Different clusters have different
colors and textures. However, from the semantic perspective,
these clusters make little sense.
Clustering Using Textual Feature
[Figure: plot with y-axis from 0 to 0.04 and x-axis from 0 to 40]
Six semantic categories are correctly
identified if we choose k = 6.
More improvement on web search can be
made by mining webpage layout structure
Leverage visual cues for web information
analysis & information extraction
Demos:
http://www.ews.uiuc.edu/~dengcai2
Papers
VIPS demo & dll
Define spatial data mining.
What is document rank in the context of text
mining?
Can we construct a spatial data warehouse?
List the two types of measures in a spatial data cube.
List the two types of multimedia indexing and retrieval
systems.
Give a note on the multimedia data cube.
What is information retrieval?
List the methods for information retrieval.
What is meant by an authoritative web page?
What is web usage mining?
Data Mining: Concepts and Techniques by
Jiawei Han and Micheline Kamber