Data Mining PDF

DATA MINING / IT0467
UNIT I
An Introduction on Data Mining and Preprocessing
December26, DataMining:Conceptsand 2
2012 h
Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Top-10 most popular data mining algorithms
Major issues in data mining
Overview of the course

Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining is the automated analysis of massive data sets

What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
Simple search and query processing
(Deductive) expert systems

Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining and Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases from bottom to top. Layers, top to bottom:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]

Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data on the order of terabytes
High dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications

Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: kinds of data to be mined
Knowledge view: kinds of knowledge to be discovered
Method view: kinds of techniques utilized
Application view: kinds of applications adapted

Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web

Data Mining Functionalities
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper → Beer [support 0.5%, confidence 75%] (correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown or missing numerical values

Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
Outlier: data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses

Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining and invisible data mining
Protection of data security, integrity, and privacy

Why Data Mining Query Language?
Automated vs. query-driven?
Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many but uninteresting
Data mining should be an interactive process
User directs what to be mined
Users must be provided with a set of primitives to be used to communicate with the data mining system
Incorporating these primitives in a data mining query language
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice

Primitives that Define a Data Mining Task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns

DMQL: A Data Mining Query Language
Motivation
A DMQL can provide the ability to support ad-hoc and interactive data mining
By providing a standardized language like SQL
Hope to achieve an effect similar to the one SQL has had on relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer, commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier

An Example Query in DMQL
[Figure: example query shown as an image in the original slide]

Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, data warehouse systems coupling
No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining of data
Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association

Coupling Data Mining with DB/DW Systems
No coupling: flat file processing, not recommended
Loose coupling
Fetching data from DB/DW
Semi-tight coupling: enhanced DM performance
Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
DM is smoothly integrated into a DB/DW system; a mining query is optimized based on mining query, indexing, query processing methods, etc.

Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine (the last two connected to a Knowledge Base); Database or Data Warehouse Server; data cleaning, integration, and selection; data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories]

Chapter: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?
Data in the real world is dirty
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = "" (empty)
Noisy: containing errors or outliers
e.g., Salary = "-10"
Inconsistent: containing discrepancies in codes or names
e.g., Age = "42", Birthday = "03/07/1997"
e.g., was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records

Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: forms of data preprocessing, shown as an image in the original slide]

Data Preprocessing
Data cleaning
Data reduction
Summary

Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
Median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube

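Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
Trimmed mean: chopping extreme values
Median: a holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
\[ \text{median} = L_1 + \left( \frac{n/2 - \left(\sum f\right)_l}{f_{\text{median}}} \right) c \]
where \(L_1\) is the lower boundary of the median interval, \(\left(\sum f\right)_l\) is the sum of the frequencies of all intervals below it, \(f_{\text{median}}\) is the frequency of the median interval, and \(c\) is its width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
\[ \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \]
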
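The measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the trimmed mean simply drops k values from each end, and the sample data is the sorted price list used later in the binning slides.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after chopping the k smallest and k largest values (assumed rule)."""
    s = sorted(xs)
    return mean(s[k:len(s) - k])

def median(xs):
    """Middle value for an odd count, average of the two middle values otherwise."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def modes(xs):
    """All values that occur most frequently (one value means unimodal)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(mean(prices), median(prices), modes(prices), trimmed_mean(prices, 1))
```

The median and mode need the whole sorted or counted data set, which is why the slide calls the median a holistic measure, while the mean can be built from partial sums.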
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves showing the relative positions of mean, median and mode]

Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]
Standard deviation s (or σ) is the square root of variance s² (or σ²)

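The "scalable computation" remark refers to the second form of the variance formula, which needs only running totals and a single pass over the data. The sketch below illustrates this under stated assumptions: `sample_variance_one_pass` is a hypothetical name, and the single-pass form can lose precision through cancellation when the values are very large relative to their spread.

```python
import math
import statistics

def sample_variance_one_pass(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1), using running totals only."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
s2 = sample_variance_one_pass(prices)
# Compare with the standard library's definition-based sample variance.
print(s2, statistics.variance(prices), math.sqrt(s2))
```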
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration

Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
Equipment malfunction
Data that was inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data not being considered important at the time of entry
History or changes of the data not being registered
Missing data may need to be inferred

How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
A global constant: e.g., "unknown", a new class?!
The attribute mean
The attribute mean for all samples belonging to the same class: smarter
The most probable value: inference-based, such as a Bayesian formula or a decision tree

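Two of the automatic strategies above, the attribute mean and the per-class attribute mean, can be sketched as follows. The records, attribute names (`income`, `risk`) and function names are hypothetical, chosen only for illustration; the class-mean version assumes every class has at least one known value.

```python
def fill_with_mean(rows, attr):
    """Replace missing (None) values of attr with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    m = sum(known) / len(known)
    return [dict(r, **{attr: m}) if r[attr] is None else dict(r) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean over rows of the same class."""
    totals = {}  # class label -> (sum of known values, count of known values)
    for r in rows:
        if r[attr] is not None:
            s, c = totals.get(r[label], (0.0, 0))
            totals[r[label]] = (s + r[attr], c + 1)
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[attr] is None:
            s, c = totals[r[label]]
            r[attr] = s / c
        filled.append(r)
    return filled

# Hypothetical customer records with two missing incomes.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
print(by_mean[1]["income"], by_class[1]["income"], by_class[4]["income"])
```

With these records the global mean pulls the missing low-risk income toward the single high earner, while the class mean stays within the group, which is the sense in which the slide calls the class-based fill "smarter".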
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
Faulty data collection instruments
Data entry problems
Data transmission problems
Technology limitation
Inconsistency in naming convention
Other data problems which require data cleaning
Duplicate records
Incomplete data
Inconsistent data

How to Handle Noisy Data?
Binning
First sort data and partition into (equal-frequency) bins
Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
Smooth by fitting the data into regression functions
Clustering
Detect and remove outliers
Combined computer and human inspection
Detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B − A)/N
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34

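The worked example above can be reproduced with a short Python sketch. The function names are illustrative, the bin means are rounded to whole dollars as on the slide, and values equidistant from both boundaries go to the lower one, which is an assumption the slide does not state.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.
    Assumes len(sorted_values) is divisible by n_bins."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```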
Regression
[Figure: regression line y = x + 1 with axes x and y; the point at X1 has observed value Y1 and fitted value Y1']

Cluster Analysis
[Figure: clusters of data points, shown as an image in the original slide]

Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)

Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
Redundant data occur often when multiple databases are integrated
Object identification: the same attribute or object may have different names in different databases
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

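Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient)
\[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]
where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B, and \(\sum (AB)\) is the sum of the AB cross-product
If \(r_{A,B} > 0\), A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
\(r_{A,B} = 0\): no linear correlation (independent attributes have r = 0, but r = 0 alone does not prove independence); \(r_{A,B} < 0\): negatively correlated
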
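A minimal sketch of the second form of the formula, using only the standard library. `pearson_r` is an illustrative name, the standard deviations use the n − 1 denominator so that they match the (n − 1) factor in the formula, and the two small attribute lists are made up for the example.

```python
import math

def pearson_r(a, b):
    """r = (sum(a_i * b_i) - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Hypothetical attribute values: b is exactly 2 * a, c decreases as a increases.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
c = [10, 8, 6, 4, 2]
print(pearson_r(a, b), pearson_r(a, c))
```

A value close to +1 or −1 between two attributes suggests that one of them may be redundant for mining purposes.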
Correlation Analysis (Categorical Data)
χ² (chi-square) test
\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
The number of hospitals and the number of car thefts in a city are correlated
Both are causally linked to a third variable: population

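The statistic can be computed from a contingency table of observed counts, with each expected count taken as (row total × column total) / grand total. The sketch below is illustrative only: `chi_square` is a made-up name and the 2 × 2 table of counts is hypothetical data, not taken from these slides.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = value of attribute X, columns = value of attribute Y.
observed = [[250, 200],
            [50, 1000]]
stat = chi_square(observed)
print(round(stat, 2))
```

Whether a given value is significant still has to be judged against the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom, which this sketch does not do.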
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones

Data Transformation: Normalization
Min-max normalization to [new_min_A, new_max_A]:
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling:
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that \(\max(|v'|) < 1\)

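The three normalizations can be checked against the slide's numbers with a short sketch. The function names are illustrative, and the decimal-scaling helper starts its search at j = 0, which is an assumption (it never scales values up); the two-value list passed to it is made-up data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, sd_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / sd_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max(73600, 12000, 98000), 3))  # income example from the slide
print(z_score(73600, 54000, 16000))            # z-score example from the slide
print(decimal_scaling([-986, 917]))            # hypothetical values
```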
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction, e.g., remove unimportant attributes
Data compression
Numerosity reduction, e.g., fit data into models

Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible

Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
Fewer attributes appear in the discovered patterns, which makes the patterns easier to understand
Heuristic methods (due to exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction

Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with root test A4?; its two branches lead to tests A1? and A6?, each of which has leaves Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
There are 2^d possible feature subsets of d features
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Best step-wise feature selection:
The best single feature is picked first
Then the next best feature conditioned on the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
Optimal branch and bound:
Use feature elimination and backtracking

Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and varies slowly with time

Data Compression
[Figure: original data mapped to compressed data; lossless compression recovers the original data, lossy compression recovers an approximation of it]

Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data
Steps
Normalize input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance" or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large

Principal Component Analysis
[Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2 overlaid]

Data Reduction Method (1): Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
Apply the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: \( p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd} \)

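For the single-variable case Y = wX + b, the least-squares estimates have a closed form: w is the sum of cross-deviations divided by the sum of squared X-deviations, and b follows from the means. The sketch below assumes that closed form; `fit_line` and the sample points are illustrative, not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b for the model Y = w * X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero: xs not all equal
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical noisy observations scattered around a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
w, b = fit_line(xs, ys)
print(w, b)
```

For data reduction, only the two coefficients need to be stored in place of the raw (X, Y) pairs, at the cost of the approximation error.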
Data Reduction Method (2): Histograms
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
V-optimal: the histogram with the least variance (weighted sum of the variances of the original values that each bucket represents)
MaxDiff: set a bucket boundary between each pair of adjacent values for the pairs having the β − 1 largest differences, where β is the number of buckets
[Figure: histogram; x-axis from 10000 to 90000, y-axis (count) from 0 to 40]

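A minimal sketch of the equal-width rule: each bucket keeps only a count and a sum, from which the bucket average can be derived. The function name and the dictionary layout are illustrative choices, the data is the price list from the binning slides, and the code assumes the values are not all identical.

```python
def equal_width_histogram(values, n_buckets):
    """Build n_buckets equal-width buckets, storing count and sum for each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets  # assumes hi > lo
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(n_buckets)
    ]
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
hist = equal_width_histogram(prices, 3)
for bucket in hist:
    print(bucket, bucket["sum"] / bucket["count"])
```

With three buckets the twelve prices reduce to three (count, sum) pairs; note how the equal-width rule leaves the buckets unevenly filled, unlike the equal-frequency bins.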
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is "smeared"
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 7

Data Reduction Method (4): Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Note: Sampling may not reduce database I/Os (page at a time)

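Stratified sampling can be sketched as: group the rows by class, then draw the same fraction from each group without replacement. Everything named here (`stratified_sample`, the `label` field, the 5% "fraud" class) is hypothetical and only illustrates the idea on skewed data; keeping at least one row per stratum is a design choice of this sketch.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw about `fraction` of the rows from every stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data set: 50 "fraud" rows among 1000.
rows = [{"id": i, "label": "fraud" if i < 50 else "normal"} for i in range(1000)]
sample = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)
print(Counter(r["label"] for r in sample))
```

A simple random sample of the same size could easily contain very few rows of the rare class, which is the poor behavior under skew that the slide warns about.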
Sampling: with or without Replacement
[Figure: samples drawn from the raw data with and without replacement]

Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis

Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods (all the methods can be applied recursively):
Binning (covered above)
Top-down split, unsupervised
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: supervised (it uses class labels), bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised

Example of 3-4-5 Rule
Step 1: profit values: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is split into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: the range is adjusted to (-$400 - $5,000) and split into (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000)
Each interval is then split further:
(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

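The distinct-value heuristic amounts to sorting the attributes by their number of distinct values. The sketch below uses the counts from the slide; `order_by_distinct_values` is an illustrative name, and in practice the counts would come from the data (for example, from the size of the set of values in each column).

```python
def order_by_distinct_values(distinct_counts):
    """Return attribute names from most general (fewest distinct values)
    to most specific (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)

# Distinct-value counts as given on the slide.
counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
top_down = order_by_distinct_values(counts)
print(" < ".join(reversed(top_down)))
```

As the slide notes, the heuristic has exceptions: a year attribute can have more distinct values than month or weekday, yet it belongs above them in the hierarchy, so the generated order should be reviewed.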
Summary
Data preparation or preprocessing is a big issue for both data warehousing and data mining
Descriptive data summarization is needed for quality data preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but data preprocessing is still an active area of research

Review Questions
How is a data warehouse different from a database? How are they similar?
List the five primitives for specifying a data mining task.
State the data mining functionalities.
List the classifications of data mining systems.
Write a note on data mining query languages.
Describe the steps involved in data mining when viewed as a process of knowledge discovery.
State the various kinds of frequent patterns.
Give an example of a multidimensional association rule.
State the need for outlier analysis.
Are all of the patterns interesting? Justify.
What are the possible schemes for integrating a data mining system with a database or data warehouse system?

Bibliography
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.

UNIT II

Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
\[ \binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30} \]
sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of frequent patterns
Reducing the number of patterns and rules

Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
Min_sup = 1.
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all patterns?
!!

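The exercise can be checked by brute force on a scaled-down analogue, since enumerating 2^100 − 1 itemsets is not feasible. The sketch below replaces <a1, …, a100> and <a1, …, a50> with <a1, …, a4> and <a1, a2>; this substitution and all function names are assumptions made for illustration, and the enumeration is exponential, so it only suits toy data.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Enumerate every non-empty itemset and keep those with support >= min_sup."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            support = sum(1 for t in db if itemset <= t)
            if support >= min_sup:
                result[itemset] = support
    return result

def closed_patterns(freq):
    """Frequent itemsets with no proper superset of the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: s for x, s in freq.items() if not any(x < y for y in freq)}

# Scaled-down stand-in for the exercise database.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = frequent_itemsets(db, min_sup=1)
print(len(freq), closed_patterns(freq), max_patterns(freq))
```

On this small database there are 15 frequent itemsets but only two closed ones and one max-pattern, mirroring the answers on the slide.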
Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @ VLDB'94)
Frequent pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94, Mannila, et al. @ KDD'94)
Method:
Initially, scan DB once to get frequent 1-itemsets
Generate length (k+1) candidate itemsets from length k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example
Sup_min = 2
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
1st scan, C1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (itemsets): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 (itemset: sup): {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2 (itemset: sup): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
C3 (itemsets): {B, C, E}
3rd scan, L3 (itemset: sup): {B, C, E}: 2

The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

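The pseudo-code can be turned into a small runnable sketch. This is one possible rendering, not the authors' implementation: candidates are built by taking unions of two frequent k-itemsets and pruning by the Apriori principle (a simplification of the ordered self-join), and support is counted by a plain scan rather than a hash tree. The transactions are the TDB example from the preceding slide.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Generate C(k+1): unions of size k+1 whose k-subsets are all frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Scan the database to count each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```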
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of candidate generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}

How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

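The join and prune steps can be sketched directly on the L3 example. Itemsets are represented as sorted tuples so that "items listed in an order" holds; `generate_candidates` is an illustrative name, and it returns both the joined and the pruned lists so the effect of pruning is visible.

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """Self-join L(k-1) on the first k-2 items, then prune by the Apriori principle."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    k = len(prev[0]) + 1 if prev else 0
    # Step 1: join p and q when they agree on all but the last item and
    # p's last item precedes q's last item.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Step 2: drop any candidate that has an infrequent (k-1)-subset.
    pruned = [
        c for c in joined
        if all(sub in prev_set for sub in combinations(c, k - 1))
    ]
    return joined, pruned

l3 = ["abc", "abd", "acd", "ace", "bcd"]
joined, c4 = generate_candidates(l3)
print(["".join(c) for c in joined])
print(["".join(c) for c in c4])
```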
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction

Example: Counting Supports of Candidates
[Figure: hash tree of candidate 3-itemsets. Interior nodes hash items into the branches 1,4,7 / 2,5,8 / 3,6,9. The subset function splits the transaction 1 2 3 5 6 into 1+2356, 12+356 and 13+56 as it descends. Leaf itemsets shown: 145, 124, 457, 125, 458, 159, 136, 234, 567, 345, 356, 357, 689, 367, 368]

EfficientImplementationofAprioriinSQL
HardtogetgoodperformanceoutofpureSQL(SQL92)
basedapproachesalone
MakeuseofobjectrelationalextensionslikeUDFs,BLOBs,
Tablefunctionsetc.
Getordersofmagnitudeimprovement
S.Sarawagi,S.Thomas,andR.Agrawal.Integrating
associationruleminingwithrelationaldatabasesystems:
Alternativesandimplications.InSIGMOD98
2012 h
ChallengesofFrequentPatternMining
Challenges
Multiplescansoftransactiondatabase
Hugenumberofcandidates
Tediousworkloadofsupportcountingforcandidates
ImprovingApriori:generalideas
Reducepassesoftransactiondatabasescans
Shrinknumberofcandidates
Facilitatesupportcountingofcandidates
2012 h
Partition:ScanDatabaseOnlyTwice
AnyitemsetthatispotentiallyfrequentinDBmustbe
frequentinatleastoneofthepartitionsofDB
Scan1:partitiondatabaseandfindlocalfrequentpatterns
Scan2:consolidateglobalfrequentpatterns
A.Savasere,E.Omiecinski,andS.Navathe.Anefficient
algorithmforminingassociationinlargedatabases.In
VLDB95
2012 h
Sampling for Frequent Patterns
Select a sample of original database, mine frequent patterns within sample using Apriori
Scan database once to verify frequent itemsets found in sample; only borders of closure of frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules. In VLDB'96
2012 h
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset i1 i2 ... i100
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
Bottleneck: candidate generation and test
Can we avoid candidate generation?
2012 h
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items
"abc" is a frequent pattern
Get all transactions having "abc": DB|abc
"d" is a local frequent item in DB|abc, so "abcd" is a frequent pattern
2012 h
Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, giving the f-list
3. Scan DB again, construct the FP-tree

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
F-list = f-c-a-b-m-p

FP-tree (each node is item:count, children indented under their parent):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
2012 h
BenefitsoftheFPtreeStructure
Completeness
Preservecompleteinformationforfrequentpatternmining
Neverbreakalongpatternofanytransaction
Compactness
Reduceirrelevantinfoinfrequentitemsaregone
Itemsinfrequencydescendingorder:themorefrequently
occurring,themorelikelytobeshared
Neverbelargerthantheoriginaldatabase(notcountnode
linksandthecount field)
ForConnect4DB,compressionratiocouldbeover100
2012 h
Find Patterns Having P From P-conditional Database
Starting at the frequent-item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
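Because every transaction enters the FP-tree as a path in f-list order, the prefix paths of an item in the tree match the prefixes that precede it in the ordered transactions. The sketch below uses that shortcut and skips the tree entirely; the function name `conditional_pattern_base` and the string encoding of paths are assumptions made for the illustration.

```python
from collections import Counter

# Ordered frequent items per transaction, copied from the FP-tree slide.
ordered = [
    list("fcamp"),
    list("fcabm"),
    list("fb"),
    list("cbp"),
    list("fcamp"),
]

def conditional_pattern_base(item, ordered_transactions):
    """Count the prefix that precedes `item` in each ordered transaction."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = "".join(t[: t.index(item)])
            if prefix:  # an empty prefix contributes no path
                base[prefix] += 1
    return dict(base)

m_base = conditional_pattern_base("m", ordered)
p_base = conditional_pattern_base("p", ordered)
```

A real FP-growth implementation would follow the node links in the tree instead, which avoids rescanning the transactions.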

2012 h
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns

2012 h
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multi-level mining (Agrawal & Srikant @ VLDB'95, Han & Fu @ VLDB'95)

Example hierarchy: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 4%] at level 2

           uniform support   reduced support
Level 1    min_sup = 5%      min_sup = 5%
Level 2    min_sup = 5%      min_sup = 3%

2012 h
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items.
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

2012 h
Mining Multi-Dimensional Association
Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values; handled by the data cube approach
Quantitative attributes: numeric, implicit ordering among values; handled by discretization, clustering, and gradient approaches
2012 h
MiningQuantitativeAssociations
Techniquescanbecategorizedbyhownumericalattributes,
suchasageor salary aretreated
1. Staticdiscretizationbasedonpredefinedconcepthierarchies
(datacubemethods)
2. Dynamicdiscretizationbasedondatadistribution
(quantitativerules,e.g.,Agrawal&Srikant@SIGMOD96)
3. Clustering:Distancebasedassociation(e.g.,Yang&
Miller@SIGMOD97)
onedimensionalclusteringthenassociation
4. Deviation:(suchasAumannandLindell@KDD99)
Sex=female=>Wage:mean=$7/hr(overallmean=$9)

2012 h
Quantitative Association Rules
Proposed by Lent, Swami and Widom, ICDE'97
Numeric attributes are dynamically discretized
such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
Example
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

2012 h
MiningOtherInterestingPatterns
Flexiblesupportconstraints(Wangetal.@VLDB02)
Someitems(e.g.,diamond)mayoccurrarelybutare
valuable
Customizedsupminspecificationandapplication
TopKclosedfrequentpatterns(Han,etal.@ICDM02)
Hardtospecifysupmin,buttopk withlengthminismore
desirable
Dynamicallyraisesupmin inFPtreeconstructionandmining,
andselectmostpromisingpathtomine

2012 h
Interestingness Measure: Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
Measure of dependent/correlated events: lift
$$\mathrm{lift}(A, B) = \frac{P(A \cap B)}{P(A)\,P(B)}$$

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

$$\mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000)(3750/5000)} = 0.89$$
$$\mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000)(1250/5000)} = 1.33$$
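The two lift values can be checked directly from the contingency counts. The helper below is only a sketch; the function name `lift` and its argument order are assumptions for the illustration.

```python
def lift(n_both, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))

# Counts taken from the basketball/cereal table.
basketball_cereal = lift(2000, 3000, 3750, 5000)
basketball_not_cereal = lift(1000, 3000, 1250, 5000)
```

A lift below 1 indicates negative correlation (basketball and cereal), and a lift above 1 indicates positive correlation (basketball and not cereal).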

2012 h
Chapter5:MiningFrequentPatterns,Associationand
Correlations
Efficientandscalablefrequentitemsetmining
methods
Fromassociationminingtocorrelationanalysis
Summary
2012 h
Constraintbased(QueryDirected)Mining
Findingall thepatternsinadatabaseautonomously?
unrealistic!
Thepatternscouldbetoomanybutnotfocused!
Dataminingshouldbeaninteractiveprocess
Userdirectswhattobeminedusingadataminingquery
language(oragraphicaluserinterface)
Constraintbasedmining
Userflexibility:provides constraints onwhattobemined
Systemoptimization:exploressuchconstraintsforefficient
miningconstraintbasedmining

2012 h
Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint, using SQL-like queries
find product pairs sold together in stores in Chicago in Dec. '02
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence ≥ 60%

2012 h
ConstrainedMiningvs.ConstraintBasedSearch
Constrainedminingvs.constraintbasedsearch/reasoning
Bothareaimedatreducingsearchspace
Findingallpatterns satisfyingconstraintsvs.findingsome(or
one)answer inconstraintbasedsearchinAI
Constraintpushing vs.heuristicsearch
Itisaninterestingresearchproblemonhowtointegrate
them
Constrainedminingvs.queryprocessinginDBMS
Databasequeryprocessingrequirestofindall
Constrainedpatternminingsharesasimilarphilosophyas
pushingselectionsdeeplyinqueryprocessing
2012 h
The Apriori Algorithm: Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D, C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D, C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (from L2): {2 3 5}
Scan D, L3: {2 3 5}: 2
2012 h
Naïve Algorithm: Apriori + Constraint
Same database D and the same Apriori run as in the preceding example (C1, L1, C2, L2, C3, L3 = {2 3 5}: 2), with a constraint applied to the itemsets.
Constraint: Sum{S.price} < 5
2012 h
FrequentPatternMining:Summary
Frequentpatternmininganimportanttaskindatamining
Scalablefrequentpatternminingmethods
Apriori(Candidategeneration&test)
Projectionbased(FPgrowth,CLOSET+,...)
Verticalformatapproach(CHARM,...)
Miningavarietyofrulesandinterestingpatterns
Constraintbasedmining
Miningsequentialandstructuredpatterns
Extensionsandapplications
2012 h
ClusterAnalysis
1. WhatisClusterAnalysis?
2. TypesofDatainClusterAnalysis
3. ACategorizationofMajorClusteringMethods
4. PartitioningMethods
5. HierarchicalMethods
6. DensityBasedMethods
7. GridBasedMethods
8. ModelBasedMethods
9. ClusteringHighDimensionalData
10. ConstraintBasedClustering
11. OutlierAnalysis
12. Summary
2012 h
WhatisClusterAnalysis?
Cluster:acollectionofdataobjects
Similartooneanotherwithinthesamecluster
Dissimilartotheobjectsinotherclusters
Clusteranalysis
Findingsimilaritiesbetweendataaccordingtothe
characteristicsfoundinthedataandgroupingsimilardata
objectsintoclusters
Unsupervisedlearning:nopredefinedclasses
Typicalapplications
Asastandalonetool togetinsightintodatadistribution
Asapreprocessingstep forotheralgorithms
2012 h
Clustering:RichApplicationsand
MultidisciplinaryEfforts
PatternRecognition
SpatialDataAnalysis
CreatethematicmapsinGISbyclusteringfeaturespaces
Detectspatialclustersorforotherspatialminingtasks
ImageProcessing
EconomicScience(especiallymarketresearch)
WWW
Documentclassification
ClusterWeblogdatatodiscovergroupsofsimilaraccess
patterns

2012 h
ExamplesofClusteringApplications
Marketing: Helpmarketersdiscoverdistinctgroupsintheircustomerbases,
andthenusethisknowledgetodeveloptargetedmarketingprograms
Landuse: Identificationofareasofsimilarlanduseinanearthobservation
database
Insurance: Identifyinggroupsofmotorinsurancepolicyholderswithahigh
averageclaimcost
Cityplanning: Identifyinggroupsofhousesaccordingtotheirhousetype,
value,andgeographicallocation
Earthquakestudies: Observedearthquakeepicentersshouldbeclustered
alongcontinentfaults

2012 h
Quality:WhatIsGoodClustering?
Agoodclustering methodwillproducehighqualityclusters
with
highintraclass similarity
lowinterclass similarity
Thequality ofaclusteringresultdependsonboththesimilarity
measureusedbythemethodanditsimplementation
Thequality ofaclusteringmethodisalsomeasuredbyits
abilitytodiscoversomeorallofthehidden patterns

2012 h
MeasuretheQualityofClustering
Dissimilarity/Similaritymetric:Similarityisexpressedinterms
ofadistancefunction,typicallymetric:d(i,j)
Thereisaseparatequalityfunctionthatmeasuresthe
goodnessofacluster.
Thedefinitionsofdistancefunctions areusuallyverydifferent
forintervalscaled,boolean,categorical,ordinalratio,and
vectorvariables.
Weightsshouldbeassociatedwithdifferentvariablesbasedon
applicationsanddatasemantics.
Itishardtodefinesimilarenoughorgoodenough
theansweristypicallyhighlysubjective.
2012 h
RequirementsofClusteringinDataMining
Scalability
Abilitytodealwithdifferenttypesofattributes
Abilitytohandledynamicdata
Discoveryofclusterswitharbitraryshape
Minimalrequirementsfordomainknowledgetodetermine
inputparameters
Abletodealwithnoiseandoutliers
Insensitivetoorderofinputrecords
Highdimensionality
Incorporationofuserspecifiedconstraints
Interpretabilityandusability

2012 h
DataStructures
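Data matrix (two modes):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix (one mode):
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$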

2012 h
Typeofdatainclusteringanalysis
Intervalscaledvariables
Binaryvariables
Nominal,ordinal,andratiovariables
Variablesofmixedtypes

2012 h
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$.
Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using mean absolute deviation is more robust than using standard deviation
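A small sketch of this standardization for one variable f is given below. The function name `standardize` and the sample values are assumptions for the illustration.

```python
def standardize(values):
    """Return z-scores using the mean absolute deviation s_f as the scale.

    Assumes the values are not all equal (otherwise s_f is 0).
    """
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
```

Here the mean is 5 and the mean absolute deviation is 2, so the four values map to -1.5, -0.5, 0.5 and 1.5.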
2012 h
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

2012 h
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
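The Minkowski family can be written as one short function, with Manhattan and Euclidean distance as the q = 1 and q = 2 cases. The function name `minkowski` and the sample points are assumptions for the illustration.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two equal-length sequences."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

x, y = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(x, y, 1)  # q = 1: |1-4| + |2-6|
euclidean = minkowski(x, y, 2)  # q = 2: sqrt(3^2 + 4^2)
```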

2012 h
Binary Variables
A contingency table for binary data:

                 Object j
                 1       0       sum
Object i   1     a       b       a + b
           0     c       d       c + d
           sum   a + c   b + d   p

Distance measure for symmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$$
2012 h
Dissimilarity between Binary Variables
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
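The three distances can be reproduced with a short function for the asymmetric binary measure. The function name and the list encoding are assumptions for the illustration; the 0/1 vectors follow the slide's mapping (Y and P to 1, N to 0) over the six asymmetric attributes.

```python
def asymmetric_binary_distance(x, y):
    """d = (b + c) / (a + b + c); negative matches (d) are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4 (gender is symmetric and left out).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]
```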
2012 h
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states

2012 h
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval-scaled variables
2012 h
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
apply logarithmic transformation
$y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their rank as interval-scaled

2012 h
Variables of Mixed Types
A database may contain all the six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled
compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
and treat $z_{if}$ as interval-scaled
2012 h
VectorObjects
Vectorobjects:keywordsindocuments,gene
featuresinmicroarrays,etc.
Broadapplications:informationretrieval,
biologictaxonomy,etc.
Cosinemeasure
Avariant:Tanimotocoefficient
2012 h
MajorClusteringApproaches(I)
Partitioningapproach:
Constructvariouspartitionsandthenevaluatethembysomecriterion,e.g.,
minimizingthesumofsquareerrors
Typicalmethods:kmeans,kmedoids,CLARANS
Hierarchicalapproach:
Createahierarchicaldecompositionofthesetofdata(orobjects)usingsome
criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Densitybasedapproach:
Basedonconnectivityanddensityfunctions
Typical methods: DBSCAN, OPTICS, DENCLUE

2012 h
MajorClusteringApproaches(II)
Gridbasedapproach:
basedonamultiplelevelgranularitystructure
Typicalmethods:STING,WaveCluster,CLIQUE
Modelbased:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
Typicalmethods: EM,SOM,COBWEB
Frequentpatternbased:
Basedontheanalysisoffrequentpatterns
Typicalmethods:pCluster
Userguidedorconstraintbased:
Clusteringbyconsideringuserspecifiedorapplicationspecificconstraints
Typicalmethods:COD(obstacles),constrainedclustering

2012 h
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum of squared distance
$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen'67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster

2012 h
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when no more new assignment
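The four steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a tuned implementation: the initial partition is a simple round-robin split, squared Euclidean distance is used for "nearest", and the names `kmeans` and `max_iter` are chosen for the example.

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Step 1 (assumption): round-robin initial partition into k nonempty subsets.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 3: reassign each object to the nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
```

On these six points the loop separates the three points near (1, 1) from the three near (8, 8) after one reassignment. Different initial partitions can lead to different local optima.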

2012 h
The K-Means Clustering Method
Example
[Figure: k-means with K = 2 on scatter plots with both axes running from 0 to 10. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

2012 h
Comments on the K-Means Method
Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weakness
Applicable only when the mean is defined; then what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

2012 h
VariationsoftheKMeans Method
Afewvariantsofthekmeans whichdifferin
Selectionoftheinitialk means
Dissimilaritycalculations
Strategiestocalculateclustermeans
Handlingcategoricaldata:kmodes (Huang98)
Replacingmeansofclusterswithmodes
Usingnewdissimilaritymeasurestodealwithcategoricalobjects
Usingafrequencybasedmethodtoupdatemodesofclusters
Amixtureofcategoricalandnumericaldata:kprototype method

2012 h
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: two scatter plots with both axes running from 0 to 10.]

2012 h
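Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: objects a, b, c, d, e. Agglomerative (AGNES), Step 0 to Step 4, left to right: a and b merge into ab, d and e into de, c and de into cde, and ab and cde into abcde. Divisive (DIANA), Step 0 to Step 4, right to left: abcde is split back down to the single objects.]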
2012 h
DensityBasedClusteringMethods
Clusteringbasedondensity(localclustercriterion),suchas
densityconnectedpoints
Majorfeatures:
Discoverclustersofarbitraryshape
Handlenoise
Onescan
Needdensityparametersasterminationcondition
Severalinterestingstudies:
DBSCAN: Ester,etal.(KDD96)
OPTICS:Ankerst,etal(SIGMOD99).
DENCLUE:Hinneburg&D.Keim(KDD98)
CLIQUE:Agrawal,etal.(SIGMOD98)(moregridbased)
2012 h
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to N_Eps(q), and
q satisfies the core point condition: |N_Eps(q)| >= MinPts
[Figure: point p inside the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm]
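These definitions translate almost literally into code. The sketch below is an illustration only: the function names and sample points are assumptions, points are plain tuples, and distance is Euclidean via `math.dist` (Python 3.8 or later).

```python
from math import dist

def eps_neighbourhood(p, data, eps):
    """N_Eps(p): all points q in data with dist(p, q) <= eps (p itself included)."""
    return [q for q in data if dist(p, q) <= eps]

def directly_density_reachable(p, q, data, eps, min_pts):
    """True if p lies in N_Eps(q) and q satisfies the core point condition."""
    neighbours = eps_neighbourhood(q, data, eps)
    return p in neighbours and len(neighbours) >= min_pts

data = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (3, 3)]
core = (0, 0)
```

With Eps = 1 and MinPts = 5, the point (0.5, 0) is directly density-reachable from the core point (0, 0), while the isolated point (3, 3) is neither reachable from it nor a core point itself.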

2012 h
Grid-Based Clustering Method
Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
A multi-resolution clustering approach using wavelet method
CLIQUE: Agrawal, et al. (SIGMOD'98)
On high-dimensional data (thus put in the section of clustering high-dimensional data)

2012 h
ModelBasedClustering
Whatismodelbasedclustering?
Attempttooptimizethefitbetweenthegivendataandsome
mathematicalmodel
Basedontheassumption:Dataaregeneratedbyamixtureof
underlyingprobabilitydistribution
Typicalmethods
Statisticalapproach
EM(Expectationmaximization),AutoClass
Machinelearningapproach
COBWEB,CLASSIT
Neuralnetworkapproach
SOM(SelfOrganizingFeatureMap)
2012 h
SelfOrganizingFeatureMap(SOM)
SOMs,alsocalledtopologicalorderedmaps,orKohonenSelfOrganizing
FeatureMap(KSOMs)
Itmapsallthepointsinahighdimensionalsourcespaceintoa2to3dtarget
space,s.t.,thedistanceandproximityrelationship(i.e.,topology)are
preservedasmuchaspossible
Similartokmeans:clustercenterstendtolieinalowdimensionalmanifoldin
thefeaturespace
Clusteringisperformedbyhavingseveralunitscompetingforthecurrent
object
Theunitwhoseweightvectorisclosesttothecurrentobjectwins
Thewinneranditsneighborslearnbyhavingtheirweightsadjusted
SOMsarebelievedtoresembleprocessingthatcanoccurinthebrain
Usefulforvisualizinghighdimensionaldatain2 or3Dspace

2012 h
ClusteringHighDimensionalData
Clusteringhighdimensionaldata
Manyapplications:textdocuments,DNAmicroarraydata
Majorchallenges:
Manyirrelevantdimensionsmaymaskclusters
Distancemeasurebecomesmeaninglessduetoequidistance
Clustersmayexistonlyinsomesubspaces
Methods
Featuretransformation:onlyeffectiveifmostdimensionsarerelevant
PCA&SVDusefulonlywhenfeaturesarehighlycorrelated/redundant
Featureselection:wrapperorfilterapproaches
usefultofindasubspacewherethedatahaveniceclusters
Subspaceclustering:findclustersinallthepossiblesubspaces
CLIQUE,ProClus,andfrequentpatternbasedclustering
2012 h
CLIQUE(ClusteringInQUEst)
Agrawal,Gehrke,Gunopulos,Raghavan(SIGMOD98)
Automaticallyidentifyingsubspacesofahighdimensionaldataspacethat
allowbetterclusteringthanoriginalspace
CLIQUEcanbeconsideredasbothdensitybasedandgridbased
Itpartitionseachdimensionintothesamenumberofequallengthinterval
Itpartitionsanmdimensionaldataspaceintononoverlappingrectangular
units
Aunitisdenseifthefractionoftotaldatapointscontainedintheunit
exceedstheinputmodelparameter
Aclusterisamaximalsetofconnecteddenseunitswithinasubspace

2012 h
CLIQUE:TheMajorSteps
Partitionthedataspaceandfindthenumberofpointsthatlie
insideeachcellofthepartition.
IdentifythesubspacesthatcontainclustersusingtheApriori
principle
Identifyclusters
Determinedenseunitsinallsubspacesofinterests
Determineconnecteddenseunitsinallsubspacesof
interests.
Generateminimaldescriptionfortheclusters
Determinemaximalregionsthatcoveraclusterofconnected
denseunitsforeachcluster
Determinationofminimalcoverforeachcluster

2012 h
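[Figure: CLIQUE example. Two grids, Salary (10,000) vs. age and Vacation (week) vs. age, with age from 20 to 60 and the other axis from 0 to 7; the dense regions are intersected over age 30 to 50 in the Vacation vs. age view. Threshold parameter = 3.]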

2012 h
StrengthandWeaknessofCLIQUE
Strength
automatically findssubspacesofthe highestdimensionality
suchthathighdensityclustersexistinthosesubspaces
insensitive totheorderofrecordsininputanddoesnot
presumesomecanonicaldatadistribution
scales linearly withthesizeofinputandhasgoodscalability
asthenumberofdimensionsinthedataincreases
Weakness
Theaccuracyoftheclusteringresultmaybedegradedatthe
expenseofsimplicityofthemethod

2012 h
WhyConstraintBasedClusterAnalysis?
Needuserfeedback:Usersknowtheirapplicationsthebest
Lessparametersbutmoreuserdesiredconstraints,e.g.,anATM
allocationproblem:obstacle&desiredclusters

2012 h
What Is Outlier Discovery?
What are outliers?
The set of objects that are considerably dissimilar from the remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

2012 h
Outlier Discovery: Statistical Approaches
Assume a model of the underlying distribution that generates the data set (e.g. normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for a single attribute
In many cases, data distribution may not be known
2012 h
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by statistical methods
We need multi-dimensional analysis without knowing data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
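The DB(p, D) definition can be checked with a straightforward nested loop. The sketch below is a naive quadratic version for illustration, not the block-oriented nested-loop algorithm from the literature; the function name `db_outliers` is an assumption, distance is Euclidean, and the fraction is taken over all objects in the data set, the candidate included.

```python
from math import dist

def db_outliers(data, p, d):
    """Naive search for DB(p, D)-outliers.

    An object o is reported when at least a fraction p of the objects in
    data lie at a distance greater than d from o.
    """
    outliers = []
    for o in data:
        far = sum(1 for other in data if dist(o, other) > d)
        if far >= p * len(data):
            outliers.append(o)
    return outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = db_outliers(data, p=0.75, d=3.0)
```

Here four of the five objects lie farther than 3 from (10, 10), so it is the only DB(0.75, 3)-outlier.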

2012 h
Summary
Clusteranalysis groupsobjectsbasedontheirsimilarity and

haswideapplications
Measureofsimilaritycanbecomputedforvarioustypesof
data
Clusteringalgorithmscanbecategorized intopartitioning
methods,hierarchicalmethods,densitybasedmethods,grid
basedmethods,andmodelbasedmethods
Outlierdetection andanalysisareveryusefulforfraud
detection,etc.andcanbeperformedbystatistical,distance
basedordeviationbasedapproaches
Therearestilllotsofresearchissuesonclusteranalysis
2012 h
Review Questions
State the need for market basket analysis.
What are the two conditions that make an association rule interesting?
State the two-step process of association rule mining.
Define the Apriori property.
List the techniques to improve the efficiency of Apriori.
What is clustering analysis?
Give the typical requirements of clustering in data mining.
What is the difference between symmetric and asymmetric binary variables?
State the types of data in cluster analysis.

2012 h
Bibliography
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94

2012 h
UNITIII
Classificationandprediction

2012 h
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary

2012 h
Classificationvs.Prediction
Classification
predictscategoricalclasslabels(discreteornominal)
classifiesdata(constructsamodel)basedonthetraining
setandthevalues(classlabels)inaclassifyingattribute
andusesitinclassifyingnewdata
Prediction
modelscontinuousvaluedfunctions,i.e.,predictsunknown
ormissingvalues
Typicalapplications
Creditapproval
Targetmarketing
Medicaldiagnosis
Frauddetection

2012 h
ClassificationATwoStepProcess
Modelconstruction:describingasetofpredeterminedclasses
Eachtuple/sampleisassumedtobelongtoapredefinedclass,as
determinedbytheclasslabelattribute
Thesetoftuplesusedformodelconstructionistrainingset
Themodelisrepresentedasclassificationrules,decisiontrees,or
mathematicalformulae
Modelusage:forclassifyingfutureorunknownobjects
Estimateaccuracy ofthemodel
Theknownlabeloftestsampleiscomparedwiththeclassified
resultfromthemodel
Accuracyrateisthepercentageoftestsetsamplesthatare
correctlyclassifiedbythemodel
Testsetisindependentoftrainingset,otherwiseoverfittingwill
occur
Iftheaccuracyisacceptable,usethemodeltoclassifydata tuples
whoseclasslabelsarenotknown
2012 h
Process (1): Model Construction
Training Data, then Classification Algorithms, then Classifier (Model)

Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
2012 h
Process (2): Using the Model in Prediction
The classifier is applied first to the testing data, then to unseen data

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
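The model-usage step can be made concrete by applying the rule from the model-construction slide to the testing data and then to the unseen tuple. The function name `tenured` and the tuple layout are assumptions for the illustration.

```python
def tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes (else no)."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# (name, rank, years, known label) from the testing-data table.
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(1 for _, rank, years, label in test_set
              if tenured(rank, years) == label)
accuracy = correct / len(test_set)
jeff = tenured("Professor", 4)
```

The rule misclassifies Merlisa (7 years but not tenured), giving an accuracy rate of 75% on the test set, and it predicts that Jeff is tenured.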
2012 h
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
2012 h
Issues:DataPreparation
Datacleaning
Preprocessdatainordertoreducenoiseandhandle
missingvalues
Relevanceanalysis(featureselection)
Removetheirrelevantorredundantattributes
Datatransformation
Generalizeand/ornormalizedata

2012 h
Issues:EvaluatingClassificationMethods
Accuracy
classifieraccuracy:predictingclasslabel
predictoraccuracy:guessingvalueofpredictedattributes
Speed
timetoconstructthemodel(trainingtime)
timetousethemodel(classification/predictiontime)
Robustness:handlingnoiseandmissingvalues
Scalability:efficiencyindiskresidentdatabases
Interpretability
understandingandinsightprovidedbythemodel
Othermeasures,e.g.,goodnessofrules,suchasdecisiontree
sizeorcompactnessofclassificationrules

2012 h
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
2012 h
Output: A Decision Tree for "buys_computer"
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

2012 h
AlgorithmforDecisionTreeInduction
Basicalgorithm(agreedyalgorithm)
Treeisconstructedinatopdownrecursivedivideandconquermanner
Atstart,allthetrainingexamplesareattheroot
Attributesarecategorical(ifcontinuousvalued,theyarediscretizedin
advance)
Examplesarepartitionedrecursivelybasedonselectedattributes
Testattributesareselectedonthebasisofaheuristicorstatistical
measure(e.g.,informationgain)
Conditionsforstoppingpartitioning
Allsamplesforagivennodebelongtothesameclass
Therearenoremainingattributesforfurtherpartitioning majority
voting isemployedforclassifyingtheleaf
Therearenosamplesleft

2012 h
ClassificationinLargeDatabases
Classificationaclassicalproblemextensivelystudiedby
statisticiansandmachinelearningresearchers
Scalability:Classifyingdatasetswithmillionsofexamplesand
hundredsofattributeswithreasonablespeed
Whydecisiontreeinductionindatamining?
relativelyfasterlearningspeed(thanotherclassification
methods)
convertibletosimpleandeasytounderstandclassification
rules
canuseSQLqueriesforaccessingdatabases
comparableclassificationaccuracywithothermethods

Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree induction (Kamber et al. '97)
Classification at primitive concept levels
  E.g., precise temperature, humidity, outlook, etc.
  Low-level concepts, scattered classes, bushy classification trees
  Semantic interpretation problems
Cube-based multi-level classification
  Relevance analysis at multiple levels
  Information-gain analysis with dimension + level

Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naive Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
  E.g., X will buy a computer, regardless of age, income, ...
P(X): probability that the sample data is observed
P(X|H) (likelihood, i.e., the posterior probability of X conditioned on H): the probability of observing the sample X, given that the hypothesis holds
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayesian Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Informally, this can be written as
  posteriori = likelihood x prior / evidence
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost

Towards Naive Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn)
Suppose there are m classes C1, C2, ..., Cm
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Since P(X) is constant for all classes, only the right-hand side of
$$P(C_i \mid X) \propto P(X \mid C_i)\,P(C_i)$$
needs to be maximized

Naive Bayesian Classifier: Training Dataset
Class:
  C1: buys_computer = yes
  C2: buys_computer = no
Data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Naive Bayesian Classifier: An Example
P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  P(age <= 30 | buys_computer = no) = 3/5 = 0.6
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no) = 2/5 = 0.4
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no) = 1/5 = 0.2
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
  P(X | buys_computer = yes) * P(buys_computer = yes) = 0.028
  P(X | buys_computer = no) * P(buys_computer = no) = 0.007
Therefore, X belongs to class (buys_computer = yes)
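
The arithmetic above can be reproduced with a short sketch. It assumes class-conditional independence, so P(X|Ci) is taken as the product of the per-attribute relative frequencies, and it applies no smoothing for zero counts. The names ROWS and naive_bayes_scores are illustrative.

```python
from collections import Counter

# The 14 training tuples: age, income, student, credit_rating, buys_computer
ROWS = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
DATA = [tuple(line.split()) for line in ROWS.strip().splitlines()]


def naive_bayes_scores(rows, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, using relative frequencies."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for k, value in enumerate(x):
            # P(x_k | Ci): fraction of class-Ci tuples with this attribute value
            n_match = sum(1 for r in rows if r[-1] == label and r[k] == value)
            score *= n_match / n_label
        scores[label] = score
    return scores


X = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, X)
prediction = max(scores, key=scores.get)
print(prediction, {c: round(s, 3) for c, s in scores.items()})
```

A production classifier would normally add a Laplacian correction so that a single zero count does not force the whole product to zero.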

Naive Bayesian Classifier: Comments
Advantages
  Easy to implement
  Good results obtained in most of the cases
Disadvantages
  Assumption: class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
    E.g., hospitals: patients: Profile: age, family history, etc.
    Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
    Dependencies among these cannot be modeled by the naive Bayesian classifier
How to deal with these dependencies?
  Bayesian belief networks

Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
  Represents dependency among the variables
  Gives a specification of the joint probability distribution
Nodes: random variables
Links: dependency
[Figure: directed graph with edges X -> Z, Y -> Z, and Y -> P]
  X and Y are the parents of Z, and Y is the parent of P
  No dependency between Z and P
  Has no loops or cycles

Bayesian Belief Network: An Example
[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]
The conditional probability table (CPT) for variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of its parents
Derivation of the probability of a particular combination of values of X, from the CPT:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(Y_i))$$

Training Bayesian Networks
Several scenarios:
  Given both the network structure and all variables observable: learn only the CPTs
  Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  Network structure unknown, all variables observable: search through the model space to reconstruct network topology
  Unknown structure, all hidden variables: no good algorithms known for this purpose
Ref.: D. Heckerman, Bayesian networks for data mining

Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
  n_covers = # of tuples covered by R
  n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed
  Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  Class-based ordering: decreasing order of prevalence or misclassification cost per class
  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
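
As a small illustration of the two rule measures, the sketch below evaluates rule R on the buys_computer training tuples shown earlier. It assumes that "youth" corresponds to the age value <=30 in that table; only the columns the rule needs are kept, and all names are illustrative.

```python
# (age, student, buys_computer) columns of the 14 training tuples
DATA = [
    ("<=30", "no", "no"), ("<=30", "no", "no"), ("31..40", "no", "yes"),
    (">40", "no", "yes"), (">40", "yes", "yes"), (">40", "yes", "no"),
    ("31..40", "yes", "yes"), ("<=30", "no", "no"), ("<=30", "yes", "yes"),
    (">40", "yes", "yes"), ("<=30", "yes", "yes"), ("31..40", "no", "yes"),
    ("31..40", "yes", "yes"), (">40", "no", "no"),
]


def rule_antecedent(row):
    """IF age = youth AND student = yes (youth assumed to mean age <=30)."""
    age, student, _ = row
    return age == "<=30" and student == "yes"


RULE_CONSEQUENT = "yes"  # THEN buys_computer = yes

covered = [row for row in DATA if rule_antecedent(row)]
n_covers = len(covered)
n_correct = sum(1 for row in covered if row[-1] == RULE_CONSEQUENT)

coverage = n_covers / len(DATA)   # n_covers / |D|
accuracy = n_correct / n_covers   # n_correct / n_covers
print(n_covers, n_correct, round(coverage, 4), accuracy)
```

Here the rule covers 2 of the 14 tuples and classifies both correctly, so it is accurate but has low coverage.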

Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes

Example: rule extraction from our buys_computer decision tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Extraction from the Training Data
Sequential covering algorithm: extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
Steps:
  Rules are learned one at a time
  Each time a rule is learned, the tuples covered by the rule are removed
  The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Compare with decision-tree induction, which learns a set of rules simultaneously


Classification: A Mathematical Mapping
Classification:
  Predicts categorical class labels
E.g., personal homepage classification
  xi = (x1, x2, x3, ...), yi = +1 or -1
  x1: # of occurrences of the word "homepage"
  x2: # of occurrences of the word "welcome"
Mathematically
  x ∈ X = ℝ^n, y ∈ Y = {+1, -1}
  We want a function f: X → Y

Linear Classification
Binary classification problem
[Figure: scatter plot of points labeled x and o, separated by a red line]
The data above the red line belongs to class 'x'
The data below the red line belongs to class 'o'
Examples: SVM, Perceptron, Probabilistic Classifiers

Discriminative Classifiers
Advantages
  Prediction accuracy is generally high (as compared to Bayesian methods in general)
  Robust: works when training examples contain errors
  Fast evaluation of the learned target function (Bayesian networks are normally slow)
Criticism
  Long training time
  Difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery
  Not easy to incorporate domain knowledge, which is easy in the Bayesian setting in the form of priors on the data or distributions

Classification by Backpropagation
Backpropagation: a neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: a set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier
Weakness
  Long training time
  Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
  High tolerance to noisy data
  Ability to classify untrained patterns
  Well-suited for continuous-valued inputs and outputs
  Successful on a wide array of real-world data
  Algorithms are inherently parallel
  Techniques have recently been developed for the extraction of rules from trained neural networks
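
A Neuron (= a Perceptron)
[Figure: input vector x = (x0, x1, ..., xn) and weight vector w = (w0, w1, ..., wn) feed a weighted sum with bias -μk, followed by an activation function f that produces the output y]
For example:
$$y = \mathrm{sign}\Big(\sum_{i=0}^{n} w_i x_i + \mu_k\Big)$$
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping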

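A Multi-Layer Feed-Forward Neural Network
[Figure: input vector X enters the input layer, which connects through weights w_ij to the hidden layer and then to the output layer, which emits the output vector]
Net input and output of unit j:
$$I_j = \sum_i w_{ij} O_i + \theta_j, \qquad O_j = \frac{1}{1 + e^{-I_j}}$$
Error of an output-layer unit j with target T_j:
$$Err_j = O_j (1 - O_j)(T_j - O_j)$$
Error of a hidden-layer unit j, summed over the units k of the next layer:
$$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$$
Weight and bias updates with learning rate (l):
$$w_{ij} = w_{ij} + (l)\, Err_j\, O_i, \qquad \theta_j = \theta_j + (l)\, Err_j$$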

How a Multi-Layer Neural Network Works
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function

Defining a Network Topology
First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
Input: one input unit per domain value, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

Backpropagation
Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer down to the first hidden layer, hence "backpropagation"
Steps
  Initialize weights (to small random #s) and biases in the network
  Propagate the inputs forward (by applying activation function)
  Backpropagate the error (by updating weights and biases)
  Terminating condition (when error is very small, etc.)
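
The steps above can be sketched for a tiny network with three inputs, two hidden units, and one output unit, using sigmoid units and per-tuple updates. The initial weights, the learning rate of 0.9, and the single training tuple are illustrative assumptions, not values from these slides.

```python
from math import exp


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


class TinyNetwork:
    """3 inputs -> 2 hidden units -> 1 output unit, all sigmoid."""

    def __init__(self):
        # Illustrative initial weights and biases (assumed values).
        self.w_hidden = [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]]  # w_hidden[j][i]
        self.theta_hidden = [-0.4, 0.2]
        self.w_out = [-0.3, -0.2]
        self.theta_out = 0.1

    def forward(self, x):
        """Propagate the inputs forward; return hidden outputs and final output."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + th)
                  for ws, th in zip(self.w_hidden, self.theta_hidden)]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.theta_out)
        return hidden, out

    def update(self, x, target, rate):
        """One backpropagation step for a single training tuple."""
        hidden, out = self.forward(x)
        # Error of the output unit: O(1 - O)(T - O)
        err_out = out * (1 - out) * (target - out)
        # Error of each hidden unit: O(1 - O) * Err_out * w (old output weights)
        err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, self.w_out)]
        # Update output-layer weights and bias
        self.w_out = [w + rate * err_out * h for w, h in zip(self.w_out, hidden)]
        self.theta_out += rate * err_out
        # Update hidden-layer weights and biases
        for j, err in enumerate(err_hidden):
            self.w_hidden[j] = [w + rate * err * xi
                                for w, xi in zip(self.w_hidden[j], x)]
            self.theta_hidden[j] += rate * err


net = TinyNetwork()
x, target = [1, 0, 1], 1.0
_, before = net.forward(x)
for _ in range(100):
    net.update(x, target, rate=0.9)
_, after = net.forward(x)
print(round(before, 3), round(after, 3))
```

With these assumed values the initial output is about 0.474, and repeated updates move it toward the target of 1. A real training loop would iterate over many tuples and stop on a terminating condition such as a small error.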


Associative Classification
Associative classification
  Association rules are generated and analyzed for use in classification
  Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Classification: based on evaluating a set of rules in the form of
    p1 ∧ p2 ∧ ... ∧ pl → A_class = C (conf, sup)
Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods
CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
  Mine possible association rules in the form of
    cond-set (a set of attribute-value pairs) → class label
  Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
  Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
  Generation of predictive rules (FOIL-like analysis)
  High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al., SIGMOD'05)
  Explore high-dimensional classification, using top-k rule groups
  Achieve high classification accuracy and high run-time efficiency

The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real-valued
For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: query point xq among training examples labeled + and _, with the corresponding Voronoi diagram]
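
A minimal sketch of k-NN for a discrete-valued target, using Euclidean distance and a majority vote among the k nearest training examples. The training points and the function name knn_classify are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(training, query, k):
    """Return the most common label among the k training points nearest to query."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training examples: (point, label)
TRAINING = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.5), "+"),
    ((5.0, 5.0), "-"), ((6.0, 5.5), "-"), ((5.5, 6.5), "-"),
]

prediction = knn_classify(TRAINING, (1.8, 1.6), k=3)
print(prediction)
```

Sorting all training points is the simplest approach; for large training sets an index structure would usually replace the full sort.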

What Is Prediction?
(Numerical) prediction is similar to classification
  Construct a model
  Use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
  Classification refers to predicting a categorical class label
  Prediction models continuous-valued functions
Major method for prediction: regression
  Model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
  Linear and multiple regression
  Nonlinear regression
  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
Linear regression: involves a response variable y and a single predictor variable x
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  E.g., for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  Solvable by extension of the least squares method or using SAS, S-Plus
  Many nonlinear functions can be transformed into the above
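
The least squares estimates for the single-predictor case can be computed directly from the formulas for w1 and w0. The sketch below uses a small made-up data set that lies exactly on the line y = 1 + 2x; the function name fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # w1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    # w0 = y_bar - w1 * x_bar
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data on the line y = 1 + 2x
w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w0, w1)
```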

Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
is convertible to linear form with the new variables x_2 = x^2, x_3 = x^3:
  y = w0 + w1 x + w2 x_2 + w3 x_3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sum of exponential terms)
  It is still possible to obtain least squares estimates through extensive calculation on more complex formulae

Other Regression-Based Models
Generalized linear model:
  Foundation on which linear regression can be applied to modeling categorical response variables
  Variance of y is a function of the mean value of y, not a constant
  Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  Poisson regression: models data that exhibit a Poisson distribution
Log-linear models (for categorical data):
  Approximate discrete multidimensional probability distributions
  Also useful for data compression and smoothing
Regression trees and model trees
  Trees to predict continuous values rather than class labels

Regression Trees and Model Trees
Regression tree: proposed in the CART system (Breiman et al. 1984)
  CART: Classification And Regression Trees
  Each leaf stores a continuous-valued prediction
  It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
  Each leaf holds a regression model: a multivariate linear equation for the predicted attribute
  A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

Predictive Modeling in Multidimensional Databases
Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis
  Generalized linear model construction
  Prediction
Determine the major factors which influence the prediction
  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis

Boosting
Analogy: consult several doctors, based on a combination of weighted diagnoses, with weight assigned based on the previous diagnosis accuracy
How does boosting work?
  Weights are assigned to each training tuple
  A series of k classifiers is iteratively learned
  After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
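
The slide does not give the weight-update formula. One common scheme is the AdaBoost-style update, sketched below as an assumption: the weights of correctly classified tuples are multiplied by error / (1 - error), all weights are renormalized, and the classifier's vote weight is log((1 - error) / error). The function name and the example are illustrative.

```python
from math import log


def update_weights(weights, correct):
    """AdaBoost-style update (assumed scheme).

    weights: current tuple weights, summing to 1
    correct: booleans, True where classifier Mi classified the tuple correctly
    Returns the normalized new weights and the classifier's vote weight.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("this sketch expects a weighted error in (0, 0.5)")
    factor = error / (1 - error)
    # Shrink the weights of correctly classified tuples, keep the others.
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(new)
    vote = log((1 - error) / error)
    return [w / total for w in new], vote


# Four tuples with equal weights; the last one was misclassified.
new_weights, vote = update_weights([0.25] * 4, [True, True, True, False])
print([round(w, 3) for w in new_weights], round(vote, 3))
```

After the update the misclassified tuple carries half of the total weight, so the next classifier is pushed to get it right.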

Summary (I)
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

Summary (II)
Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic.
No single method has been found to be superior over all others for all data sets.
Issues such as accuracy, training time, robustness, interpretability, and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method.

Review Questions
How does classification work?
How is prediction different from classification?
Define data cleaning.
List the criteria involved in comparing and evaluating classification and prediction methods.
What is a Bayesian classifier?
State Bayes' theorem.
Define backpropagation. How does it work?
Explain rule pruning.
What if we would like to predict a continuous value, rather than a categorical label?
Explain linear regression.
Explain polynomial regression.
Give a note on the bootstrap method.
What is boosting? State why it may improve the accuracy of decision tree induction.


UNIT IV
Techniques

Mining Stream, Time-Series, and Sequence Data
  Mining data streams
  Mining time-series data
  Mining sequence patterns in transactional databases
  Mining sequence patterns in biological data

Mining Data Streams
What is stream data? Why stream data systems?
Stream data management systems: issues and solutions
Stream data cube and multidimensional OLAP analysis
Stream frequent pattern analysis
Stream classification
Stream cluster analysis
Research issues

Characteristics of Data Streams
Data streams
  Data streams: continuous, ordered, changing, fast, huge amount
  Traditional DBMS: data stored in finite, persistent data sets
Characteristics
  Huge volumes of continuous data, possibly infinite
  Fast changing; requires fast, real-time response
  Data stream captures nicely our data processing needs of today
  Random access is expensive: single-scan algorithm (can only have one look)
  Store only the summary of the data seen thus far
  Most stream data are at a pretty low level or multidimensional in nature, and need multi-level and multidimensional processing

Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved, but random access is too expensive)

DBMS versus DSMS

DBMS                               DSMS
Persistent relations               Transient streams
One-time queries                   Continuous queries
Random access                      Sequential access
"Unbounded" disk store             Bounded main memory
Only current state matters         Historical data is important
No real-time services              Real-time requirements
Relatively low update rate         Possibly multi-GB arrival rate
Data at any granularity            Data at fine granularity
Assume precise data                Data stale/imprecise
Access plan determined by query    Unpredictable/variable data arrival
processor, physical DB design      and characteristics

Ack. From Motwani's PODS tutorial slides

Architecture: Stream Query Processing
[Figure: multiple streams enter the Stream Query Processor of the SDMS (Stream Data Management System); the user/application issues a continuous query and receives results; the processor uses scratch space (main memory and/or disk)]

Challenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, ordered streams
Main-memory computations
Queries are often continuous
  Evaluated continuously as stream data arrives
  Answer updated over time
Queries are often complex
  Beyond element-at-a-time processing
  Beyond stream-at-a-time processing
  Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multidimensional processing and data mining
  Most stream data are at a low level or multidimensional in nature

Processing Stream Queries
Query types
  One-time query vs. continuous query (being evaluated continuously as the stream continues to arrive)
  Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
  For real-time response, a main-memory algorithm should be used
  Memory requirement is unbounded if one will join future tuples
Approximate query answering
  With bounded memory, it is not always possible to produce exact answers
  High-quality approximate answers are desired
  Data reduction and synopsis construction methods
    Sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing
Major challenges
  Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
  Synopses (trade-off between accuracy and storage)
  Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)
  Compute an approximate answer within a small error range (a factor ε of the actual answer)
Major methods
  Random sampling
  Histograms
  Sliding windows
  Multi-resolution model
  Sketches
  Randomized algorithms
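
Random sampling over a stream of unknown length is often done with reservoir sampling, a standard technique that the slide does not name explicitly. The sketch below keeps a uniform sample of k items in O(k) memory with a single scan; the function name and the seed are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
print(sample)
```

Each stream element is examined once and never revisited, which matches the single-scan constraint of stream processing.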

Stream Data Mining vs. Stream Querying
Stream mining: a more challenging task in many cases
  It shares most of the difficulties with stream querying
    But often requires less "precision", e.g., no join, grouping, sorting
  Patterns are hidden and more general than querying
  It may require exploratory analysis
    Not necessarily continuous queries
Stream data mining tasks
  Multidimensional on-line analysis of streams
  Mining outliers and unusual patterns in stream data
  Clustering data streams
  Classification of stream data

Challenges for Mining Dynamics in Data Streams
Most stream data are at a pretty low level or multidimensional in nature: needs multi-level/multidimensional (ML/MD) processing
Analysis requirements
  Multidimensional trends and unusual patterns
  Capturing important changes at multiple dimensions/levels
  Fast, real-time detection and response
  Comparing with data cube: similarity and differences
Stream (data) cube or stream OLAP: is this feasible?
  Can we implement it efficiently?

A Stream Cube Architecture
A tilted time frame
  Different time granularities
    second, minute, quarter, hour, day, week, ...
Critical layers
  Minimum interest layer (m-layer)
  Observation layer (o-layer)
  User: watches at the o-layer and occasionally needs to drill down to the m-layer
Partial materialization of stream cubes
  Full materialization: too space- and time-consuming
  No materialization: slow response at query time
  Partial materialization: what do we mean by "partial"?


Frequent Patterns for Stream Data
Frequent pattern mining is valuable in stream applications
  e.g., network intrusion mining (Dokas et al. '02)
Mining precise frequent patterns in stream data: unrealistic
  Even if they are stored in a compressed form, such as an FP-tree
How to mine frequent patterns with good approximation?
  Approximate frequent patterns (Manku & Motwani, VLDB'02)
  Keep only current frequent patterns? No changes can be detected
Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
  Use tilted time window frame
  Mining evolution and dramatic changes of frequent patterns
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

Mining Approximate Frequent Patterns
Mining precise frequent patterns in stream data: unrealistic
  Even if they are stored in a compressed form, such as an FP-tree
Approximate answers are often sufficient (e.g., trend/pattern analysis)
  Example: a router is interested in all flows
    whose frequency is at least 1% (σ) of the entire traffic stream seen so far
    and feels that an error of 1/10 of σ (ε = 0.1%) is comfortable
How to mine frequent patterns with good approximation?
  Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  Major idea: do not trace an item until it becomes frequent
  Adv.: guaranteed error bound
  Disadv.: keeps a large set of traces
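
A minimal sketch of lossy counting for single items, following the published algorithm as commonly described: the stream is divided into buckets of width 1/ε, each tracked item stores its count and a maximum possible undercount, and low-count entries are pruned at every bucket boundary. The class name, the width parameter (so that ε = 1/width), and the example stream are illustrative assumptions.

```python
from math import ceil


class LossyCounter:
    """Approximate item counts with undercount at most epsilon * N."""

    def __init__(self, width):
        self.width = width            # bucket width, i.e. 1 / epsilon
        self.epsilon = 1 / width
        self.n = 0                    # number of items seen so far
        self.entries = {}             # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been missed in up to (bucket - 1) earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: prune entries with count + undercount <= bucket id.
            self.entries = {e: fd for e, fd in self.entries.items()
                            if fd[0] + fd[1] > bucket}

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * N."""
        threshold = (support - self.epsilon) * self.n
        return {e: f for e, (f, _) in self.entries.items() if f >= threshold}


counter = LossyCounter(width=100)     # epsilon = 1%
for i in range(1000):
    # Item "a" is every second element; all other items occur once.
    counter.add("a" if i % 2 == 0 else f"x{i}")
result = counter.frequent(support=0.1)
print(result, len(counter.entries))
```

In this example the 500 one-off items are all pruned at bucket boundaries, so only the frequent item remains in memory.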

Classification for Dynamic Data Streams
Decision tree induction for stream data classification
  VFDT (Very Fast Decision Tree) / CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)
Is a decision tree good for modeling fast-changing data, e.g., stock market analysis?
Other stream classification methods
  Instead of decision trees, consider other models
    Naive Bayesian
    Ensemble (Wang, Fan, Yu, Han, KDD'03)
    K-nearest neighbors (Aggarwal, Han, Wang, Yu, KDD'04)
  Tilted time framework, incremental updating, dynamic maintenance, and model construction
  Comparing models to find changes

Hoeffding Tree
With high probability, classifies tuples the same as a conventional (batch-built) tree would
Only uses a small sample
  Based on the Hoeffding bound principle
Hoeffding bound (additive Chernoff bound)
  r: random variable
  R: range of r
  n: # independent observations
  The mean of r is at least r_avg - ε, with probability 1 - δ
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

Hoeffding Tree Algorithm
Hoeffding tree input
  S: sequence of examples
  X: attributes
  G(): evaluation function
  δ: desired accuracy
Hoeffding tree algorithm
  for each example in S
    retrieve G(Xa) and G(Xb)   // two highest G(Xi)
    if ( G(Xa) - G(Xb) > ε )
      split on Xa
      recurse to next node
      break
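
The split test in the loop above depends only on the Hoeffding bound, which can be sketched as follows. The function names and the numeric inputs are illustrative; the full tree maintenance (sufficient statistics per leaf, recursion) is not shown.

```python
from math import log, sqrt


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical setting: R = 1, delta = 0.05, n = 1000 examples seen at the node
eps = hoeffding_bound(1.0, 0.05, 1000)
print(round(eps, 4))
print(should_split(0.25, 0.15, 1.0, 0.05, 1000))  # gap 0.10 is larger than eps
print(should_split(0.25, 0.24, 1.0, 0.05, 1000))  # gap 0.01 is smaller than eps
```

Because ε shrinks as n grows, a node with two closely matched attributes simply waits for more examples before it splits.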

Decision-Tree Induction with Data Streams
[Figure: as the data stream arrives, the tree grows from a single test "Packets > 10" (yes/no) to include the test "Protocol = http", and then the further tests "Bytes > 60K" and "Protocol = ftp"]
Ack. From Gehrke's SIGMOD tutorial slides

Hoeffding Tree: Strengths and Weaknesses
Strengths
  Scales better than traditional methods
    Sublinear with sampling
    Very small memory utilization
  Incremental
    Make class predictions in parallel
    New examples are added as they come
Weaknesses
  Could spend a lot of time with ties
  Memory used with tree expansion
  Number of candidate attributes

Ensemble of Classifiers Algorithm
H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", KDD'03.
Method (derived from the ensemble idea in classification)
  train K classifiers from K chunks
  for each subsequent chunk
    train a new classifier
    test other classifiers against the chunk
    assign weight to each classifier
    select top K classifiers


Clustering Data Streams [GMMO01]
Based on the k-median method
  Data stream points from metric space
  Find k clusters in the stream s.t. the sum of distances from data points to their closest center is minimized
Constant factor approximation algorithm
  In small space, a simple two-step algorithm:
  1. For each set of M records, Si, find O(k) centers in S1, ..., Sl
     Local clustering: assign each point in Si to its closest center
  2. Let S be the centers for S1, ..., Sl, with each center weighted by the number of points assigned to it
     Cluster S to find k centers

Hierarchical Clustering Tree
[Figure: tree with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top]

Hierarchical Tree and Drawbacks
Method:
  Maintain at most m level-i medians
  On seeing m of them, generate O(k) level-(i+1) medians of weight equal to the sum of the weights of the intermediate medians assigned to them
Drawbacks:
  Low quality for evolving data streams (registers only k centers)
  Limited functionality in discovering and exploring clusters over different portions of the stream over time

Summary: Stream Data Mining
Stream data mining: a rich and on-going research field
  Current research focus in database community:
    DSMS system architecture, continuous query processing, supporting mechanisms
  Stream data mining and stream OLAP analysis
    Powerful tools for finding general and unusual patterns
    Effectiveness, efficiency, and scalability: lots of open problems
Our philosophy on stream data analysis and mining
  A multidimensional stream analysis framework
  Time is a special dimension: tilted time frame
  What to compute and what to save? Critical layers
  Partial materialization and precomputation
  Mining dynamics of stream data

Time-Series and Sequential Pattern Mining
Regression and trend analysis: a statistical approach
Similarity search in time-series analysis
Sequential pattern mining
Markov chain
Hidden Markov model

Mining Time-Series Data
Time-series database
  Consists of sequences of values or events changing with time
  Data is recorded at regular intervals
  Characteristic time-series components
    Trend, cycle, seasonal, irregular
Applications
  Financial: stock price, inflation
  Industry: power consumption
  Scientific: experiment results
  Meteorological: precipitation

Categories of Time-Series Movements
Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time
Cyclic movements or cycle variations: long-term oscillations about a trend line or curve
  e.g., business cycles, may or may not be periodic
Seasonal movements or seasonal variations
  i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
Irregular or random movements
Time-series analysis: decomposition of a time series into these four basic movements
  Additive model: TS = T + C + S + I
  Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve
The freehand method
  Fit the curve by looking at the graph
  Costly and barely reliable for large-scale data mining
The least squares method
  Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points
The moving-average method
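
The moving-average method is only named on the slide; a minimal sketch of a simple (unweighted) moving average of a given order is shown below. The function name and the sample series are illustrative.

```python
def moving_average(values, order):
    """Arithmetic mean of each window of `order` consecutive values."""
    if not 1 <= order <= len(values):
        raise ValueError("order must be between 1 and len(values)")
    return [sum(values[i:i + order]) / order
            for i in range(len(values) - order + 1)]


series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
smoothed = moving_average(series, order=3)
print(smoothed)  # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
```

Smoothing reduces the variation in the series and makes the trend easier to see, at the cost of losing the first and last few points.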

Trend Discovery in Time Series (1): Estimation of Seasonal Variations
Seasonal index
  Set of numbers showing the relative values of a variable during the months of the year
  E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months
Deseasonalized data
  Data adjusted for seasonal variations for better trend and cyclic analysis
  Divide the original monthly data by the seasonal index numbers for the corresponding months
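
Deseasonalizing can be sketched in a few lines. The seasonal index numbers 80, 120, and 140 come from the example above; the monthly sales figures are hypothetical values chosen so that the average monthly sales are 1000.

```python
def deseasonalize(monthly_values, seasonal_index):
    """Divide each monthly value by its seasonal index (given in percent)."""
    return [value * 100 / index
            for value, index in zip(monthly_values, seasonal_index)]


# Hypothetical October-December sales and their seasonal index numbers
sales = [800, 1200, 1400]
index = [80, 120, 140]
adjusted = deseasonalize(sales, index)
print(adjusted)  # [1000.0, 1000.0, 1000.0]
```

Once the seasonal effect is removed, the three months look identical, so any remaining differences in real data would point to trend, cyclic, or irregular movements.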

Trend Discovery in Time Series (2)
Estimation of cyclic variations
  If (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes
Estimation of irregular variations
  By adjusting the data for trend, seasonal, and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality

Similarity Search in Time-Series Analysis
A normal database query finds exact matches
Similarity search finds data sequences that differ only slightly from the given query sequence
Two categories of similarity queries
  Whole matching: find sequences that are similar, as a whole, to the query sequence
  Subsequence matching: find sequences that contain subsequences similar to the query sequence
Typical applications
  Financial market
  Market basket data analysis
  Scientific databases
  Medical diagnosis

Data Transformation
Many techniques for signal analysis require the data to be in the frequency domain
Usually data-independent transformations are used
The transformation matrix is determined a priori
Discrete Fourier transform (DFT)
Discrete wavelet transform (DWT)
The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain (for an orthonormal transform)
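The distance-preservation property can be checked numerically. The sketch below assumes the orthonormal scaling of the DFT (a factor of 1/sqrt(n)); the two signals are arbitrary made-up values.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier transform with orthonormal scaling 1/sqrt(n)."""
    n = len(signal)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        for k in range(n)
    ]


def euclidean(u, v):
    """Euclidean distance; works for real or complex vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))


x = [1.0, 2.0, 0.5, -1.0]
y = [0.0, 1.5, 2.0, 0.5]
time_distance = euclidean(x, y)
freq_distance = euclidean(dft(x), dft(y))
```

With a different scaling convention the two distances differ by a constant factor, which still preserves the ranking of similar sequences.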

Mining sequence patterns in transactional databases

Mining data streams
Mining sequence patterns in transactional databases

Sequence Databases & Sequential Patterns
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
A sequence database
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
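The subsequence test and the support count for the example database can be sketched as below. Sequences are modeled as lists of item sets; the helper names `seq`, `is_subsequence`, and `support` are illustrative choices, not part of the slides.

```python
def seq(*elements):
    """Build a sequence from strings, e.g. seq("a", "abc") for <a(abc)>."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a later and later element."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest element of `sequence` that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


DATABASE = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}
```

Here `support(seq("ab", "c"), DATABASE)` counts sequences 10 and 30, which meets min_sup = 2.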

Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints

Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns, ICDE'95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei, et al. @ ICDE'01)
Vertical-format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

The Apriori Property of Sequential Patterns
A basic property: Apriori (Agrawal & Srikant '94)
If a sequence S is not frequent, then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

The SPADE Algorithm
SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
A vertical-format sequential pattern mining method
A sequence database is mapped to a large set of entries of the form Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
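The vertical format can be illustrated with the five-sequence database from the Apriori property slide. The sketch below only builds the Item: <SID, EID> lists and joins two of them to count a 2-sequence; it is a simplified illustration of the idea, not the full SPADE algorithm, and the function names are hypothetical.

```python
from collections import defaultdict

# Each sequence is a list of elements; each element is a string of items.
DATABASE = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}


def build_id_lists(database):
    """Map each item to its list of (SID, EID) occurrences."""
    id_lists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, element in enumerate(sequence, start=1):
            for item in element:
                id_lists[item].append((sid, eid))
    return id_lists


def support_of_pair(id_lists, first, second):
    """Support of the 2-sequence <first second> via a temporal join."""
    earliest = {}
    for sid, eid in id_lists[first]:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    latest = {}
    for sid, eid in id_lists[second]:
        latest[sid] = max(eid, latest.get(sid, eid))
    # A sequence supports the pattern if `first` occurs before `second`.
    return sum(1 for sid in earliest if sid in latest and earliest[sid] < latest[sid])


ID_LISTS = build_id_lists(DATABASE)
```

With min_sup = 2, `support_of_pair(ID_LISTS, "h", "b")` is below the threshold, matching the slide's statement that <hb> is infrequent.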

Mining data streams
Mining sequence patterns in transactional databases
Mining sequence patterns in biological data

Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary

Biology Fundamentals (1): DNA Structure
DNA: helix-shaped molecule whose constituents are two parallel strands of nucleotides
Nucleotides (bases)
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
DNA is usually represented by sequences of these four nucleotides
This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's and vice versa
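Deriving the second strand from the first is a simple character mapping. A minimal sketch, assuming uppercase single-letter nucleotide codes; the reverse complement is included because the two strands are conventionally read in opposite directions.

```python
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Pair each nucleotide with its partner (A-T, C-G)."""
    return "".join(PAIRING[base] for base in strand)


def reverse_complement(strand):
    """Complementary strand written in its own reading direction."""
    return complement(strand)[::-1]
```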

Biology Fundamentals (2): Genes
Gene: contiguous subparts of single-strand DNA that are templates for producing proteins. Genes can appear in either of the DNA strands.
Chromosomes: compact chains of coiled DNA
Genome: the set of all genes in a given organism.
Noncoding part: the function of DNA material between genes is largely unknown. Certain intergenic regions of DNA are known to play a major role in cell regulation (they control the production of proteins and their possible interactions with DNA).
Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm

Biology Fundamentals (3): Transcription
Proteins: produced from DNA using 3 operations or transformations: transcription, splicing, and translation
In eukaryotes (cells with a nucleus): genes are only a minute part of the total DNA
In prokaryotes (cells without a nucleus): the phase of splicing does not occur (no pre-RNA generated)
DNA is capable of replicating itself (DNA polymerase)
Central dogma: the capability of DNA for replication and undergoing the three (or two) transformations
Genes are transcribed into pre-RNA by a complex ensemble of molecules (RNA polymerase). During transcription, T is substituted by the letter U (for uracil).
Pre-RNA can be represented by alternations of sequence segments called exons and introns. The exons represent the parts of pre-RNA that will be expressed, i.e., translated into proteins.

Biology Fundamentals (4): Proteins
Splicing (by the spliceosome, an ensemble of proteins): concatenates the exons and excises introns to form mRNA (or simply RNA)
Translation (by ribosomes, an ensemble of RNA and proteins)
Repeatedly considers a triplet of consecutive nucleotides (called a codon) in RNA and produces one corresponding amino acid
In RNA, there is one special codon called the start codon and a few others called stop codons
An Open Reading Frame (ORF): a sequence of codons starting with a start codon and ending with a stop codon. The ORF is thus a sequence of nucleotides that is used by the ribosome to produce the sequence of amino acids that makes up a protein.
There are basically 20 amino acids (A, L, V, S, ...), but in certain rare situations, others can be added to that list.
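The ORF definition translates directly into a scan over codons. The sketch below assumes the standard start codon AUG and stop codons UAA, UAG, and UGA (the slide does not list them), and checks all three reading frames of a single RNA strand.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def find_orfs(rna):
    """Return each ORF as a list of codons, from a start codon to a stop codon."""
    orfs = []
    for frame in range(3):
        current = None  # codons of the ORF being read, if any
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if current is None:
                if codon == START:
                    current = [codon]
            else:
                current.append(codon)
                if codon in STOPS:
                    orfs.append(current)
                    current = None
    return orfs
```

An ORF that starts but never reaches a stop codon is dropped, since it does not satisfy the definition.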

Biological Information: From Genes to Proteins
[Figure: gene (DNA) -> transcription -> RNA -> translation -> protein -> protein folding; related fields shown alongside: genomics, molecular biology, structural biology, biophysics]

Biology Fundamentals (5): 3D Structure
Since there are 64 different codons and 20 amino acids, the table lookup for translating each codon into an amino acid is redundant: multiple codons can produce the same amino acid
The table used by nature to perform translation is called the genetic code
Due to the redundancy of the genetic code, certain nucleotide changes in DNA may not alter the resulting protein
Once a protein is produced, it folds into a unique structure in 3D space, with 3 types of components: helices, sheets, and coils.
The secondary structure of a protein is its sequence of amino acids, annotated to distinguish the boundary of each component
The tertiary structure is its 3D representation

Biological Data Available
The vast majority of data are sequences of symbols (nucleotides in genomic data, but also a good amount on amino acids).
Next in volume: microarray experiments and also protein-array data
Comparably small: 3D structures of proteins (PDB)
NCBI (National Center for Biotechnology Information) server:
Total 26B bp: 3B bp human genome, then several bacteria (e.g., E. coli), higher organisms: yeast, worm, fruit fly, mouse, and plants
The largest known gene has ~20 million bp and the largest protein consists of ~34k amino acids
PDB has a catalogue of only 45k proteins, specified by their 3D structure (i.e., one needs to infer protein shape from sequence data)

Bioinformatics
Computational management and analysis of biological information
Interdisciplinary field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology)
Bioinformatics vs. computational biology (more on algorithm correctness, complexity, and other themes central to theoretical CS)
[Figure: Bioinformatics and its subareas: Genomics, Functional Genomics, Proteomics, Structural Bioinformatics]

Data Mining & Bioinformatics: Why?
Many biological processes are not well understood
Biological knowledge is highly complex, imprecise, descriptive, and experimental
Biological data is abundant and information-rich
Genomics & proteomics data (sequences), microarrays and protein arrays, protein database (PDB), bio-testing data
Huge data banks, rich literature, openly accessible
Largest and richest scientific data sets in the world
Mining: gain biological insight (data/information -> knowledge)
Mining for correlations, linkages between disease and gene sequences, protein networks, classification, clustering, outliers, ...
Find correlations among linkages in literature and heterogeneous databases

Data Mining & Bioinformatics: How (1)
Data integration: handling heterogeneous, distributed bio-data
Build Web-based, interchangeable, integrated, multi-dimensional genome databases
Data cleaning and data integration methods become crucial
Mining correlated information across multiple databases itself becomes a data mining task
Typical studies: mining database structures, information extraction from data, reference reconciliation, document classification, clustering and correlation discovery algorithms, ...

Data Mining & Bioinformatics: How (2)
Mastery and exploration of existing data mining tools
Genomics, proteomics, and functional genomics (functional networks of genes and proteins)
What are the current bioinformatics tools aiming for?
Inferring a protein's shape and function from a given sequence of amino acids
Finding all the genes and proteins in a given genome
Determining sites in the protein structure where drug molecules can be attached

Data Mining & Bioinformatics: How (3)
Research and development of new tools for bioinformatics
Similarity search and comparison between classes of genes (e.g., diseased and healthy) by finding and comparing frequent patterns
Identify sequential patterns that play roles in various diseases
New clustering and classification methods for microarray data and protein-array data analysis
Mining, indexing, and similarity search in sequential and structured (e.g., graph and network) data sets
Path analysis: linking genes/proteins to different disease development stages
Develop pharmaceutical interventions that target the different stages separately
High-dimensional analysis and OLAP mining
Visualization tools and genetic/proteomic data analysis

Algorithms Used in Bioinformatics
Comparing sequences: comparing large numbers of long sequences, allowing insertion/deletion/mutation of symbols
Constructing evolutionary (phylogenetic) trees: comparing sequences of different organisms and building trees based on their degree of similarity (evolution)
Detecting patterns in sequences
Search for genes in DNA or subcomponents of a sequence of amino acids
Determining 3D structures from sequences
E.g., infer RNA shape from sequence and protein shape from amino acid sequence
Inferring cell regulation:
Cell modeling from experimental (say, microarray) data
Determining protein function and metabolic pathways: interpret human annotations for protein function and develop graph databases that can be queried
Assembling DNA fragments (provided by sequencing machines)
Using script languages: scripts on the Web to analyze data and applications

Comparing Sequences
All living organisms are related through evolution
Alignment: lining up sequences to achieve the maximal level of identity
Two sequences are homologous if they share a common ancestor
Sequences to be compared: either nucleotides (DNA/RNA) or amino acids (proteins)
Nucleotides: identical
Amino acids: identical, or if one can be derived from the other by substitutions that are likely to occur in nature
Local vs. global alignments: a local alignment aligns only portions of the sequences; a global alignment aligns over the entire length of the sequences
Use a gap "-" to indicate that it is preferable not to align two symbols
Percent identity: ratio between the number of columns containing identical symbols and the number of symbols in the longest sequence
Score of alignment: summing up the matches and counting gaps as negative
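Percent identity as defined above can be computed from two aligned rows. A minimal sketch, assuming both rows have equal length and use "-" for gaps; the function name is illustrative.

```python
def percent_identity(row_a, row_b):
    """Identical columns divided by the length of the longer ungapped sequence."""
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    longest = max(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return identical / longest
```

For the rows HEAGAWGHE-E and --P-AW-HEAE, five columns hold identical symbols and the longer sequence has ten symbols, so the ratio is 0.5.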

Sequence Alignment: Problem Definition
Goal:
Given two or more input sequences
Identify similar sequences with long conserved subsequences
Method:
Use substitution matrices (probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions)
Optimal alignment problem (for multiple sequences): NP-hard
Heuristic methods to find good alignments

Pairwise Sequence Alignment
Example: align HEAGAWGHEE with PAWHEAE
Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
Which one is better? Scoring alignments
To compare two sequence alignments, calculate a score
PAM (Percent Accepted Mutation) or BLOSUM (Blocks Substitution Matrix) (substitution) matrices: calculate matches and mismatches, considering amino acid substitution
Gap penalty: initiating a gap
Gap extension penalty: extending a gap

Pairwise Sequence Alignment: Scoring Matrix
     A    E    G    H    W
A    5   -1    0   -2   -3
E   -1    6   -3    0   -3
H   -2    0   -2   10   -3
P   -1   -1   -2   -2   -4
W   -3   -3   -3   -3   15
Gap penalty: -8
Gap extension: -8
HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
Exercise: calculate the score for
HEAGAWGHE-E
P-A--W-HEAE
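The column-by-column scoring can be sketched as follows. The matrix entries are the ones shown on the slide; the code assumes every gap column costs 8, whether it opens or extends a gap, and the lookup helper `substitution_score` is a hypothetical name.

```python
COLUMNS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
GAP = -8  # same cost for opening and extending a gap


def substitution_score(x, y):
    """Look up a pair in either orientation of the partial matrix."""
    for row, col in ((x, y), (y, x)):
        if row in ROWS and col in COLUMNS:
            return ROWS[row][COLUMNS.index(col)]
    raise KeyError(f"no score listed for pair {x}/{y}")


def alignment_score(row_a, row_b):
    """Sum substitution scores and gap penalties over all columns."""
    total = 0
    for a, b in zip(row_a, row_b):
        total += GAP if "-" in (a, b) else substitution_score(a, b)
    return total
```

Counting all five gap columns, the alignment against `--P-AW-HEAE` scores 1 and the exercise alignment against `P-A--W-HEAE` scores 0.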

Heuristic Alignment Algorithms
Motivation: complexity of alignment algorithms: O(nm)
Current protein DB: 100 million base pairs
Matching each sequence with a 1,000-base-pair query takes about 3 hours!
Heuristic algorithms aim at speeding up at the price of possibly missing the best-scoring alignment
Two well-known programs
BLAST: Basic Local Alignment Search Tool
FASTA: Fast Alignment Tool
Both find high-scoring local alignments between a query sequence and a target database
Basic idea: first locate high-scoring short stretches and then extend them
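The O(nm) cost comes from filling one dynamic-programming cell per pair of positions. The sketch below is a standard global-alignment recurrence (Needleman-Wunsch style) with a linear gap cost; the scoring values are placeholder assumptions rather than values from the slides, and neither BLAST nor FASTA works this way internally.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b in O(len(a) * len(b)) time."""
    # previous[j] holds the best score for a[:i-1] aligned with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]
```

Only two rows are kept in memory, which is enough when the score is needed but the alignment itself is not.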

Motivation for Markov Models in Computational Biology
There are many cases in which we would like to represent the statistical regularities of some class of sequences
genes
various regulatory sites in DNA (e.g., where RNA polymerase and transcription factors bind)
proteins in a given family
Markov models are well suited to this type of task

A Markov Chain Model
Transition probabilities
\Pr(x_i = a \mid x_{i-1} = g) = 0.16
\Pr(x_i = c \mid x_{i-1} = g) = 0.34
\Pr(x_i = g \mid x_{i-1} = g) = 0.38
\Pr(x_i = t \mid x_{i-1} = g) = 0.12
\sum_{x_i} \Pr(x_i \mid x_{i-1} = g) = 1

Definition of Markov Chain Model
A Markov chain model is defined by
a set of states
some states emit symbols
other states (e.g., the begin state) are silent
a set of transitions with associated probabilities
the transitions emanating from a given state define a distribution over the possible next states

Markov Chain Models: Properties
Given some sequence x of length L, we can ask how probable the sequence is given our model
For any probabilistic model of sequences, we can write this probability as
\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1)
= \Pr(x_L \mid x_{L-1}, \ldots, x_1) \, \Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)
Key property of a (1st-order) Markov chain: the probability of each x_i depends only on the value of x_{i-1}
\Pr(x) = \Pr(x_L \mid x_{L-1}) \, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1) \, \Pr(x_1)
= \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})

The Probability of a Sequence for a Markov Chain Model
\Pr(cggt) = \Pr(c) \, \Pr(g \mid c) \, \Pr(g \mid g) \, \Pr(t \mid g)
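The product for Pr(cggt) can be evaluated with a small lookup table. Only the transition probabilities out of state g come from the slides; the uniform initial distribution and the uniform rows for a, c, and t are placeholder assumptions made so the example runs.

```python
BASES = "acgt"

# Assumption: uniform initial distribution (not given on the slides).
INITIAL = {base: 0.25 for base in BASES}

# Assumption: uniform transitions out of a, c, and t (not given on the slides).
TRANSITION = {prev: {nxt: 0.25 for nxt in BASES} for prev in "act"}
# Transitions out of g, as listed on the Markov chain model slide.
TRANSITION["g"] = {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12}


def sequence_probability(sequence):
    """Pr(x) = Pr(x_1) * product of Pr(x_i | x_{i-1}) for i = 2..L."""
    probability = INITIAL[sequence[0]]
    for prev, nxt in zip(sequence, sequence[1:]):
        probability *= TRANSITION[prev][nxt]
    return probability
```

For long sequences the product underflows quickly, so practical implementations usually sum log-probabilities instead.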

Algorithms for Learning & Prediction
Learning
correct path known for each training sequence -> simple maximum likelihood or Bayesian estimation
correct path not known -> Forward-Backward algorithm + ML or Bayesian estimation
Classification
simple Markov model -> calculate probability of sequence along single path for each model
hidden Markov model -> Forward algorithm to calculate probability of sequence along all paths for each model
Segmentation
hidden Markov model -> Viterbi algorithm to find most probable path for sequence
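The Forward and Viterbi algorithms differ only in whether paths are summed or maximized. The sketch below uses a made-up two-state HMM; the state names H and L and all probabilities are hypothetical, and there is no silent begin or end state.

```python
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {
    "H": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
    "L": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
}


def forward(sequence):
    """Probability of the sequence summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(sequence):
    """Most probable state path and its joint probability with the sequence."""
    best = {s: (START[s] * EMIT[s][sequence[0]], [s]) for s in STATES}
    for symbol in sequence[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor that maximizes the path probability into s.
            prev = max(STATES, key=lambda r: best[r][0] * TRANS[r][s])
            prob = best[prev][0] * TRANS[prev][s] * EMIT[s][symbol]
            step[s] = (prob, best[prev][1] + [s])
        best = step
    probability, path = max(best.values(), key=lambda item: item[0])
    return path, probability
```

The Viterbi probability can never exceed the forward probability, since it covers one path out of all those the forward algorithm sums over.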

Summary: Mining Biological Data
Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences (sequences of nucleotides or amino acids)
Biosequence analysis can be partitioned into two essential tasks:
pairwise sequence alignment and multiple sequence alignment
The dynamic programming approach has been popularly used for sequence alignments; BLAST is a widely used heuristic tool for fast local alignment
Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state
Given a sequence of symbols, x, the forward algorithm finds the probability of obtaining x in the model
The Viterbi algorithm finds the most probable path (corresponding to x) through the model
The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences.

Graph mining

Graph Mining
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Graph Indexing
Similarity Search
Classification and Clustering
Summary

Why Graph Mining?
Graphs are ubiquitous
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2D/3D)
Complexity of algorithms: many problems are of high complexity

Graph, Graph, Everywhere
[Figure: example graphs: aspirin molecule; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); Internet; co-author network]

Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Graph Mining Algorithms
Incomplete beam search: greedy (Subdue)
Inductive logic programming (WARMR)
Graph-theory-based approaches
Apriori-based approach
Pattern-growth approach

SUBDUE (Holder et al. KDD'94)
Start with single vertices
Expand best substructures with a new edge
Limit the number of best substructures
Substructures are evaluated based on their ability to compress input graphs
Using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G\S)
Terminate when no new substructure is discovered

Properties of Graph Mining Algorithms
Search order
breadth vs. depth
Generation of candidate subgraphs
Apriori vs. pattern growth
Elimination of duplicate subgraphs
passive vs. active
Support calculation
embedding store or not
Discover order of patterns
path -> tree -> graph

Apriori-Based Approach
[Figure: two k-edge frequent graphs are joined (JOIN) to generate (k+1)-edge candidate graphs G1, G2, ..., Gn]

Apriori-Based, Breadth-First Search
Methodology: breadth-first search, joining two graphs
AGM (Inokuchi, et al. PKDD'00)
generates new graphs with one more node
FSG (Kuramochi and Karypis ICDM'01)
generates new graphs with one more edge

Graph Pattern Explosion Problem
If a graph is frequent, all of its subgraphs are frequent: the Apriori property
An n-edge frequent graph may have 2^n subgraphs
Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Constrained Patterns
Density
Diameter
Connectivity
Degree
Min, Max, Avg

Constraint-Based Graph Pattern Mining
Highly connected subgraphs in a large graph usually are not artifacts (group, functionality)
Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph

Graph Clustering
Graph similarity measure
Feature-based similarity measure
Each graph is represented as a feature vector
The similarity is defined by the distance of their corresponding vectors
Frequent subgraphs can be used as features
Structure-based similarity measure
Maximal common subgraph
Graph edit distance: insertion, deletion, and relabel
Graph alignment distance

Graph Classification
Local-structure-based approach
Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph-pattern-based approach
Subgraph patterns from domain knowledge
Subgraph patterns from data mining
Kernel-based approach
Random walk (Gärtner '02, Kashima et al. '02, ICML'03, Mahé et al. ICML'04)
Optimal local assignment (Fröhlich et al. ICML'05)
Boosting (Kudo et al. NIPS'04)

Graph-Pattern-Based Classification
Subgraph patterns from domain knowledge
Molecular descriptors
Subgraph patterns from data mining
General idea
Each graph is represented as a feature vector x = {x1, x2, ..., xn}, where xi is the frequency of the i-th pattern in that graph
Each vector is associated with a class label
Classify these vectors in a vector space

Graph Search
Querying graph databases:
Given a graph database and a query graph, find all the graphs containing this query graph
[Figure: a query graph and a graph database]

Scalability Issue
Sequential scan
Disk I/Os
Subgraph isomorphism testing
An indexing mechanism is needed
DayLight: Daylight.com (commercial)
GraphGrep: Dennis Shasha, et al. PODS'02
Grace: Srinath Srinivasa, et al. ICDE'03

Summary: Graph Mining
Graph mining has wide applications
Frequent and closed subgraph mining methods
gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques
Frequent and discriminative subgraphs are high-quality indexing features
Similarity search in graph databases
Indexing and feature-based matching
Further development and application exploration

Social Network Analysis

Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological Systems
Mining on Social Networks
Summary

Complex systems
Made of many non-identical elements connected by diverse interactions.
NETWORK

Natural Networks and Universality
Consider many kinds of networks:
social, technological, business, economic, content, ...
These networks tend to share certain informal properties:
large scale; continual growth
distributed, organic growth: vertices decide who to link to
interaction restricted to links
mixture of local and long-distance connections
abstract notions of distance: geographical, content, social, ...
Do natural networks share more quantitative universals?
What would these universals be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis

Some Interesting Quantities
Connected components:
how many, and how large?
Network diameter:
maximum (worst-case) or average?
exclude infinite distances? (disconnected components)
the small-world phenomenon
Clustering:
to what extent do links tend to cluster locally?
what is the balance between local and long-distance connections?
what roles do the two types of links play?
Degree distribution:
what is the typical degree in the network?
what is the overall distribution?
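These four quantities can be computed on a small undirected graph with plain breadth-first search. The graph below is a made-up example (a triangle with a tail, plus a separate pair of nodes), and the diameter here is the worst-case finite distance, one of the options the slide raises.

```python
from collections import Counter, deque

# Hypothetical undirected graph as adjacency sets.
GRAPH = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: {5}, 5: {4}}


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable vertex."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in distances:
                distances[v] = distances[u] + 1
                queue.append(v)
    return distances


def connected_components(adj):
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            component = set(bfs_distances(adj, v))
            seen |= component
            components.append(component)
    return components


def diameter(adj):
    """Largest finite shortest-path length (infinite distances excluded)."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbors for w in neighbors if u < w and w in adj[u])
    return links / (k * (k - 1) / 2)


def degree_distribution(adj):
    """Map each degree to the number of vertices with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())
```

Running all-pairs BFS is fine for a toy graph; networks of Web scale need sampling or approximation instead.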

A Canonical Natural Network Has
Few connected components:
often only 1 or a small number, independent of network size
Small diameter:
often a constant independent of network size (like 6)
or perhaps growing only logarithmically with network size, or even shrinking?
typically exclude infinite distances
A high degree of clustering:
considerably more so than for a random network
in tension with small diameter
A heavy-tailed degree distribution:
a small but reliable number of high-degree vertices
often of power-law form

Probabilistic Models of Networks
All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
size of network generated
average degree of a vertex
fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
with high probability, diameter is small
on average, degree distribution has heavy tail
Thus, we're going to need some basic statistics and probability theory

World Wide Web
Nodes: WWW documents
Links: URL links
800 million documents (S. Lawrence, 1999)
ROBOT: collects all URLs found in a document and follows them recursively
R. Albert, H. Jeong, A-L Barabasi, Nature, 401, 130 (1999)
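
World Wide Web
Expected result:
\langle k \rangle \sim 6
P(k = 500) \sim 10^{-99}
N_{WWW} \sim 10^{9}
N(k = 500) \sim 10^{-90}
Real result:
P_{out}(k) \sim k^{-\gamma_{out}}, \quad P_{in}(k) \sim k^{-\gamma_{in}}
\gamma_{out} = 2.45, \quad \gamma_{in} = 2.1
P(k = 500) \sim 10^{-6}
N_{WWW} \sim 10^{9}
N(k = 500) \sim 10^{3}
J. Kleinberg, et al., Proceedings of the ICCC (1999)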
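
World Wide Web
[Figure: a small example network with nodes 1 to 7, illustrating shortest paths: l_{15} = 2 via [1 -> 2 -> 5], l_{17} = 4 via [1 -> 3 -> 4 -> 6 -> 7]; average path length \langle l \rangle = ??]
Finite size scaling: create a network with N nodes with P_{in}(k) and P_{out}(k)
\langle l \rangle = 0.35 + 2.06 \log(N)
19 degrees of separation, based on 800 million webpages [S. Lawrence et al Nature (99)]
R. Albert et al Nature (99)
[Figure: \langle l \rangle vs. N, with data points labeled nd.edu and IBM (A. Broder et al WWW9 (00))]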

What does that mean?
Poisson distribution: exponential network
Power-law distribution: scale-free network

Scale-free Networks
The number of nodes (N) is not fixed
Networks continuously expand by the addition of new nodes
WWW: addition of new nodes
Citation: publication of new papers
The attachment is not uniform
A node is linked with higher probability to a node that already has a large number of links
WWW: new documents link to well-known sites (CNN, Yahoo, Google)
Citation: well-cited papers are more likely to be cited again
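Growth plus non-uniform attachment can be simulated in a few lines. This is a sketch of a preferential-attachment process under simple assumptions: the network starts as a small complete graph, each new node adds `m` links, and targets are drawn in proportion to their current degree. The function name and parameters are illustrative.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Grow a network to n nodes; each new node links to m existing nodes."""
    rng = random.Random(seed)
    edges = []
    stubs = []  # every node appears once per incident edge
    # Seed network: a complete graph on m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            stubs += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Drawing from `stubs` picks a node with probability ~ its degree.
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((new, target))
            stubs += [new, target]
    return edges


def degrees(edges):
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts
```

Early nodes tend to accumulate the most links, which is the mechanism behind the heavy-tailed degree distribution.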

Case 1: Internet Backbone
Nodes: computers, routers
Links: physical lines
(Faloutsos, Faloutsos and Faloutsos, 1999)

Information on the Social Network
Heterogeneous, multi-relational data represented as a graph or network
Nodes are objects
May have different kinds of objects
Objects have attributes
Objects may have labels or classes
Edges are links
May have different kinds of links
Links may have attributes
Links may be directed, are not required to be binary
Links represent relationships and interactions between objects: rich content for mining

What is New for Link Mining Here
Traditional machine learning and data mining approaches assume:
A random sample of homogeneous objects from a single relation
Real-world data sets:
Multi-relational, heterogeneous, and semi-structured
Link Mining
Newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming

A Taxonomy of Common Link Mining Tasks
Object-Related Tasks
Link-based object ranking
Link-based object classification
Object clustering (group detection)
Object identification (entity resolution)
Link-Related Tasks
Link prediction
Graph-Related Tasks
Subgraph discovery
Graph classification
Generative model for graphs

What Is a Link in Link Mining?
Link: relationship among data
Two kinds of linked networks
homogeneous vs. heterogeneous
Homogeneous networks
Single object type and single link type
Single model social networks (e.g., friends)
WWW: a collection of linked Web pages
Heterogeneous networks
Multiple object and link types
Medical network: patients, doctors, diseases, contacts, treatments
Bibliographic network: publications, authors, venues

PageRank: Capturing Page Popularity (Brin & Page '98)
Intuitions
Links are like citations in literature
A page that is cited often can be expected to be more useful in general
PageRank is essentially citation counting, but improves over simple counting
Consider indirect citations (being cited by a highly cited paper counts a lot)
Smoothing of citations (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)

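The PageRank Algorithm (Brin & Page '98)
Random surfing model:
At any page,
With prob. \alpha, randomly jumping to a page
With prob. (1 - \alpha), randomly picking a link to follow
[Figure: a four-page link graph with pages d1, d2, d3, d4]
Transition matrix:
M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}
p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji} \, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N} p_t(d_k)
The second term is the same as \alpha / N (why?)
Stationary (stable) distribution, so we ignore time:
p(d_i) = \sum_{k=1}^{N} \left[ \frac{\alpha}{N} + (1 - \alpha) m_{ki} \right] p(d_k)
\vec{p} = (\alpha I + (1 - \alpha) M)^T \vec{p}, \quad I_{ij} = 1/N
Initial value p(d) = 1/N; iterate until convergence
Essentially an eigenvector problem.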
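The iteration can be run directly on the four-page transition matrix from the slide. This is a minimal power-iteration sketch; the jump probability of 0.15 and the convergence tolerance are assumptions, since the slide leaves the value of the parameter open.

```python
# Row i gives the probabilities of following a link from page d(i+1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def pagerank(matrix, jump=0.15, tolerance=1e-12, max_iterations=1000):
    """Iterate p <- (1 - jump) * M^T p + jump / N until p stops changing."""
    n = len(matrix)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iterations):
        updated = [
            (1 - jump) * sum(matrix[j][i] * p[j] for j in range(n)) + jump / n
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, p)) < tolerance:
            return updated
        p = updated
    return p


scores = pagerank(M)
```

The jump term simplifies to `jump / n` because the scores always sum to 1, which is the answer to the "why?" on the slide.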

Link Prediction
Predict whether a link exists between two entities, based on attributes and other observed links
Applications
Web: predict if there will be a link between two pages
Citation: predicting if a paper will cite another paper
Epidemics: predicting who a patient's contacts are
Methods
Often viewed as a binary classification problem
Local conditional probability model, based on structural and attribute features
Difficulty: sparseness of existing links
Collective prediction, e.g., Markov random field model

Multi-relational Data Mining

Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: efficient clustering by exploring the power-law distribution
Distinct: distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary

Outline
Theme: Knowledge is power, but knowledge is hidden in massive links
Starting with PageRank and HITS
CrossMine: classification of multi-relations by link analysis
CrossClus: clustering over multi-relations by user guidance
More recent work and conclusions

Traditional Data Mining
Work on single flat relations
[Figure: the Doctor, Patient, and Contact relations are flattened into a single relation]
Lose information of linkages and relationships
Cannot utilize information of database structures or schemas

Multi-Relational Data Mining (MRDM)
Motivation
Most structured data are stored in relational databases
MRDM can utilize linkage and structural information
Knowledge discovery in multi-relational environments
Multi-relational rules
Multi-relational clustering
Multi-relational classification
Multi-relational linkage analysis

Applications of MRDM
e-Commerce: discovering patterns involving customers, products, manufacturers, ...
Bioinformatics/medical databases: discovering patterns involving genes, patients, diseases, ...
Networking security: discovering patterns involving hosts, connections, services, ...
Many other relational data sources
Example: Evidence Extraction and Link Discovery (EELD): a DARPA-funded project that emphasizes multi-relational and multi-database linkage analysis

Importance of Multi-relational Classification (from EELD Program Description)
The objective of the EELD Program is to research, develop, demonstrate, and transition critical technology that will enable significant improvement in our ability to detect asymmetric threats, e.g., a loosely organized terrorist group.
Patterns of activity that, in isolation, are of limited significance but, when combined, are indicative of potential threats, will need to be learned.
Addressing these threats can only be accomplished by developing a new level of autonomic information surveillance and analysis to extract, discover, and link together sparse evidence from vast amounts of data sources, in different formats and with differing types and degrees of structure, to represent and evaluate the significance of the related evidence, and to learn patterns to guide the extraction, discovery, linkage and evaluation processes.

MRDM Approaches
Inductive Logic Programming (ILP)
Find models that are coherent with background knowledge
Multi-relational Clustering Analysis
Clustering objects with multi-relational information
Probabilistic Relational Models
Model cross-relational probabilistic distributions
Efficient Multi-Relational Classification
The CrossMine Approach [Yin et al, 2004]

Find a hypothesis that is consistent with background knowledge (training data)
FOIL, Golem, Progol, TILDE, ...
Background knowledge
Relations (predicates), Tuples (ground facts)
Training examples:
Daughter(mary, ann) +
Daughter(eve, tom) +
Daughter(tom, ann) -
Daughter(eve, ann) -
Background knowledge:
Parent(ann, mary), Parent(ann, tom), Parent(tom, eve), Parent(tom, ian)
Female(ann), Female(mary), Female(eve)

Inductive Logic Programming (ILP)
Hypothesis
The hypothesis is usually a set of rules, which can predict certain attributes in certain relations
Daughter(X, Y) <- female(X), parent(Y, X)
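The hypothesis can be checked against the ILP example's ground facts with plain set lookups. The facts below are the Parent and Female tuples from the example; treating the last two Daughter examples as negative is an assumption, since their labels did not survive in the slide text.

```python
# Background knowledge: ground facts.
PARENT = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
FEMALE = {"ann", "mary", "eve"}

# Training examples for Daughter(X, Y); True marks a positive example.
# Assumption: the last two examples are negative.
EXAMPLES = {
    ("mary", "ann"): True,
    ("eve", "tom"): True,
    ("tom", "ann"): False,
    ("eve", "ann"): False,
}


def daughter(x, y):
    """Hypothesis: Daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in FEMALE and (y, x) in PARENT


def is_consistent(hypothesis, examples):
    """True if the hypothesis agrees with every labeled example."""
    return all(hypothesis(x, y) == label for (x, y), label in examples.items())
```

An ILP system searches the space of such rules; this sketch only verifies one candidate rule.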

Automatically Classifying Objects Using Multiple Relations
Why not convert multiple relational data into a single table by joins?
Relational databases are designed by domain experts via semantic modeling (e.g., E-R modeling)
Indiscriminate joins may lose some essential information
One universal relation may not be appealing to efficiency, scalability, and semantics preservation
Our approach to multi-relational classification:
Automatically classifying objects using multiple relations

An Example: Loan Applications
[Figure: an applicant applies for a loan; the bank asks the backend database: approve or not?]

The Backend Database
Target relation: Loan. Each tuple has a class label, indicating whether a loan is paid on time.
Loan (loan-id, account-id, date, amount, duration, payment)
Account (account-id, district-id, frequency, date)
Order (order-id, account-id, bank-to, account-to, amount, type)
Transaction (trans-id, account-id, date, type, operation, amount, balance, symbol)
Card (card-id, disp-id, type, issue-date)
Disposition (disp-id, account-id, client-id)
Client (client-id, birth-date, gender, district-id)
District (district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
How to make decisions on loan applications?

Roadmap
Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study

Rule-based Classification
Applicant: ever bought a house, lives in Chicago -> Approve!
Applicant: just applied for a credit card -> Reject

Rule Generation
Search for good predicates across multiple relations
Loan Applications
Loan ID   Account ID   Amount   Duration   Decision
1         124          1000     12         Yes
2         124          4000     12         Yes
3         108          10000    24         No
4         45           12000    36         No
Accounts
Account ID   Frequency   Open date   District ID
128          monthly     02/27/96    61820
108          weekly      09/23/95    61820
45           monthly     12/09/94    61801
67           weekly      01/01/95    61822
[Figure: Applicants #1 to #4 linked to the Loan Applications relation, which links to Accounts, Orders, Districts, and other relations]

Previous Approaches
To build a rule
Repeatedly find the best predicate
To evaluate a predicate on relation R, first join the target relation with R
Not scalable because
Huge search space (numerous candidate predicates)
Not efficient to evaluate each predicate
To evaluate a predicate
Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, monthly, ?)
first join the loan relation with the account relation
CrossMine is more scalable and more than one hundred times faster on datasets with reasonable sizes

Rule Generation
To generate a rule
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to current rule
        else break
[Figure: positive and negative examples covered as the rule grows from A3=1, to A3=1 && A1=2, to A3=1 && A1=2 && A8=5.]
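
The loop above can be sketched in Python on a single table. This is a minimal illustration, not the CrossMine implementation: it assumes the usual FOIL definition of gain, gain(p) = P(r+p) * [I(r) - I(r+p)] with I(r) = -log2(P(r) / (P(r) + N(r))), and the names `foil_gain`, `generate_rule`, the toy attributes, and the threshold value are all hypothetical.

```python
import math

def info(pos, neg):
    """I(r) = -log2(P / (P + N)) for a rule covering pos positive and neg negative examples."""
    return -math.log2(pos / (pos + neg))

def foil_gain(pos_before, neg_before, pos_after, neg_after):
    """FOIL gain of adding a predicate; zero if no positive example remains covered."""
    if pos_after == 0:
        return 0.0
    return pos_after * (info(pos_before, neg_before) - info(pos_after, neg_after))

def generate_rule(examples, candidates, threshold=0.5, max_len=6):
    """Greedily add the best (attribute, value) predicate while its gain exceeds threshold.

    examples: list of (row_dict, is_positive); candidates: list of (attribute, value).
    """
    covered = list(examples)
    remaining = list(candidates)
    rule = []
    while remaining and len(rule) < max_len:
        pos = sum(1 for _, label in covered if label)
        neg = len(covered) - pos
        if pos == 0 or neg == 0:
            break  # nothing left to separate

        def gain(pred):
            attr, value = pred
            labels = [label for row, label in covered if row.get(attr) == value]
            p = sum(labels)  # True counts as 1
            return foil_gain(pos, neg, p, len(labels) - p)

        best = max(remaining, key=gain)
        if gain(best) <= threshold:
            break
        rule.append(best)
        remaining.remove(best)
        covered = [(row, label) for row, label in covered if row.get(best[0]) == best[1]]
    return rule

# Toy data (hypothetical): two positives and two negatives.
examples = [
    ({"A1": 2, "A3": 1}, True),
    ({"A1": 2, "A3": 1}, True),
    ({"A1": 1, "A3": 1}, False),
    ({"A1": 2, "A3": 0}, False),
]
rule = generate_rule(examples, [("A3", 1), ("A1", 2)])
```

On this toy data the rule grows to A3=1 && A1=2, mirroring the first two steps of the figure. In the multi-relational setting the candidate predicates would come from several relations rather than one table.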

Rule Generation
Start from the target relation
Only the target relation is active
Repeat
Search in all active relations
Search in all relations joinable to active relations
Add the best predicate to the current rule
Set the involved relation to active
Until
The best predicate does not have enough gain
Current rule is too long

Rule Generation: Example
[Figure: the loan database schema from "The Backend Database" (Loan, Account, District, Card, Transaction, Order, Disposition, Client), annotated with the target relation Loan, the first predicate, the second predicate, and the range of search. Caption: Add best predicate to rule.]

Look-one-ahead in Rule Generation
Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of relationship
[Figure: a path from the target relation through a relationship relation on which there is no good predicate.]
Solution of CrossMine:
When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity.

Multi-Relational and Multi-DB Mining
Clustering over multi-relations by user guidance
Mining across multi-relational databases
Mining across multiple heterogeneous data and information repositories
Summary

Motivation 1: Multi-Relational Clustering
[Figure: a university database schema; Student is the target of clustering.]
Work-In(person, group)
Professor(name, office, position)
Open-course(course, semester, instructor)
Course(course-id, name, area)
Group(name, area)
Advise(professor, student, degree)
Publish(author, title)
Publication(title, year, conf)
Student(name, office, position)
Register(student, course, semester, unit, grade)
Traditional clustering works on a single table
Most data is semantically linked with multiple relations
Thus we need information in multiple relations

Motivation 2: User-Guided Clustering
[Figure: the same university schema as in Motivation 1, with Student as the target of clustering and a user hint marked on one attribute.]
User usually has a goal of clustering, e.g., clustering students by research area
User specifies his clustering goal to CrossClus

Comparing with Classification
User-specified feature (in the form of an attribute) is used as a hint, not as class labels
The attribute may contain too many or too few distinct values
E.g., a user may want to cluster students into 20 clusters instead of 3
Additional features need to be included in cluster analysis
[Figure: the user hint shown against all tuples for clustering.]

Comparing with Semi-supervised Clustering
Semi-supervised clustering [Wagstaff, et al. 01; Xing, et al. 02]
User provides a training set consisting of similar and dissimilar pairs of objects
User-guided clustering
User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: semi-supervised clustering vs. user-guided clustering, each shown over all tuples for clustering.]

Semi-supervised Clustering
Much information (in multiple relations) is needed to judge whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: tuples to be compared (Tom Smith, SC1211, TA; Jane Chang, BI205, RA) and the user hint.]

Searching for Pertinent Features
Different features convey different aspects of information
Research area: research group area, conferences of papers, advisor
Demographic info: permanent address, nationality
Academic performances: GPA, GRE score, number of papers
Features conveying the same aspect of information usually cluster objects in more similar ways
E.g., research group areas vs. conferences of publications
Given a user-specified feature
Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features
Overall procedure
1. Start from the user-specified feature
2. Search in neighborhood of existing pertinent features
3. Expand search range gradually
[Figure: the university schema with Student as the target of clustering, the user hint, and search steps 1 and 2 marked.]
Tuple ID propagation [Yin, et al. 04] is used to create multi-relational features
IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple

Roadmap
1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results

Clustering with Multi-Relational Feature
Given a set of L pertinent features $f_1, \dots, f_L$, the similarity between two objects is
$$\mathrm{sim}(t_1, t_2) = \sum_{i=1}^{L} \mathrm{sim}_{f_i}(t_1, t_2) \cdot f_i.\mathrm{weight}$$
Weight of a feature is determined in feature search by its similarity with other pertinent features
For clustering, we use CLARANS, a scalable k-medoids [Ng & Han 94] algorithm
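
The weighted sum above can be written directly as code. The sketch below is only an illustration: the two per-feature similarity functions, the tuple layout, and the weights are hypothetical stand-ins, not the features or weights that CrossClus would derive.

```python
def combined_similarity(t1, t2, features):
    """sim(t1, t2) = sum over features of sim_f(t1, t2) * weight_f.

    features: list of (similarity_function, weight) pairs.
    """
    return sum(weight * sim_f(t1, t2) for sim_f, weight in features)

def same_area(t1, t2):
    """1.0 if both tuples have the same research area, else 0.0 (hypothetical feature)."""
    return 1.0 if t1["area"] == t2["area"] else 0.0

def conference_overlap(t1, t2):
    """Jaccard overlap of the conference sets (hypothetical feature)."""
    union = t1["confs"] | t2["confs"]
    return len(t1["confs"] & t2["confs"]) / len(union) if union else 0.0

s1 = {"area": "DB", "confs": {"SIGMOD", "VLDB"}}
s2 = {"area": "DB", "confs": {"VLDB", "KDD"}}
score = combined_similarity(s1, s2, [(same_area, 0.6), (conference_overlap, 0.4)])
```

Here the two students share an area and one of three conferences, so the score is 0.6 * 1 + 0.4 * (1/3).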


How to Measure Similarity between Clusters?
Single-link (highest similarity between points in two clusters)?
No, because references to different objects can be connected.
Complete-link (minimum similarity between them)?
No, because references to the same object may be weakly connected.
Average-link (average similarity between points in two clusters)?
A better measure

Clustering Procedure
Procedure
Initialization: Use each reference as a cluster
Keep finding and merging the most similar pair of clusters
Until no pair of clusters is similar enough
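
The procedure above, combined with average-link similarity, can be sketched as follows. This is a naive version that recomputes every pairwise cluster similarity in each round; the function names, the numeric toy data, and the `min_sim` threshold are assumptions made for illustration.

```python
from itertools import combinations

def average_link(c1, c2, sim):
    """Average pairwise similarity between the members of two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(items, sim, min_sim):
    """Merge the most similar pair of clusters until no pair reaches min_sim."""
    clusters = [[x] for x in items]  # initialization: each item is its own cluster
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), average_link(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < min_sim:
            break  # no pair of clusters is similar enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Toy similarity on numbers: closer values are more similar.
result = agglomerate([1.0, 1.1, 5.0, 5.2], lambda a, b: 1 / (1 + abs(a - b)), min_sim=0.5)
```

With these inputs the two low values and the two high values end up in separate clusters. Recomputing all similarities after every merge is what makes the naive version expensive on large clusters.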

Efficient Computation
In agglomerative hierarchical clustering, one needs to repeatedly compute similarity between clusters
When merging clusters C1 and C2 into C3, we need to compute the similarity between C3 and any other cluster
Very expensive when clusters are large
We invent methods to compute similarity incrementally
Neighborhood similarity
Random walk probability

Summary
Knowledge is power, but knowledge is hidden in massive links
More stories than Web page rank and search
CrossMine: Classification of multi-relations by link analysis
CrossClus: Clustering over multi-relations by user guidance
Much more to be explored!

Review Questions
State the importance of the sliding window model for analyzing stream data.
Write a note on data stream management systems (DSMS).
State the difference between a one-time query and a continuous query.
How does the Lossy Counting algorithm find frequent items?
Give a note on stream query processing.
What is a time-series database?
Define sequential pattern mining.
What is periodicity analysis?
Distinguish between full periodic patterns and partial periodic patterns.
State the Markov chain model.
State the importance of synopses in the context of stream data.
State the need for biological sequence analysis.
Discuss constraint-based mining.
What is a social network?
Briefly outline multi-relational data mining.

Bibliography

Mining Object, Spatial, and Multimedia Data
Data Mining: Principles and Algorithms, 12/26/2012

Mining Object, Spatial and Multimedia Data
Mining object data sets
Mining spatial databases and data warehouses
Spatial DBMS
Spatial Data Warehousing
Spatial Data Mining
Spatiotemporal Data Mining
Mining multimedia data
Summary

Mining Complex Data Objects: Generalization of Structured Data
Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
E.g., hobby = {tennis, hockey, chess, violin, PC_games} generalizes to {sports, music, e_games}
List-valued or sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization
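
A small sketch of both cases, using the hobby example. The concept hierarchy is an assumption made for illustration (in particular, mapping chess to sports is a guess that reproduces the generalized set given above), and the function names are hypothetical.

```python
# Hypothetical one-level concept hierarchy: value -> higher-level concept.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",      # assumption; the source does not say where chess belongs
    "violin": "music",
    "PC_games": "e_games",
}

def generalize_set(values, hierarchy):
    """Set-valued attribute: map each value to its concept; duplicates collapse."""
    return {hierarchy.get(v, v) for v in values}

def generalize_sequence(values, hierarchy):
    """Sequence-valued attribute: same mapping, but element order is preserved."""
    return [hierarchy.get(v, v) for v in values]

def set_summary(values):
    """A simple 'general behavior' descriptor of the set: its number of elements."""
    return {"count": len(values)}

hobby = {"tennis", "hockey", "chess", "violin", "PC_games"}
generalized = generalize_set(hobby, HOBBY_HIERARCHY)
ordered = generalize_sequence(["tennis", "violin", "hockey"], HOBBY_HIERARCHY)
```

The set version loses multiplicity and order, while the sequence version keeps sports, music, sports in their original positions.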

Generalizing Spatial and Multimedia Data
Spatial data:
Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
Requires the merging of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in the segment
Summarize its style: based on its tone, tempo, or the major musical instruments played

Generalizing Object Data
Object identifier
Generalize to the lowest level of class in the class/subclass hierarchies
Class composition hierarchies
Generalize only those closely related in semantics to the current one
Construction and mining of object cubes
Extend the attribute-oriented induction method
Apply a sequence of class-based generalization operators on different attributes
Continue until getting a small number of generalized objects that can be summarized concisely in high-level terms
Implementation
Examine each attribute, generalize it to simple-valued data
Construct a multidimensional data cube (object cube)
Problem: it is not always desirable to generalize a set of values to single-valued data

Ex.: Plan Mining by Divide and Conquer
Plan: a sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat>
Plan mining: extraction of important or significant generalized (sequential) patterns from a plan base (a large collection of plans)
E.g., discover travel patterns in an air flight database, or find significant patterns from the sequences of actions in the repair of automobiles
Method
Attribute-oriented induction on sequence data
A generalized travel plan: <small-big*-small>
Divide & conquer: Mine characteristics for each subsequence
E.g., big*: same airline, small-big: nearby region

A Travel Database for Plan Mining
Example: Mining a travel plan base
Travel plan table:
plan#  action#  departure  depart_time  arrival  arrival_time  airline
1      1        ALB        800          JFK      900           TWA
1      2        JFK        1000         ORD      1230          UA
1      3        ORD        1300         LAX      1600          UA
1      4        LAX        1710         SAN      1800          DAL
2      1        SPI        900          ORD      950           AA
...    ...      ...        ...          ...      ...           ...
Airport info table:
airport_code  city  state  region  airport_size
...           ...   ...    ...     ...

Multidimensional Analysis
A multi-D model for the plan base
Strategy
Generalize the plan base in different directions
Look for sequential patterns in the generalized plans
Derive high-level plans

Spatial DBMS
Spatial Data Mining
Summary

What Is a Spatial Database System?
Geometric, geographic or spatial data: space-related data
Example: Geographic space (2-D abstraction of earth surface), VLSI design, model of human brain, 3-D space representing the arrangement of chains of protein molecule.
Spatial database system vs. image database systems.
Image database system: handling digital raster image (e.g., satellite sensing, computer tomography), may also contain techniques for object analysis and extraction from images and some spatial database functionality.
Spatial (geometric, geographic) database system: handling objects in space that have identity and well-defined extents, locations, and relationships.

GIS (Geographic Information System)
Analysis and visualization of geographic data
Common analysis functions of GIS
Search (thematic search, search by region)
Location analysis (buffer, corridor, overlay)
Terrain analysis (slope/aspect, drainage network)
Flow analysis (connectivity, shortest path)
Distribution (nearest neighbor, proximity, change detection)
Spatial analysis/statistics (pattern, centrality, similarity, topology)
Measurements (distance, perimeter, shape, adjacency, direction)

Spatial DBMS (SDBMS)
SDBMS is a software system that
supports spatial data models, spatial ADTs, and a query language supporting them
supports spatial indexing, efficient spatial operations, and query optimization
can work with an underlying DBMS
Examples
Oracle Spatial Data Cartridge
ESRI Spatial Data Engine

Modeling Spatial Objects
What needs to be represented?
Two important alternative views
Single objects: distinct entities arranged in space, each of which has its own geometric description
E.g., modeling cities, forests, rivers
Spatially related collection of objects: describe space itself (about every point in space)
E.g., modeling land use, partition of a country into districts

Modeling Single Objects: Point, Line and Region
Point: location only but not extent
Line (or a curve usually represented by a polyline, a sequence of line segments):
Moving through space, or connections in space (roads, rivers, cables, etc.)
Region:
Something having extent in 2D space (country, lake, park). It may have a hole or consist of several disjoint pieces.

Modeling Spatially Related Collection of Objects
Modeling spatially related collection of objects: plane partitions and networks.
A partition: a set of region objects that are required to be disjoint (e.g., a thematic map). There often exist pairs of objects with a common boundary (adjacency relationship).
A network: a graph embedded into the plane, consisting of a set of point objects, forming its nodes, and a set of line objects describing the geometry of the edges, e.g., highways, rivers, power supply lines.
Other interesting spatially related collections of objects: nested partitions, or a digital terrain (elevation) model.

Spatial Data Types and Models
Field-based model: raster data
Framework: partitioning of space
Object-based model: vector model
Point, line, polygon; objects, attributes
Example: forest stands on the square with corners (0,0), (4,0), (4,4), (0,4)
(a) Map: Pine in the upper half, Fir in the lower left, Oak in the lower right
(b) Object viewpoint of forest stands:
Area-ID  Dominant Tree Species  Area/Boundary
FS1      Pine                   [(0,2),(4,2),(4,4),(0,4)]
FS2      Fir                    [(0,0),(2,0),(2,2),(0,2)]
FS3      Oak                    [(2,0),(4,0),(4,2),(2,2)]
(c) Field viewpoint of forest stands:
$$f(x,y) = \begin{cases} \text{Pine}, & 0 \le x \le 4;\ 2 \le y \le 4 \\ \text{Fir}, & 0 \le x \le 2;\ 0 \le y \le 2 \\ \text{Oak}, & 2 \le x \le 4;\ 0 \le y \le 2 \end{cases}$$

Spatial Query Language
Spatial query language
Spatial data types, e.g., point, line segment, polygon, ...
Spatial operations, e.g., overlap, distance, nearest neighbor, ...
Callable from a query language (e.g., SQL3) of the underlying DBMS
SELECT S.name
FROM Senator S
WHERE S.district.Area() > 300
Standards
SQL3 (a.k.a. SQL 1999) is a standard for query languages
OGIS is a standard for spatial data types and operators
Both standards enjoy wide support in industry

Query Processing
Efficient algorithms to answer spatial queries
Common strategy: filter and refine
Filter: Query Region overlaps with MBRs (minimum bounding rectangles) of B, C, D
Refine: Query Region overlaps with B, C
[Figure: data objects A, B, C, D with their MBRs and a query region; the FILTER step keeps B, C, D and the REFINE step keeps B, C.]
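
The two steps can be sketched as below. To keep the exact geometry test short, the data objects are modelled as circles and the query region as an axis-aligned rectangle; this modelling choice, the coordinates, and all function names are assumptions made for the example.

```python
def mbr_of_circle(circle):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle (cx, cy, r)."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)

def rects_overlap(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circle_intersects_rect(circle, rect):
    """Exact test: distance from the center to the nearest rectangle point is at most r."""
    cx, cy, r = circle
    nx = min(max(cx, rect[0]), rect[2])
    ny = min(max(cy, rect[1]), rect[3])
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= r * r

def filter_and_refine(objects, query):
    """Filter with cheap MBR tests, then refine candidates with the exact test."""
    candidates = [name for name, c in objects.items()
                  if rects_overlap(mbr_of_circle(c), query)]
    answers = [name for name in candidates
               if circle_intersects_rect(objects[name], query)]
    return candidates, answers

objects = {
    "A": (10.0, 10.0, 1.0),  # far from the query region
    "B": (2.0, 2.0, 1.0),    # inside the query region
    "C": (5.0, 2.0, 1.5),    # crosses the right edge
    "D": (5.0, 5.0, 1.2),    # MBR touches the corner, the circle itself does not
}
candidates, answers = filter_and_refine(objects, (0.0, 0.0, 4.0, 4.0))
```

As in the slide, the filter step keeps B, C and D, and the refinement step drops D because only its bounding rectangle overlaps the query region.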

Join Query Processing
Determining Intersection Rectangle
Plane Sweep Algorithm
Plane sweep filter identifies 5 intersections for refinement step
[Figure: (a) rectangles R1-R4 and S1-S3 in the x-y plane with a vertical sweep line; (b) a rectangle T with lower-left corner (T.xl, T.yl) and upper-right corner (T.xu, T.yu); (c) the rectangles listed as R4, S2, S1, R1, S3, R2, R3.]
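
A plane sweep over two sets of rectangles can be sketched as follows. This is one common formulation, not necessarily the exact variant pictured: rectangles are tuples (xl, yl, xu, yu), the sweep visits them by increasing xl, and each set keeps a list of rectangles that the sweep line still crosses. The sample rectangles are made up.

```python
def plane_sweep_join(R, S):
    """Return index pairs (i, j) such that rectangle R[i] intersects S[j]."""
    rects = (R, S)
    # One event per rectangle, ordered by its left edge.
    events = sorted(
        [(r[0], 0, i) for i, r in enumerate(R)] +
        [(s[0], 1, j) for j, s in enumerate(S)]
    )
    active = ([], [])  # indices of rectangles still crossed by the sweep line
    pairs = []
    for x, side, idx in events:
        current = rects[side][idx]
        other = 1 - side
        # Drop rectangles of the other set that ended before the sweep line.
        active[other][:] = [k for k in active[other] if rects[other][k][2] >= x]
        # Every remaining one overlaps in x; check the y-intervals.
        for k in active[other]:
            o = rects[other][k]
            if current[1] <= o[3] and o[1] <= current[3]:
                pairs.append((idx, k) if side == 0 else (k, idx))
        active[side].append(idx)
    return sorted(pairs)

R = [(0, 0, 2, 2), (5, 5, 6, 6)]
S = [(1, 1, 3, 3), (2.5, 0, 4, 1), (5.5, 0, 7, 4)]
pairs = plane_sweep_join(R, S)
```

Only R[0] and S[0] intersect in this example. In a filter-and-refine setting the rectangles would be MBRs, and the reported pairs would go on to the refinement step.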

File Organization and Indices
SDBMS: Dataset is in secondary storage, e.g., disk
Space-filling curves: an ordering on the locations in a multi-dimensional space
Linearize a multi-dimensional space
Helps search efficiently
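
One widely used space-filling curve is the Z-order (Morton) curve, which linearizes a 2-D grid by interleaving the bits of the two coordinates. The slide does not name a specific curve, so Z-order is chosen here purely as an example; `z_order` is a hypothetical helper name.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y; x takes the even bit positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting grid cells by their Z-value yields a one-dimensional ordering
# that an ordinary one-dimensional index can store.
cells = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered_cells = sorted(cells, key=lambda c: z_order(*c))
```

Cells that are close on the curve tend to be close in space, which is why a range search over the linear order can serve as a coarse spatial filter.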

File Organization and Indices
Spatial Indexing
B-tree works on spatial data with a space-filling curve
R-tree: height-balanced extension of the B+ tree
Objects are represented as MBRs
Provides better performance
[Figure: R-tree example in which the MBRs A, B, C enclose the objects d through j.]

Spatial Query Optimization
A spatial operation can be processed using different strategies
Computation cost of each strategy depends on many parameters
Query optimization is the process of
ordering operations in a query and
selecting an efficient strategy for each operation
based on the details of a given dataset

Spatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository
Spatial data integration: a big issue
Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
Geo-specific formats (geographic vs. equal area projection, etc.)
Spatial data cube: multidimensional spatial database
Both dimensions and measures may contain spatial components

Dimensions and Measures in Spatial Data Warehouse
Dimensions
nonspatial: e.g., "25-30 degrees" generalizes to "hot" (both are strings)
spatial-to-nonspatial: e.g., Seattle generalizes to the description "Pacific Northwest" (as a string)
spatial-to-spatial: e.g., Seattle generalizes to Pacific Northwest (as a spatial region)
Measures
numerical (e.g., monthly revenue of a region): distributive (e.g., count, sum), algebraic (e.g., average), holistic (e.g., median, rank)
spatial: collection of spatial pointers (e.g., pointers to all regions with temperature of 25-30 degrees in July)

Spatial Association Analysis
Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or nonspatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
s% is the support and c% is the confidence of the rule
Examples
1) is_a(x, large_town) ∧ intersect(x, highway) ⇒ adjacent_to(x, water) [7%, 85%]
2) What kinds of objects are typically located close to golf courses?
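
Support and confidence of a rule A ⇒ B can be computed once each object is described by the set of predicates that hold for it. The sketch below assumes the standard definitions (support: fraction of all objects satisfying both A and B; confidence: fraction of objects satisfying A that also satisfy B); the five towns are invented and do not reproduce the [7%, 85%] figures above.

```python
def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of antecedent => consequent.

    objects: list of sets, each holding the predicates true for one object.
    """
    with_a = [o for o in objects if antecedent <= o]
    with_both = [o for o in with_a if consequent <= o]
    support = len(with_both) / len(objects)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

# Hypothetical towns described by precomputed spatial predicates.
towns = [
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway"},
    {"adjacent_to_water"},
]
support, confidence = rule_stats(
    towns, {"large_town", "intersects_highway"}, {"adjacent_to_water"}
)
```

Here three of five towns satisfy both sides (support 0.6) and three of the four towns satisfying the antecedent also satisfy the consequent (confidence 0.75). Evaluating the spatial predicates themselves is the expensive part.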

Progressive Refinement Mining of Spatial Association Rules
Hierarchy of spatial relationship:
g_close_to: near_by, touch, intersect, contain, etc.
First search for rough relationship and then refine it
Two-step mining of spatial association:
Step 1: Rough spatial computation (as a filter)
Using MBR or R-tree for rough estimation
Step 2: Detailed spatial algorithm (as refinement)
Apply only to those objects which have passed the rough spatial association test (no less than min_support)

Spatial Autocorrelation
Spatial data tends to be highly self-correlated
Example: Neighborhood, Temperature
Items in traditional data are independent of each other, whereas properties of locations in a map are often autocorrelated.
First law of geography:
Everything is related to everything else, but nearby things are more related than distant things.

Spatial Classification
Methods in classification
Decision-tree classification, Naive Bayesian classifier + boosting, neural network, logistic regression, etc.
Association-based multi-dimensional classification
Example: classifying house value based on proximity to lakes, highways, mountains, etc.
Assuming learning samples are independent of each other
Spatial autocorrelation violates this assumption!
Popular spatial classification methods
Spatial autoregression (SAR)
Markov random field (MRF)

Spatial Auto-Regression
Linear regression
$Y = X\beta + \varepsilon$
Spatial autoregressive regression (SAR)
$Y = \rho W Y + X\beta + \varepsilon$
$W$: neighborhood matrix
$\rho$ models the strength of spatial dependencies
$\varepsilon$: error vector
The estimates of $\rho$ and $\beta$ can be derived using maximum likelihood theory or Bayesian statistics
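
Because $Y$ appears on both sides of the SAR equation, it can help to solve for it once. The short derivation below is a standard rearrangement and assumes that $I - \rho W$ is invertible, where $I$ is the identity matrix of the same size as $W$:

```latex
\begin{aligned}
Y &= \rho W Y + X\beta + \varepsilon \\
(I - \rho W)\,Y &= X\beta + \varepsilon \\
Y &= (I - \rho W)^{-1}\,(X\beta + \varepsilon)
\end{aligned}
```

Setting $\rho = 0$ gives back the ordinary linear regression $Y = X\beta + \varepsilon$, so the classical model is the special case without spatial dependence.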

Markov Random Field Based Bayesian Classifiers
Bayesian classifiers
MRF
A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighborhood matrix) is called a Markov Random Field.
$$\Pr(C_i \mid X, L_i) = \frac{\Pr(X \mid C_i, L_i)\,\Pr(C_i \mid L_i)}{\Pr(X)}$$
$L_i$ denotes the set of labels in the neighborhood of $s_i$, excluding the label at $s_i$
$\Pr(C_i \mid L_i)$ can be estimated from training data by examining the ratios of the frequencies of class labels to the total number of locations
$\Pr(X \mid C_i, L_i)$ can be estimated using kernel functions from the observed values in the training dataset

Spatial Trend Analysis
Function
Detect changes and trends along a spatial dimension
Study the trend of nonspatial or spatial data changing with space
Application examples
Observe the trend of changes of the climate or vegetation with increasing distance from an ocean
Crime rate or unemployment rate change with regard to city geo-distribution

Spatial Cluster Analysis
Mining clusters: k-means, k-medoids, hierarchical, density-based, etc.
Analysis of distinct features of the clusters

Constraints-Based Clustering
Constraints on individual objects
Simple selection of relevant objects before clustering
Clustering parameters as constraints
K-means, density-based: radius, min # of points
Constraints specified on clusters using SQL aggregates
Sum of the profits in each cluster > $1 million
Constraints imposed by physical obstacles
Clustering with obstructed distance

Constrained Clustering: Planning ATM Locations
[Figure: left, spatial data with obstacles (a river and a mountain); right, clusters C1-C4 obtained by clustering without taking obstacles into consideration.]

Spatial Outlier Detection
Outlier
Global outliers: Observations which are inconsistent with the rest of the data
Spatial outliers: A local instability of non-spatial attributes
Spatial outlier detection
Graphical tests
Variogram clouds
Moran scatterplots
Quantitative tests
Scatterplots
Spatial statistic Z(S(x))
Quantitative tests are more accurate than graphical tests


Similarity Search in Multimedia Data
Description-based retrieval systems
Build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation
Labor-intensive if performed manually
Results are typically of poor quality if automated
Content-based retrieval systems
Support retrieval based on the image content, such as color histogram, texture, shape, objects, and wavelet transforms

Queries in Content-Based Retrieval Systems
Image sample-based queries
Find all of the images that are similar to the given image sample
Compare the feature vector (signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database
Image feature specification queries
Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
Match the feature vector with the feature vectors of the images in the database

Approaches Based on Image Signature
Color histogram-based signature
The signature includes color histograms based on color composition of an image regardless of its scale or orientation
No information about shape, location, or texture
Two images with similar color composition may contain very different shapes or textures, and thus could be completely unrelated in semantics
Multifeature composed signature
Define different distance functions for color, shape, location, and texture, and subsequently combine them to derive the overall result
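
A minimal sketch of a color histogram signature follows. It assumes 8-bit RGB pixels given as a flat list of (r, g, b) tuples, a uniform quantization into 4 bins per channel, and an L1 distance between normalized histograms; these are illustrative choices, not the signature of any particular retrieval system.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram over quantized RGB colors (bins_per_channel ** 3 bins)."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist]

def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical color composition."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

image = [(255, 0, 0)] * 3 + [(0, 0, 255)]     # three red pixels, one blue
rearranged = list(reversed(image))            # same pixels, different layout
enlarged = image * 4                          # same composition at a larger scale
green = [(0, 255, 0)] * 4

d_layout = l1_distance(color_histogram(image), color_histogram(rearranged))
d_scale = l1_distance(color_histogram(image), color_histogram(enlarged))
d_other = l1_distance(color_histogram(image), color_histogram(green))
```

The first two distances are zero, which illustrates both the strength (insensitivity to scale and arrangement) and the weakness (no shape or location information) noted above.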

Wavelet Analysis
Wavelet-based signature
Use the dominant wavelet coefficients of an image as its signature
Wavelets capture shape, texture, and location information in a single unified framework
Improved efficiency and reduced the need for providing multiple search primitives
May fail to identify images containing similar objects that are in different locations.

One Signature for the Entire Image?
WALRUS: [NRS99] by Natsev, Rastogi, and Shim
Similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other
Wavelet-based signature with region-based granularity
Define regions by clustering signatures of windows of varying sizes within the image
Signature of a region is the centroid of the cluster
Similarity is defined in terms of the fraction of the area of the two images covered by matching pairs of regions from two images

Multidimensional Analysis of Multimedia Data
Multimedia data cube
Design and construction similar to that of traditional data cubes from relational data
Contain additional dimensions and measures for multimedia information, such as color, texture, and shape
The database does not store images but their descriptors
Feature descriptor: a set of vectors for each visual characteristic
Color vector: contains the color histogram
MFC (Most Frequent Color) vector: five color centroids
MFO (Most Frequent Orientation) vector: five edge orientation centroids
Layout descriptor: contains a color layout vector and an edge layout vector

Multi-Dimensional Search in Multimedia Databases
[Figure not recoverable.]

Multi-Dimensional Analysis in Multimedia Databases
[Figure: color histogram and texture layout views.]
Mining Multimedia Databases
Refining or combining searches
Search for "blue sky" (top layout grid is blue)
Search for "blue sky and green meadows" (top layout grid is blue and bottom is green)
Search for "airplane in blue sky" (top layout grid is blue and keyword = "airplane")

The Data Cube and the Sub-Space Measurements
[Figure: a data cube over format (JPEG, GIF), colour (RED, WHITE, BLUE), and size, with group-bys and cross tabs by format, by colour, by size, by format & size, by colour & size, and by format & colour, plus sums.]
Dimensions and measurements: format of image, duration, colors, textures, keywords, size, width, height, Internet domain of image, Internet domain of parent pages, image popularity

Mining Multimedia Databases in
[Slide title truncated in the source; figure not recoverable.]

Classification in MultiMediaMiner
[Figure not recoverable.]

Mining Associations in Multimedia Data
Special features:
Need # of occurrences besides Boolean existence, e.g.,
"Two red squares and one blue circle" implies theme "air show"
Need spatial relationships
Blue on top of white squared object is associated with brown bottom
Need multi-resolution and progressive refinement mining
It is expensive to explore detailed associations among objects at high resolution
It is crucial to ensure the completeness of search at multi-resolution space

Spatial Relationships from Layout
[Figure: property P1 on-top-of property P2; property P1 next-to property P2; different resolution hierarchy.]

From Coarse to Fine Resolution Mining
[Figure not recoverable.]

Challenge: Curse of Dimensionality
Difficult to implement a data cube efficiently given a large number of dimensions, especially serious in the case of multimedia data cubes
Many of these attributes are set-oriented instead of single-valued
Restricting number of dimensions may lead to the modeling of an image at a rather rough, limited, and imprecise scale
More research is needed to strike a balance between efficiency and power of representation

Summary
Mining object data needs feature/attribute-based generalization methods
Spatial, spatiotemporal and multimedia data mining is one of the important research frontiers in data mining with broad applications
Spatial data warehousing, OLAP and mining facilitate multidimensional spatial analysis and finding spatial associations, classifications and trends
Multimedia data mining needs content-based retrieval and similarity search integrated with mining methods

Mining Text and Web Data

Text mining, natural language processing and information extraction: An Introduction
Text categorization methods
Mining Web linkage structures
Summary

Mining Text Data: An Introduction
Data Mining / Knowledge Discovery applies to structured data, multimedia, free text, and hypertext. The same facts in each form:
Structured data: HomeLoan (Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years)
Multimedia: Loans($200K,[map],...)
Free text: Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.
Hypertext: <a href>Frank Rizzo</a> Bought <a href>this home</a> from <a href>Lake View Real Estate</a> In <b>1992</b>. <p>...

Bag-of-Tokens Approaches
Documents -> Feature Extraction -> Token Sets
Document: Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or ...
Token set: nation - 5, civil - 1, war - 2, men - 2, died - 4, people - 5, Liberty - 1, God - 1
Loses all order-specific information!
Severely limits context!
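
A bag of tokens can be built with a few lines of Python. The tokenizer below (lowercase, runs of letters only) is a deliberately crude assumption; real systems would add stop lists and stemming. The counts it produces are for the single sentence used here, not for the full speech summarized above.

```python
import re
from collections import Counter

def bag_of_tokens(text):
    """Count lowercase alphabetic tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

sentence = ("Four score and seven years ago our fathers brought forth on this "
            "continent, a new nation, conceived in Liberty, and dedicated to the "
            "proposition that all men are created equal.")
bag = bag_of_tokens(sentence)

# Two texts with opposite meanings can share one bag.
same_bag = bag_of_tokens("dog chases boy") == bag_of_tokens("boy chases dog")
```

The final comparison shows the loss of order-specific information: both word orders map to the same token counts.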

Natural Language Processing
Example sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase; these combine into Verb Phrases and then a Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_). gives Scared(b1)
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

General NLP: Too Difficult!
Word-level ambiguity
"design" can be a noun or a verb (Ambiguous POS)
"root" has multiple meanings (Ambiguous sense)
Syntactic ambiguity
"natural language processing" (Modification)
"A man saw a boy with a telescope." (PP Attachment)
Anaphora resolution
"John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
Presupposition
"He has quit smoking." implies that he smoked before.
Humans rely on context to interpret (when possible).
This context may extend beyond a given document!
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

Shallow Linguistics
Progress on useful sub-goals:
English Lexicon
Part-of-Speech Tagging
Word Sense Disambiguation
Phrase Detection / Parsing

WordNet
An extensive lexical network for the English language
Contains over 138,838 words.
Several graphs, one for each part-of-speech.
Synsets (synonym sets), each defining a semantic sense.
Relationship information (antonym, hyponym, meronym, ...)
Downloadable for free (UNIX, Windows)
Expanding to other languages (Global WordNet Association)
Funded >$3 million, mainly government (translation interest)
Founder George Miller, National Medal of Science, 1991.
[Figure: synonym and antonym links among watery, moist, wet, damp on one side and parched, dry, arid, anhydrous on the other.]

Part-of-Speech Tagging
Training data (annotated text):
This sentence serves as an example of annotated text
Det  N  V1  P  Det  N  P  V2  N
New input: "This is a new sentence." -> POS Tagger -> Det Aux Det Adj N
Pick the most likely tag sequence, i.e., the tags $t_1, \dots, t_k$ that maximize $p(w_1, \dots, w_k, t_1, \dots, t_k)$
Independent assignment (most common tag):
$$p(w_1, \dots, w_k, t_1, \dots, t_k) = p(t_1 \mid w_1) \cdots p(t_k \mid w_k)\; p(w_1) \cdots p(w_k)$$
Partial dependency (HMM):
$$p(w_1, \dots, w_k, t_1, \dots, t_k) = \prod_{i=1}^{k} p(w_i \mid t_i)\; p(t_i \mid t_{i-1})$$
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)
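
The HMM factorization above can be tried on the example sentence with a brute-force search over all tag sequences. This is a sketch only: every probability in the tables is invented for illustration, a start symbol `<s>` is assumed for $t_0$, and exhaustive enumeration stands in for the dynamic programming (Viterbi) search that a real tagger would use.

```python
from itertools import product

def joint_probability(words, tags, emit, trans, start="<s>"):
    """p(w, t) = product over i of p(w_i | t_i) * p(t_i | t_{i-1})."""
    prob, prev = 1.0, start
    for word, tag in zip(words, tags):
        prob *= emit.get((word, tag), 0.0) * trans.get((prev, tag), 0.0)
        prev = tag
    return prob

def best_tag_sequence(words, tagset, emit, trans):
    """Enumerate every tag sequence and keep the most probable one."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: joint_probability(words, tags, emit, trans))

# Hypothetical emission probabilities p(word | tag); "new" is ambiguous.
emit = {("this", "Det"): 0.3, ("is", "Aux"): 0.5, ("a", "Det"): 0.4,
        ("new", "Adj"): 0.2, ("new", "N"): 0.01, ("sentence", "N"): 0.1}
# Hypothetical transition probabilities p(tag | previous tag).
trans = {("<s>", "Det"): 0.5, ("Det", "Aux"): 0.2, ("Aux", "Det"): 0.4,
         ("Det", "Adj"): 0.3, ("Adj", "N"): 0.6, ("Det", "N"): 0.5, ("N", "N"): 0.1}

words = ["this", "is", "a", "new", "sentence"]
tags = best_tag_sequence(words, ["Det", "Aux", "Adj", "N"], emit, trans)
```

With these made-up numbers the search returns Det Aux Det Adj N, the tagging shown for the new sentence; the alternative reading of "new" as a noun scores lower because of the transition probabilities.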

Word Sense Disambiguation
Example: The difficulties of computational linguistics are rooted in ambiguity. (Which sense of "rooted"?)
POS tags around the target word: N Aux V P N
Supervised Learning
Features:
Neighboring POS tags (N Aux V P N)
Neighboring words (linguistics are rooted in ambiguity)
Stemmed form (root)
Dictionary/Thesaurus entries of neighboring words
High co-occurrence words (plant, tree, origin, ...)
Other senses of word within discourse
Algorithms:
Rule-based Learning (e.g., IG guided)
Statistical Learning (e.g., Naive Bayes)
Unsupervised Learning (e.g., Nearest Neighbor)

Parsing
Choose most likely parse tree
Probabilistic CFG
Grammar:
S -> NP VP      1.0
NP -> Det BNP   0.3
NP -> BNP       0.4
NP -> NP PP     0.3
BNP -> N
VP -> V
VP -> Aux V NP
VP -> VP PP
PP -> P NP      1.0
...
Lexicon:
V -> chasing    0.01
Aux -> is
N -> dog        0.003
N -> boy
N -> playground
Det -> the
Det -> a
P -> on
...
Sentence: A dog is chasing a boy on the playground
Parse tree 1 (the PP "on the playground" attaches to the verb phrase): probability of this tree = 0.000015
Parse tree 2 (the PP "on the playground" attaches to the noun phrase "a boy"): probability of this tree = 0.000011
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)

Obstacles
Ambiguity
"A man saw a boy with a telescope."
Computational Intensity
Imposes a context horizon.
Text Mining NLP Approach:
1. Locate promising fragments using fast IR methods (bag-of-tokens).
2. Only apply slow NLP techniques to promising fragments.

Summary: Shallow NLP
However, shallow NLP techniques are feasible and useful:
Lexicon: machine-understandable linguistic knowledge
Possible senses, definitions, synonyms, antonyms, type-of, etc.
POS Tagging: limit ambiguity (word/POS), entity extraction
"...research interests include text mining as well as bioinformatics." (NP, N)
WSD: stem/synonym/hyponym matches (doc and query)
Query: "Foreign cars"  Document: "I'm selling a 1976 Jaguar"
Parsing: logical view of information (inference?, translation?)
"A man saw a boy with a telescope."
Even without complete NLP, any additional knowledge extracted from text data can only be beneficial.
Ingenuity will determine the applications.

Text information system and information retrieval
Summary

Text Databases and IR
Text databases (document databases)
Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
Data stored is usually semi-structured
Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval
A field developed in parallel with database systems
Information is organized into (a large number of) documents
Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Information Retrieval
Typical IR systems
Online library catalogs
Online document management systems
Information retrieval vs. database systems
Some DB problems are not present in IR, e.g., update, transaction management, complex objects
Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
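Basic Measures for Text Retrieval
(Figure: Venn diagram of All Documents containing the Relevant set, the Retrieved set, and their overlap, Relevant & Retrieved.)
Precision: the percentage of retrieved documents that are in fact relevant to the
query (i.e., "correct" responses)
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
Recall: the percentage of documents that are relevant to the query and were, in
fact, retrieved
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$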
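The two measures above can be computed directly from sets of document ids. The following is a minimal sketch; the function name `precision_recall` and the sample ids are illustrative assumptions, not part of the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are relevant AND retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7})
print(p, r)  # 0.4 0.5
```

Retrieving more documents tends to raise recall while lowering precision, which is why the two are usually reported together.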
Information Retrieval Techniques
Basic Concepts
A document can be described by a set of representative
keywords called index terms.
Different index terms have varying relevance when used to
describe document contents.
This effect is captured through the assignment of numerical
weights to each index term of a document (e.g., frequency,
tf-idf).
DBMS Analogy
Index Terms: Attributes
Weights: Attribute Values

Information Retrieval Techniques
Index Terms (Attribute) Selection:
Stop list
Word stem
Index terms weighting methods
Terms × Documents Frequency Matrices
Information Retrieval Models:
Boolean Model
Vector Model
Probabilistic Model
Boolean Model
Consider that index terms are either present or absent in a
document
As a result, the index term weights are assumed to be all
binary
A query is composed of index terms linked by three
connectives: not, and, and or
E.g.: car and repair, plane or airplane
The Boolean model predicts that each document is either
relevant or non-relevant based on the match of a
document to the query
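Because the weights are binary, the three connectives map onto set operations over document ids. A minimal sketch, assuming a tiny made-up collection (`docs`) and a hypothetical helper `having`:

```python
# Hypothetical three-document collection, keyed by document id.
docs = {
    1: "car repair shop downtown",
    2: "airplane repair manual",
    3: "plane tickets and car rental",
}
all_ids = set(docs)


def having(term):
    """Ids of documents in which the term is present (binary weight)."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


car_and_repair = having("car") & having("repair")         # and -> intersection
plane_or_airplane = having("plane") | having("airplane")  # or  -> union
not_car = all_ids - having("car")                          # not -> complement

print(car_and_repair, plane_or_airplane, not_car)  # {1} {2, 3} {2}
```

Each document is either in the answer set or not; there is no ranking among the matches.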

Keyword-Based Retrieval
A document is represented by a string, which can be identified
by a set of keywords
Queries may use expressions of keywords
E.g., car and repair shop, tea or coffee, DBMS but not Oracle
Queries and retrieval should consider synonyms, e.g., repair
and maintenance
Major difficulties of the model
Synonymy: A keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining
Polysemy: The same keyword may mean different things in
different contexts, e.g., mining

Similarity-Based Retrieval in Text Data
Finds similar documents based on a set of common keywords
Answer should be based on the degree of relevance based on
the nearness of the keywords, relative frequency of the
keywords, etc.
Basic techniques
Stop list
Set of words that are deemed "irrelevant", even though
they may appear frequently
E.g., a, the, of, for, to, with, etc.
Stop lists may vary when the document set varies

Similarity-Based Retrieval in Text Data
Word stem
Several words are small syntactic variants of each other
since they share a common word stem
E.g., drug, drugs, drugged
A term frequency table
Each entry frequent_table(i, j) = # of occurrences of the
word t_i in document d_j
Usually, the ratio instead of the absolute number of
occurrences is used
Similarity metrics: measure the closeness of a document to a
query (a set of keywords)
Relative term occurrences
Cosine distance:
$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|}$$
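The pieces described above (stop list, relative term occurrences, cosine measure) can be combined in a few lines. This is a sketch under stated assumptions: the stop list is just the sample words from the slides, tokenization is a plain whitespace split, and stemming is left out; `term_vector` and `cosine` are illustrative names.

```python
import math
from collections import Counter

# Assumption: a tiny stop list taken from the examples on the slides.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}


def term_vector(text):
    """Relative term frequencies after stop-list removal (no stemming)."""
    tokens = [t for t in text.lower().split() if t not in STOP_LIST]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cosine(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|) for sparse dict vectors."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


query = term_vector("text mining")
doc = term_vector("the mining of text with a search engine")
print(round(cosine(query, doc), 3))
```

A value of 1 means the two vectors point in the same direction, and 0 means they share no terms.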
Indexing Techniques
Inverted index
Maintains two hash- or B+-tree indexed tables:
document_table: a set of document records <doc_id, postings_list>
term_table: a set of term records <term, postings_list>
Answer query: Find all docs associated with one or a set of terms
+ easy to implement
- does not handle synonymy and polysemy well, and posting lists could be
too long (storage could be very large)
Signature file
Associate a signature with each document
A signature is a representation of an ordered list of terms that describe the
document
Order is obtained by frequency analysis, stemming and stop lists
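To make the term_table idea concrete, here is a minimal in-memory sketch of an inverted index. It uses a plain dictionary in place of the hash or B+-tree tables named above; the function names and sample documents are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def lookup(index, terms):
    """Ids of documents that contain every one of the query terms."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = {1: "text mining", 2: "text search engine", 3: "travel map"}
index = build_inverted_index(docs)
print(index["text"])                      # [1, 2]
print(lookup(index, ["text", "search"]))  # [2]
```

The sketch matches exact terms only, which is why synonymy and polysemy are not handled.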

Vector Space Model
Documents and user queries are represented as m-dimensional vectors,
where m is the total number of index terms in the document collection.
The degree of similarity of the document d with regard to the query q is
calculated as the correlation between the vectors that represent them,
using measures such as the Euclidean distance or the cosine of the angle
between these two vectors.
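A small sketch of the vector space model follows. It assumes raw term counts as weights and a whitespace tokenizer; the two sample documents and the helper names are made up for illustration.

```python
import math


def vectorize(text, vocabulary):
    """m-dimensional count vector, one component per index term."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


docs = {"d1": "text mining text search", "d2": "travel map travel"}
# m = number of distinct index terms in the collection.
vocabulary = sorted({t for text in docs.values() for t in text.split()})
query = vectorize("text mining", vocabulary)

# Rank documents by decreasing cosine similarity to the query.
ranked = sorted(
    docs,
    key=lambda d: cosine(query, vectorize(docs[d], vocabulary)),
    reverse=True,
)
print(len(vocabulary), ranked)  # 5 ['d1', 'd2']
```

Cosine ignores vector length, so long and short documents are compared on direction only, whereas Euclidean distance is sensitive to document length.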

Probabilistic Model
Basic assumption: Given a user query, there is a set of
documents which contains exactly the relevant documents and
no other (ideal answer set)
Querying process as a process of specifying the properties of
an ideal answer set. Since these properties are not known at
query time, an initial guess is made
This initial guess allows the generation of a preliminary
probabilistic description of the ideal answer set which is used
to retrieve the first set of documents
An interaction with the user is then initiated with the purpose
of improving the probabilistic description of the answer set

Types of Text Data Mining
Keyword-based association analysis
Automatic document classification
Similarity detection
Cluster documents by a common author
Cluster documents containing information from a common
source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual
patterns
Hypertext analysis
Patterns in anchors/links
Anchor text correlations with linked objects

Keyword-Based Association Analysis
Motivation
Collect sets of keywords or terms that occur frequently together and then
find the association or correlation relationships among them
Association Analysis Process
Preprocess the text data by parsing, stemming, removing stop words, etc.
Invoke association mining algorithms
Consider each document as a transaction
View a set of keywords in the document as a set of items in the transaction
Term-level association mining
No need for human effort in tagging documents
The number of meaningless results and the execution time are greatly reduced
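The document-as-transaction view can be sketched as follows. This only counts co-occurring keyword pairs against a minimum support; it is not a full association mining algorithm such as Apriori. The function name and sample keyword lists are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(documents, min_support):
    """Count keyword pairs, treating each document as one transaction."""
    counts = Counter()
    for keywords in documents:
        # Each distinct keyword counts once per document (set semantics).
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical keyword sets left after parsing, stemming and stop-word removal.
docs = [
    ["data", "mining", "text"],
    ["text", "mining"],
    ["data", "warehouse"],
]
print(frequent_pairs(docs, min_support=2))  # {('mining', 'text'): 2}
```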

Text Classification
Motivation
Automatic classification for the large number of online text documents
(Web pages, e-mails, corporate intranets, etc.)
Classification Process
Data preprocessing
Definition of training set and test sets
Creation of the classification model using the selected classification
algorithm
Classification model validation
Classification of new/unknown text documents
Text document classification differs from the classification of relational
data
Document databases are not structured according to attribute-value
pairs

Text Classification (2)
Classification Algorithms:
Support Vector Machines
K-Nearest Neighbors
Naive Bayes
Neural Networks
Decision Trees
Association rule-based
Boosting

Document Clustering
Motivation
Automatically group related documents based on their
contents
No predetermined training sets or taxonomies
Generate a taxonomy at runtime
Clustering Process
Data preprocessing: remove stop words, stem, feature
extraction, lexical analysis, etc.
Hierarchical clustering: compute similarities applying
clustering algorithms.
Model-based clustering (neural network approach): clusters
are represented by "exemplars" (e.g., SOM)
Text Categorization
Pre-given categories and labeled document
examples (categories may form a hierarchy)
Classify new documents
A standard classification (supervised learning)
problem
(Figure: a categorization system assigning documents to categories such as Sports, Business, Education, and Science.)
Applications
News article classification
Automatic e-mail filtering
Web page classification
Word sense disambiguation


Categorization Methods
Manual: Typically rule-based
Does not scale up (labor-intensive, rule inconsistency)
May be appropriate for special data on a particular domain
Automatic: Typically exploiting machine learning techniques
Vector space model based
Prototype-based (Rocchio)
K-nearest neighbor (KNN)
Decision tree (learn rules)
Neural networks (learn non-linear classifier)
Support Vector Machines (SVM)
Probabilistic or generative model based
Naive Bayes classifier
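As one concrete instance of the vector space model based methods listed above, the following is a minimal K-nearest-neighbor sketch. It assumes raw term counts, cosine similarity, and a simple majority vote; the training examples, labels and function names are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two Counter term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(new_doc, labeled_docs, k=3):
    """Label new_doc by majority vote among its k most similar examples."""
    vec = Counter(new_doc.lower().split())
    scored = sorted(
        ((cosine(vec, Counter(text.lower().split())), label)
         for text, label in labeled_docs),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents.
training = [
    ("football match score", "Sports"),
    ("team wins match", "Sports"),
    ("stock market shares", "Business"),
    ("market profit shares", "Business"),
]
print(knn_classify("match score today", training))  # Sports
```

KNN needs no training phase, but every classification compares the new document against all labeled examples.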

How to Measure Similarity?
Given two documents
Similarity definition
dot product
normalized dot product (or cosine)

Illustrative Example
To whom is newdoc more similar?
Each cell shows the term frequency followed by the TF x IDF weight in parentheses; "-" means the term does not occur.

             text     mining   travel   map      search   engine   govern   president  congress
IDF (faked)  2.4      4.5      2.8      3.3      2.1      5.4      2.2      3.2        4.3
doc1         2 (4.8)  1 (4.5)  -        -        1 (2.1)  1 (5.4)  -        -          -
doc2         1 (2.4)  -        2 (5.6)  1 (3.3)  -        -        -        -          -
doc3         -        -        -        -        -        -        1 (2.2)  1 (3.2)    1 (4.3)
newdoc       1 (2.4)  1 (4.5)  -        -        -        -        -        -          -

Sim(newdoc, doc1) = 4.8*2.4 + 4.5*4.5
Sim(newdoc, doc2) = 2.4*2.4
Sim(newdoc, doc3) = 0
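The three similarities in this example can be reproduced with a dot product over TF x IDF weights. The IDF values and term frequencies below are the ones given in the example; the dictionary layout and function names are assumptions made for the sketch.

```python
# IDF values and term frequencies as given in the example (IDF is "faked").
IDF = {"text": 2.4, "mining": 4.5, "travel": 2.8, "map": 3.3, "search": 2.1,
       "engine": 5.4, "govern": 2.2, "president": 3.2, "congress": 4.3}
TF = {
    "doc1": {"text": 2, "mining": 1, "search": 1, "engine": 1},
    "doc2": {"text": 1, "travel": 2, "map": 1},
    "doc3": {"govern": 1, "president": 1, "congress": 1},
    "newdoc": {"text": 1, "mining": 1},
}


def weights(doc):
    """TF x IDF weight for every term of the named document."""
    return {term: tf * IDF[term] for term, tf in TF[doc].items()}


def sim(a, b):
    """Dot product of the two documents' weight vectors."""
    wa, wb = weights(a), weights(b)
    return sum(w * wb.get(term, 0.0) for term, w in wa.items())


for doc in ("doc1", "doc2", "doc3"):
    print(doc, round(sim("newdoc", doc), 2))
```

With these numbers, newdoc is most similar to doc1 (about 31.77), then doc2 (about 5.76), and shares no terms with doc3.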

Categorization Methods
Vector space model
KNN
Decision tree
Neural network
Support vector machine
Probabilistic model
Naive Bayes classifier
Many, many others and variants exist [F.S. 02]
e.g. Bim, Nb, Ind, Swap1, LLSF, Widrow-Hoff, Rocchio, Gis-W,
Evaluation (cont.)
Benchmarks
Classic: Reuters collection
A set of newswire stories classified under categories related to
economics.
Effectiveness
Difficulties of strict comparison
different parameter settings
different split (or selection) between training and testing
various optimizations
However, widely recognized:
Best: boosting-based committee classifier & SVM
Worst: Naive Bayes classifier
Need to consider other factors, especially efficiency
Summary: Text Categorization
Wide application domain
Comparable effectiveness to professionals
Manual TC is not 100% accurate and is unlikely to improve
substantially.
A.T.C. (automatic text categorization) is growing at a steady pace
Prospects and extensions
Very noisy text, such as text from O.C.R.
Speech transcripts

Research Problems in Text Mining
Google: what is the next step?
How to find the pages that approximately match
sophisticated documents, with incorporation of user profiles
or preferences?
Looking back at Google: inverted indices
Construction of indices for the sophisticated documents,
with incorporation of user profiles or preferences
Similarity search of such pages using such indices

Based on the slides by Deng Cai
Summary

Outline
Background on Web Search
VIPS (VIsion-based Page Segmentation)
Block-based Web Search
Block-based Link Analysis
Web Image Search & Clustering
Search Engine: Two Rank Functions
(Figure: search engine architecture. Components shown: Web Pages, Web Page Parser, Meta Data, Forward Index, Forward Link, URL Dictionary, Indexer, Inverted Index, Term Dictionary (Lexicon), Anchor Text Generator, Backward Link (Anchor Text), Web Graph Constructor, Web Topology Graph. The two rank functions are Relevance Ranking, a similarity based on content or text, and Importance Ranking, a ranking based on link structure analysis.)
Relevance Ranking
Inverted index
A data structure for supporting text queries
Like the index in a book
(Figure: disks with documents are indexed into an inverted index, which lists the documents for each term.)
aalborg    3452, 11437, ...
...
arm        4, 19, 29, 98, 143, ...
armada     145, 457, 789, ...
armadillo  678, 2134, 3970, ...
armani     90, 256, 372, 511, ...
...
zz         602, 1189, 3209, ...

The PageRank Algorithm
Basic idea
The significance of a page is determined by
the significance of the pages linking to it
More precisely:
Link graph: adjacency matrix A,
$$A_{ij} = \begin{cases} 1 & \text{if page } i \text{ links to page } j \\ 0 & \text{otherwise} \end{cases}$$
Construct a probability transition matrix M by renormalizing each row
of A to sum to 1
Mix M with the uniform matrix U: $\epsilon U + (1 - \epsilon) M$, where $U_{ij} = 1/n$ for all i, j
Treat the web graph as a Markov chain (random surfer)
The vector of PageRank scores p is then defined to be the stationary
distribution of this Markov chain. Equivalently, p is the principal right
eigenvector of the transition matrix $(\epsilon U + (1 - \epsilon) M)^T$:
$$(\epsilon U + (1 - \epsilon) M)^T p = p$$
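The stationary distribution can be approximated by power iteration: start from a uniform vector and repeatedly apply the transposed transition matrix. The sketch below is a hedged illustration. The weight on the uniform matrix U is called `eps` here and its default of 0.15 is an assumption, as is the handling of pages with no out-links, which the slides do not cover.

```python
def pagerank(links, eps=0.15, iterations=100):
    """Power iteration; links[i] lists the pages that page i links to.

    Pages are numbered 0..n-1. eps is the weight on the uniform matrix U
    (assumed default). Pages without out-links are treated as linking to
    every page, which is an assumption of this sketch.
    """
    n = len(links)
    p = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [eps / n] * n  # contribution of eps * U^T p, since sum(p) == 1
        for i, outs in enumerate(links):
            targets = outs if outs else range(n)
            share = (1 - eps) * p[i] / len(targets)  # row i of M sums to 1
            for j in targets:
                nxt[j] += share
        p = nxt
    return p


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
scores = pagerank([[1, 2], [2], [0]])
print([round(s, 3) for s in scores])
```

In this small graph page 2 receives links from both other pages and ends up with the highest score, while page 1, linked only from page 0, gets the lowest.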
Layout Structure
Compared to plain text, a web page is a 2D presentation
Rich visual effects created by different term types, formats, separators,
blank areas, colors, pictures, etc.
Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD

TEXT BODY (with position and font
type): The International Atomic Energy
Agency has concluded that Iran has
secretly produced small amounts of
nuclear materials including low enriched
uranium and plutonium that could be used
to develop nuclear weapons according to a
confidential report obtained by CNN
Hyperlink:
URL: http://www.cnn.com/...
Anchor Text: Al Qaeda
Image:
URL: http://www.cnn.com/image/...
Alt & Caption: Iran nuclear
Anchor Text: CNN Homepage News

Web Page Block: Better Information Unit
Web Page Blocks
Importance = Low
Importance = Med
Importance = High

Motivation for VIPS (VIsion-based Page Segmentation)
Problems of treating a web page as an atomic unit
A web page usually contains not only pure content
Noise: navigation, decoration, interaction, ...
Multiple topics
Different parts of a page are not equally important
A web page has internal structure
Two-dimensional logical structure & visual layout
presentation
More structure than a free-text document (>)
Less structure than a structured document (<)
Layout: the 3rd dimension of a Web page
1st dimension: content
2nd dimension: hyperlink
Is DOM a Good Representation of Page Structure?
Page segmentation using DOM
Extract structural tags such as P, TABLE, UL,
TITLE, H1~H6, etc.
DOM is more related to content display and does not
necessarily reflect semantic structure
How about XML?
A long way to go to replace HTML

VIPS Algorithm
Motivation:
In many cases, topics can be distinguished with visual clues, such as
position, distance, font, color, etc.
Goal:
Extract the semantic structure of a web page based on its visual
presentation.
Procedure:
Top-down partition of the web page based on the separators
Result
A tree structure; each node in the tree corresponds to a block in the
page.
Each node will be assigned a value (Degree of Coherence) to indicate
how coherent the content in the block is, based on visual perception.
Each block will be assigned an importance value
Hierarchy or flat

VIPS: An Example
A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for
each block
Shows the intra-coherence of the block
The DoC of a child block must be no less than
its parent's
The Permitted Degree of Coherence (PDoC)
can be predefined to achieve different
granularities for the content structure
The segmentation will stop only when all
the blocks' DoC is no less than PDoC
The smaller the PDoC, the coarser the
content structure would be
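The stopping rule can be illustrated on a block tree that has already been built. This sketch does not perform visual separator detection, which is the core of VIPS; it only shows how PDoC controls granularity. The dictionary layout, block names and numeric DoC values are hypothetical.

```python
def segment(block, pdoc):
    """Return the names of the blocks at which top-down splitting stops.

    A block is kept whole once its DoC is no less than pdoc,
    or when it has no children left to split into.
    """
    if block["doc"] >= pdoc or not block["children"]:
        return [block["name"]]
    result = []
    for child in block["children"]:
        result.extend(segment(child, pdoc))
    return result


def leaf(name, doc):
    return {"name": name, "doc": doc, "children": []}


# Hypothetical page; every child's DoC is no less than its parent's.
page = {
    "name": "page", "doc": 2, "children": [
        leaf("header", 9),
        {"name": "body", "doc": 5,
         "children": [leaf("article", 8), leaf("sidebar", 7)]},
    ],
}

print(segment(page, pdoc=4))  # ['header', 'body']
print(segment(page, pdoc=6))  # ['header', 'article', 'sidebar']
```

Lowering PDoC to 1 returns the whole page as a single block, which matches the statement that a smaller PDoC gives a coarser content structure.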

Block-based Web Search
Index blocks instead of whole pages
Block retrieval
Combining DocRank and BlockRank
Block query expansion
Select expansion terms from relevant blocks

A Sample of User Browsing Behavior

ImageRank
(Figure panels: Relevance Ranking, Importance Ranking, Combined Ranking)

ImageRank vs. PageRank
Dataset
26.5 million web pages
11.6 million images
Query set
45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100
results returned by the system (iFind)
Ranking method
$$s(x) = \alpha \cdot \text{rank}_{\text{importance}}(x) + (1 - \alpha) \cdot \text{rank}_{\text{relevance}}(x)$$

ImageRank vs. PageRank
Image search accuracy using ImageRank and
PageRank. Both of them achieved their best
results at α = 0.25.
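The combined ranking is a weighted sum of an importance term and a relevance term. The sketch below uses 0.25 as the default weight because that is the value reported above as giving the best results. It assumes both inputs are scores where larger is better, which the slides do not state; the item names and numbers are made up.

```python
def combined_score(importance, relevance, alpha=0.25):
    """s(x) = alpha * importance + (1 - alpha) * relevance."""
    return alpha * importance + (1 - alpha) * relevance


def rank(items, alpha=0.25):
    """Order item names by decreasing combined score.

    items maps a name to an (importance, relevance) pair; both are
    assumed to be scores in which a larger value is better.
    """
    return sorted(
        items,
        key=lambda x: combined_score(items[x][0], items[x][1], alpha),
        reverse=True,
    )


# Hypothetical images: "a" is important but barely relevant, "b" the reverse.
images = {"a": (0.9, 0.2), "b": (0.1, 0.8)}
print(rank(images))             # ['b', 'a']
print(rank(images, alpha=1.0))  # ['a', 'b']
```

With a small weight on importance, relevance dominates and importance mostly breaks ties among similarly relevant results.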
Example on Image Clustering & Embedding
1710 JPG images in 1287 pages are crawled within the website
http://www.yahooligans.com/content/animals/
Six categories: Fish, Reptile, Mammal, Bird, Amphibian, Insect

Algorithms
Web Image Search Result Presentation
Figure 1. Top 8 returns of query "pluto" in Google's image search engine (a)
and AltaVista's image search engine (b)
Two different topics in the search result
A possible solution:
Cluster search results into different semantic groups

Three kinds of WWW image representation
Visual Feature Based Representation
Traditional CBIR
Textual Feature Based Representation
Surrounding text in image block
Link Graph Based Representation
Image graph embedding

Hierarchical Clustering
Clustering based on three representations
Visual feature
Hard to reflect the semantic meaning
Textual feature
Semantic
Sometimes the surrounding text is too little
Link graph:
Semantic
Many disconnected subgraphs (too many clusters)
Two Steps:
Using texts and link information to get semantic clusters
For each cluster, using visual features to reorganize the images
to facilitate the user's browsing
Our System
Dataset
26.5 million web pages
http://dir.yahoo.com/Arts/Visual_Arts/Photography/Museums_and_Galleries/
11.6 million images
Filter out images whose ratio between width and height is greater than
5 or smaller than 1/5
Remove images whose width and height are both smaller than 60
pixels
Analyze pages and index images
VIPS: Pages -> Blocks
Surrounding texts used to index images
An illustrative example
Query "Pluto"
Top 500 results
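The two image filters in the dataset description above translate into a short predicate. The thresholds (ratio 5 and 1/5, 60 pixels) come from the slide; the function name, parameter names and the handling of non-positive sizes are assumptions of this sketch.

```python
def keep_image(width, height, min_side=60, max_ratio=5.0):
    """Return True if the image passes both filters.

    Dropped: images with width/height greater than max_ratio or smaller
    than 1/max_ratio, and images whose width AND height are both
    smaller than min_side pixels.
    """
    if width <= 0 or height <= 0:
        return False  # assumption: invalid sizes are discarded
    if width < min_side and height < min_side:
        return False
    ratio = width / height
    return 1 / max_ratio <= ratio <= max_ratio


print(keep_image(600, 100))  # False: ratio 6 is greater than 5
print(keep_image(50, 50))    # False: both sides under 60 pixels
print(keep_image(200, 100))  # True
```

Filters of this kind are commonly used to discard banners, separators and icons before indexing, though the slides do not state the rationale.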
Clustering Using Visual Feature
Figure 5. Five clusters of search results of query "pluto" using low-level visual
feature. Each row is a cluster.
From the perspectives of color and texture, the clustering
results are quite good. Different clusters have different
colors and textures. However, from a semantic perspective,
these clusters make little sense.
Clustering Using Textual Feature
Figure 6. The eigengap curve with k for the "pluto" case using textual representation (plot: k from 0 to 40 on the horizontal axis, eigengap from 0 to 0.04 on the vertical axis)
Figure 7. Six clusters of search results of query "pluto" using textual feature. Each row is a cluster
Six semantic categories are correctly
identified if we choose k = 6.
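The slides do not define how the eigengap is computed. A common heuristic in spectral clustering, assumed here, is to sort the eigenvalues in descending order and pick the k with the largest gap between the k-th and (k+1)-th eigenvalue. The function name and the eigenvalues below are made up for illustration.

```python
def choose_k(eigenvalues):
    """Pick k maximizing eigenvalues[k-1] - eigenvalues[k].

    eigenvalues must be sorted in descending order and contain at
    least two values. This is the usual eigengap heuristic, used here
    as an assumption about what the curve in Figure 6 shows.
    """
    gaps = [eigenvalues[i] - eigenvalues[i + 1]
            for i in range(len(eigenvalues) - 1)]
    return gaps.index(max(gaps)) + 1  # convert 0-based index to k


# Hypothetical spectrum with a clear drop after the sixth eigenvalue.
spectrum = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.6, 0.55]
print(choose_k(spectrum))  # 6
```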

Summary
More improvement on web search can be
made by mining web page layout structure
Leverage visual cues for web information
analysis & information extraction
Demos:
http://www.ews.uiuc.edu/~dengcai2
Papers
VIPS demo & dll

Review Questions
Define spatial data mining.
What is document rank in the context of text
mining?
Can we construct a spatial data warehouse?
List the two types of measures in a spatial data cube.
List the two types of multimedia indexing and retrieval
systems.
Give a note on the multimedia data cube.
What is information retrieval?
List the methods for information retrieval.
What is meant by an authoritative web page?
What is web usage mining?

Bibliography
