CMU SCS

Tools for large graph mining
WWW 2008 tutorial

Part 3: Matrix tools for graph mining
Jure Leskovec and Christos Faloutsos
Machine Learning Department

Joint work with: Deepay Chakrabarti, Tamara Kolda and Jimeng Sun.

Tutorial outline
Part 1: Structure and models for networks
  What are the properties of large graphs?
  How do we model them?

Part 2: Dynamics of networks
  Diffusion and cascading behavior
  How do viruses and information propagate?

Part 3: Matrix tools for mining graphs
  Singular value decomposition (SVD)
  Random walks

Part 4: Case studies
  240 million MSN instant messenger network
  Graph projections: what does the web look like?

Leskovec & Faloutsos, WWW 2008

About part 3
Introduce matrix and tensor tools through real mining applications
Goal: find patterns, rules, clusters, outliers
  in matrices and
  in tensors

What is this part about?
Connection of matrix tools and networks
Matrix tools
  Singular Value Decomposition (SVD)
  Principal Component Analysis (PCA)
  Web page ranking algorithms: HITS, PageRank
  CUR decomposition
  Co-clustering (in part 4 of the tutorial)
Tensor tools
  Tucker decomposition
Applications

Why matrices? Examples
Social networks
Documents and terms
Authors and terms

         John   Peter   Mary   Nick   ...
John        0      11     22     55   ...
Peter       5       0      6      7   ...
Mary      ...     ...    ...    ...   ...
Nick      ...     ...    ...    ...   ...
...

Why tensors? Example
Tensor: n-dimensional generalization of matrix

SIGMOD'07   data   mining   classif.   tree   ...
John          13       11         22     55   ...
Peter          5        4          6      7   ...
Mary         ...      ...        ...    ...   ...
Nick         ...      ...        ...    ...   ...
...

Why tensors? Example
Tensor: n-dimensional generalization of matrix

[Figure: the author x keyword matrix (John, Peter, Mary, Nick, ... by data, mining, classif., tree, ...) repeated as one slice per conference (SIGMOD'05, SIGMOD'06, SIGMOD'07) and stacked into a 3-mode tensor]

Tensors are useful for 3 or more modes
Terminology: 'mode' (or 'aspect'):

[Figure: the stacked keyword-count tensor with its three directions labeled Mode (== aspect) #1, Mode #2 and Mode #3]

Motivating applications
Why are matrices important?
Why are tensors useful?
  P1: social networks
  P2: web & text mining
  P3: network forensics
  P4: sensor networks

[Figures: social networks; sensor networks (temperature value vs. time (min)); network forensics (source x destination matrices for normal traffic and abnormal traffic)]

Static Data model
Tensor
  Formally: [equation image]
  Generalization of matrices
  Represented as multi-array (~ data cube).

Order            1st      2nd      3rd
Correspondence   Vector   Matrix   3D array
Example          [pictures]

Dynamic Data model
Tensor Streams
  A sequence of Mth-order tensors [equation image], where t is increasing over time

Order            1st                2nd                    3rd
Correspondence   Multiple streams   Time-evolving graphs   3D arrays
Example          [pictures; axes labeled time, author, keyword]

SVD: Examples of Matrices
Example/Intuition: Documents and terms
Find patterns, groups, concepts

           data   mining   classif.   tree   ...
Paper#1      13       11         22     55   ...
Paper#2       5        4          6      7   ...
Paper#3     ...      ...        ...    ...   ...
Paper#4     ...      ...        ...    ...   ...
...

Singular Value Decomposition (SVD)
$X = U \Sigma V^T$

  $X = [x^{(1)}\ x^{(2)}\ \cdots\ x^{(M)}]$: input data
  $U = [u_1\ u_2\ \cdots\ u_k]$: left singular vectors
  $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$: singular values
  $V = [v_1\ v_2\ \cdots\ v_k]$: right singular vectors

SVD as spectral decomposition
$A \approx \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$, i.e. the product $U \Sigma V^T$ written as a sum of rank-one terms
Best rank-k approximation in L2 and Frobenius
SVD only works for static matrices (a single 2nd-order tensor)
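
The "best rank-k approximation" claim can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic random matrix (the matrix, its size and the helper name `rank_k_approx` are illustrative, not from the slides). It truncates the SVD to k rank-one terms and compares the error with the discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k truncation of the SVD of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum of the first k rank-one terms sigma_i * u_i * v_i^T.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))          # synthetic data, for illustration only
k = 2
A_k, s = rank_k_approx(A, k)

frob_err = np.linalg.norm(A - A_k, "fro")  # Frobenius-norm error
spec_err = np.linalg.norm(A - A_k, 2)      # L2 (spectral) norm error

print("Frobenius error:", frob_err, "expected:", np.sqrt(np.sum(s[k:] ** 2)))
print("Spectral error: ", spec_err, "expected:", s[k])
```

The Frobenius error should equal the root of the sum of the squared discarded singular values, and the spectral error should equal the first discarded singular value.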

Vector outer product - intuition:

[Figure: a 2-d histogram of owner age (20; 30; 40) by car type (VW, Volvo, BMW), next to the two 1-d histograms + independence assumption]

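SVD Example
$A = U \Sigma V^T$ example (rows: four CS documents, then three MD documents; columns: data, inf., retrieval, brain, lung):

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$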
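
The numbers in this example can be reproduced with a few lines of NumPy. This is a sketch under the assumption that the columns are ordered data, inf., retrieval, brain, lung. Singular vectors are only defined up to sign, so absolute values are compared.

```python
import numpy as np

# Document-to-term matrix from the slide: 4 CS documents, then 3 MD documents.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A has rank 2, so only the first two singular triplets carry information.
U2, s2, Vt2 = U[:, :2], s[:2], Vt[:2, :]

print("singular values:", np.round(s2, 2))
print("doc-to-concept (|U|):\n", np.round(np.abs(U2), 2))
print("term-to-concept (|V^T|):\n", np.round(np.abs(Vt2), 2))
```

The first singular triplet corresponds to the CS-concept and the second to the MD-concept.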

SVD Example
$A = U \Sigma V^T$ example (same matrices): the first column of $U$ and the first row of $V^T$ correspond to the CS-concept, the second to the MD-concept.

SVD Example
$A = U \Sigma V^T$ example (same matrices): $U$ is the doc-to-concept similarity matrix, with one column for the CS-concept and one for the MD-concept.

SVD Example
$A = U \Sigma V^T$ example (same matrices): the first singular value, 9.64, is the 'strength' of the CS-concept.

SVD Example
$A = U \Sigma V^T$ example (same matrices): $V^T$ is the term-to-concept similarity matrix; its first row, (0.58, 0.58, 0.58, 0, 0), is the CS-concept.

SVD Interpretation
'documents', 'terms' and 'concepts':
Q: if A is the document-to-term matrix, what is $A^T A$?
A: term-to-term ([m x m]) similarity matrix
Q: $A A^T$?
A: document-to-document ([n x n]) similarity matrix

SVD properties
V are the eigenvectors of the covariance matrix $A^T A$ (a covariance matrix, up to scaling, when the columns of A are centered)
U are the eigenvectors of the Gram (inner-product) matrix $A A^T$

Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.
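
Both the similarity-matrix interpretation and the eigenvector properties can be verified directly. The sketch below assumes NumPy and a small synthetic document-to-term matrix (the sizes and data are made up for illustration). It checks that the columns of V and U are eigenvectors of $A^T A$ and $A A^T$ with eigenvalues $\sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))   # n = 6 documents, m = 4 terms (synthetic)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

term_sim = A.T @ A                # [m x m] term-to-term similarity
doc_sim = A @ A.T                 # [n x n] document-to-document similarity

# Eigenvector check: (A^T A) v_i = sigma_i^2 v_i and (A A^T) u_i = sigma_i^2 u_i.
resid_V = term_sim @ V - V * s**2
resid_U = doc_sim @ U - U * s**2

print("max residual for V:", np.abs(resid_V).max())
print("max residual for U:", np.abs(resid_U).max())
```

Both residuals should be zero up to floating-point error.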

Principal Component Analysis (PCA)

[Figure: SVD of the n x m data matrix into rank-k factors, with parts of the factorization labeled "Loading" and "PCs"]

PCA is an important application of SVD
Note that U and V are dense and may have negative entries

PCA interpretation
best axis to project on: ('best' = min sum of squares of projection errors)

[Figure: scatter plot of documents on axes Term1 ('data') and Term2 ('lung')]

PCA interpretation
$U \Sigma V^T$

[Figure: scatter plot on axes Term1 ('data') and Term2 ('retrieval'), with the first singular vector v1 drawn through the points]

PCA projects points onto the "best" axis: minimum RMS error
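
The "best axis" statement can be illustrated with a short experiment. The sketch assumes NumPy and synthetic 2-d points standing in for the two term axes (the data and the helper `sq_projection_error` are hypothetical). It takes the first right singular vector of the centered data and compares its projection error with a sweep of other directions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 2-d points scattered around a line (stand-ins for Term1, Term2).
t = rng.standard_normal(200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.standard_normal(200)])

Xc = X - X.mean(axis=0)                  # PCA operates on centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]                               # first right singular vector

def sq_projection_error(Xc, axis):
    """Sum of squared distances from the points to the line spanned by axis."""
    axis = axis / np.linalg.norm(axis)
    proj = np.outer(Xc @ axis, axis)
    return np.sum((Xc - proj) ** 2)

best = sq_projection_error(Xc, v1)

# Sweep directions over half a circle and record their projection errors.
angles = np.linspace(0, np.pi, 181)
others = [sq_projection_error(Xc, np.array([np.cos(a), np.sin(a)]))
          for a in angles]

print("error along v1:", best, " smallest error in sweep:", min(others))
```

In two dimensions the error along v1 equals the square of the second singular value, which is what remains after the dominant direction is removed.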

Kleinberg's algorithm HITS
Problem definition:
  given the web and a query
  find the most 'authoritative' web pages for this query
Step 0: find all pages containing the query terms
Step 1: expand by one move forward and backward

Further reading:
1. J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA 1998

Kleinberg's algorithm HITS
on the resulting graph, give high score (= 'authorities') to nodes that many important nodes point to
give high importance score ('hubs') to nodes that point to good 'authorities'

[Figure: hubs pointing to authorities]

Kleinberg's algorithm HITS
observations
  recursive definition!
  each node (say, the i-th node) has both an authoritativeness score $a_i$ and a hubness score $h_i$

Kleinberg's algorithm: HITS
Let A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists
Let h and a be [n x 1] vectors with the 'hubness' and 'authoritativeness' scores.
Then:

Kleinberg's algorithm: HITS
Then:   [Figure: nodes k, l, m each point to node i]
$a_i = h_k + h_l + h_m$
that is
$a_i = \sum_j h_j$ over all j such that the (j,i) edge exists
or
$a = A^T h$

Kleinberg's algorithm: HITS
symmetrically, for the 'hubness':   [Figure: node i points to nodes n, p, q]
$h_i = a_n + a_p + a_q$
that is
$h_i = \sum_j a_j$ over all j such that the (i,j) edge exists
or
$h = A a$

Kleinberg's algorithm: HITS
In conclusion, we want vectors h and a such that:
$h = A a$
$a = A^T h$
That is:
$a = A^T A a$ (up to a scaling constant, i.e. a is an eigenvector of $A^T A$)

Kleinberg's algorithm: HITS
a is a right singular vector of the adjacency matrix A (by dfn!), a.k.a. the eigenvector of $A^T A$
Starting from random a' and iterating, we'll eventually converge
Q: to which of all the eigenvectors? why?
A: to the one of the strongest eigenvalue, since $(A^T A)^k a' \approx \lambda_1^k a$ up to a constant factor
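
The iteration $h = A a$, $a = A^T h$ can be sketched as a power method. The example assumes NumPy and a hypothetical 5-node graph (not one from the slides), and it starts from a vector of ones rather than a random a' so that the result is reproducible. The scores are compared against the first singular vectors of A.

```python
import numpy as np

def hits(A, iters=50):
    """Iterate h = A a, a = A^T h with normalization; returns (h, a)."""
    n = A.shape[0]
    a = np.ones(n)                       # positive start vector (assumption)
    for _ in range(iters):
        h = A @ a                        # hubness: sum of authorities pointed to
        h /= np.linalg.norm(h)
        a = A.T @ h                      # authoritativeness: sum of hubs pointing in
        a /= np.linalg.norm(a)
    return h, a

# Hypothetical graph: A[i, j] = 1 if the edge from i to j exists.
A = np.zeros((5, 5))
A[0, [2, 3, 4]] = 1
A[1, [2, 3]] = 1
A[4, 2] = 1

h, a = hits(A)

# Compare with the first left/right singular vectors (defined up to sign).
U, s, Vt = np.linalg.svd(A)
print("authority scores:", np.round(a, 3))
print("hub scores:      ", np.round(h, 3))
print("|v1|:            ", np.round(np.abs(Vt[0]), 3))
```

The normalization in each step only fixes the scale; it does not change the direction the iteration converges to.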

Kleinberg's algorithm - discussion
'authority' score can be used to find 'similar pages' (how?)
closely related to 'citation analysis', social networks / 'small world' phenomena

Motivating problem: PageRank
Given a directed graph, find its most interesting/central node

A node is important, if it is connected with important nodes (recursive, but OK!)

Motivating problem - PageRank solution
Given a directed graph, find its most interesting/central node
Proposed solution: Random walk; spot most 'popular' node (-> steady state prob. (ssp))

A node has high ssp, if it is connected with high ssp nodes (recursive, but OK!)

(Simplified) PageRank algorithm
Let A be the transition matrix (= adjacency matrix); let B be the transpose, column-normalized - then

[Figure: example graph on nodes 1 to 5 and its From/To matrix B (nonzero entries 1 and 1/2), applied to the vector $p = (p_1, \ldots, p_5)^T$ to give $p$ again]

(Simplified) PageRank algorithm
$B p = p$

[Figure: the same matrix B and vector $p = (p_1, \ldots, p_5)^T$]

(Simplified) PageRank algorithm
$B p = 1 \cdot p$
thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is column-normalized)
Why does such a p exist?
  p exists if B is n x n, nonnegative, irreducible [Perron-Frobenius theorem]

(Simplified) PageRank algorithm
In short: imagine a particle randomly moving along the edges
compute its steady-state probabilities (ssp)
Full version of algo: with occasional random jumps
Why? To make the matrix irreducible

Full Algorithm
With probability 1 - c, fly out to a random node
Then, we have
$p = c B p + \frac{1-c}{n} \mathbf{1} \;\Rightarrow$
$p = \frac{1-c}{n} [I - c B]^{-1} \mathbf{1}$
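
The two forms of the full algorithm, the fixed-point iteration and the closed form, can be compared on a small example. This is a sketch assuming NumPy, a hypothetical 5-node graph (not the one drawn on the slides), and c = 0.85, a value the slides do not specify. It also assumes every node has at least one out-edge, so that column normalization is well defined.

```python
import numpy as np

def pagerank(A, c=0.85, iters=200):
    """Power iteration for p = c B p + (1 - c)/n 1, with B the column-normalized A^T."""
    n = A.shape[0]
    B = A.T.astype(float)
    B /= B.sum(axis=0)                   # assumes every node has an out-edge
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = c * (B @ p) + (1 - c) / n
    return p, B

# Hypothetical directed graph: A[i, j] = 1 if there is an edge from i to j.
A = np.array([
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
])
c = 0.85                                 # assumed value, not from the slides
n = A.shape[0]
p, B = pagerank(A, c)

# Closed form: p = (1 - c)/n [I - c B]^{-1} 1, solved without forming the inverse.
p_closed = (1 - c) / n * np.linalg.solve(np.eye(n) - c * B, np.ones(n))

print("iterative:  ", np.round(p, 4))
print("closed form:", np.round(p_closed, 4))
```

Solving the linear system is fine for small graphs; for large sparse graphs the iteration is usually preferred because it only needs matrix-vector products.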

Motivation of CUR or CMD
SVD, PCA all transform data into some abstract space (specified by a set of basis vectors)
  Interpretability problem
  Loss of sparsity

PCA interpretation

[Figure: scatter plot on axes Term1 ('data') and Term2 ('retrieval'), with the first singular vector v1 drawn through the points]

PCA projects points onto the "best" axis: minimum RMS error

CUR
Example-based projection: use actual rows and columns to specify the subspace
Given a matrix $A \in \mathbb{R}^{m \times n}$, find three matrices $C \in \mathbb{R}^{m \times c}$, $U \in \mathbb{R}^{c \times r}$, $R \in \mathbb{R}^{r \times n}$, such that $\|A - CUR\|$ is small

U is the pseudo-inverse of X:
$U = X^{\dagger} = (X^T X)^{-1} X^T$ (the closed form holds when X has full column rank)

[Figure labels: Orthogonal projection; Example-based]
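
A minimal CUR sketch follows, assuming NumPy and uniform sampling of rows and columns. The slides do not say what X is; here it is assumed to be the intersection of the sampled rows and columns, which is one common choice. `np.linalg.pinv` is used instead of the closed form because that X can be rank-deficient. The test matrix is synthetic with exact rank 2.

```python
import numpy as np

def cur(A, col_idx, row_idx):
    """Build C, U, R from actual columns and rows of A."""
    C = A[:, col_idx]                    # m x c: sampled columns of A
    R = A[row_idx, :]                    # r x n: sampled rows of A
    X = A[np.ix_(row_idx, col_idx)]      # r x c intersection (assumed choice of X)
    U = np.linalg.pinv(X, rcond=1e-10)   # c x r pseudo-inverse of X
    return C, U, R

rng = np.random.default_rng(3)
# Synthetic matrix of exact rank 2, so a few rows and columns span it.
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))

# Uniform sampling without replacement (biased sampling is not shown here).
col_idx = rng.choice(A.shape[1], size=4, replace=False)
row_idx = rng.choice(A.shape[0], size=4, replace=False)

C, U, R = cur(A, col_idx, row_idx)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print("relative error ||A - CUR|| / ||A||:", rel_err)
```

For an exactly low-rank matrix the reconstruction is exact up to rounding. On real data the error depends on how the rows and columns are sampled, which is the question taken up on the next slide.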

CUR (cont.)
Key question:
  How to select/sample the columns and rows?

Uniform sampling
Biased sampling
  CUR w/ absolute error bound
  CUR w/ relative error bound

Reference:
1. Tutorial: Randomized Algorithms for Matrices and Massive Datasets, SDM'06
2. Drineas et al. Subspace Sampling and Relative-error Matrix Approximation: Column-Row-Based Methods, ESA 2006
3. Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.

The sparsity property - pictorially:
SVD/PCA: destroys sparsity
  $A = U \Sigma V^T$
CUR: maintains sparsity
  $A = C\,U\,R$

The sparsity property
SVD: $A = U \Sigma V^T$
  A: big but sparse
  U, $V^T$: big and dense
  $\Sigma$: sparse and small
CUR: $A = C\,U\,R$
  A: big but sparse
  C, R: big but sparse
  U: dense but small

Matrix tools - summary
SVD:
  optimal for L2 - VERY popular (HITS, PageRank, Karhunen-Loeve, Latent Semantic Indexing, PCA, etc etc)

CUR (CMD etc)
  near-optimal; sparsity; interpretability

TENSORS

Reminder: SVD
$A \approx U \Sigma V^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$
Best rank-k approximation in L2

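Goal: extension to >= 3 modes

[Figure: an I x J x K tensor approximated with factor matrices of sizes I x R and J x R (B), a third factor matrix, and an R x R x R core; equivalently, a sum of rank-one terms]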

Tensors: Main points
2 major types of tensor decompositions: Kruskal and Tucker
both can be solved with "alternating least squares" (ALS)
Details follow - we start with terminology:

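Kruskal's Decomposition - intuition

[Figure: an I x J x K tensor written as a sum of rank-one terms, with factor matrices of sizes I x R and J x R (B), a third factor matrix, and an R x R x R core]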
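
The Kruskal form can be written out explicitly as a sum of R rank-one terms, or equivalently as a Tucker-style product with an R x R x R superdiagonal core. The sketch below assumes NumPy and synthetic factors (the sizes, weights and the helper `kruskal_to_tensor` are illustrative). It only reconstructs a tensor from given factors; it does not fit them, which the slides do with ALS.

```python
import numpy as np

def kruskal_to_tensor(weights, A, B, C):
    """Sum over r of weights[r] * (A[:, r] outer B[:, r] outer C[:, r])."""
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)

rng = np.random.default_rng(5)
R = 3
lam = np.array([3.0, 2.0, 1.0])          # one weight per rank-one term
A = rng.standard_normal((4, R))          # I x R
B = rng.standard_normal((5, R))          # J x R
C = rng.standard_normal((6, R))          # K x R

T = kruskal_to_tensor(lam, A, B, C)

# Same tensor as an explicit sum of outer products.
T_sum = sum(
    lam[r] * np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)

# Same tensor through an R x R x R core that is nonzero only on its superdiagonal.
G = np.zeros((R, R, R))
G[np.arange(R), np.arange(R), np.arange(R)] = lam
T_core = np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

print("shape:", T.shape)
print("max difference between the three forms:",
      max(np.abs(T - T_sum).max(), np.abs(T - T_core).max()))
```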

Tucker Decomposition - intuition

[Figure: an I x J x K tensor approximated by factor matrices A (I x R), B (J x S), C, and a core G (R x S x T)]

author x keyword x conference
A: author x author-group
B: keyword x keyword-group
C: conf. x conf-group
G: how groups relate to each other
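
One simple way to obtain a Tucker decomposition is the truncated higher-order SVD (HOSVD), sketched below with NumPy. This is a swapped-in technique chosen for brevity, not the alternating least squares procedure the slides mention, and the tensor is synthetic with a known multilinear rank. The function names `unfold`, `hosvd` and `tucker_to_tensor` are illustrative.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_to_tensor(G, A, B, C):
    """Rebuild the tensor from core G and factor matrices A, B, C."""
    return np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

def hosvd(T, ranks):
    """Truncated HOSVD: factors from SVDs of the unfoldings, then the core."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    A, B, C = factors
    # Core G = T x_1 A^T x_2 B^T x_3 C^T.
    G = np.einsum("ijk,ir,js,kt->rst", T, A, B, C)
    return G, A, B, C

rng = np.random.default_rng(4)
# Synthetic "author x keyword x conference" tensor of multilinear rank (2, 3, 2).
T = tucker_to_tensor(
    rng.standard_normal((2, 3, 2)),
    rng.standard_normal((10, 2)),        # author x author-group
    rng.standard_normal((8, 3)),         # keyword x keyword-group
    rng.standard_normal((5, 2)),         # conf. x conf-group
)

G, A, B, C = hosvd(T, ranks=(2, 3, 2))
rel_err = np.linalg.norm(T - tucker_to_tensor(G, A, B, C)) / np.linalg.norm(T)
print("core shape:", G.shape, " relative error:", rel_err)
```

When the requested ranks match the true multilinear rank, the reconstruction is exact up to rounding. With smaller ranks HOSVD gives an approximation that ALS-style methods can refine.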

2-d analog of Tucker decomposition
e.g., terms x documents

The term x document matrix (rows: med. terms, cs terms, common terms; columns: med. doc, cs doc)

$$
\begin{bmatrix}
.05 & .05 & .05 & 0 & 0 & 0 \\
.05 & .05 & .05 & 0 & 0 & 0 \\
0 & 0 & 0 & .05 & .05 & .05 \\
0 & 0 & 0 & .05 & .05 & .05 \\
.04 & .04 & 0 & .04 & .04 & .04 \\
.04 & .04 & .04 & 0 & .04 & .04
\end{bmatrix}
$$

is approximated by the product (term x term-group) x (term-group x doc-group) x (doc-group x doc):

$$
\begin{bmatrix}
.5 & 0 & 0 \\
.5 & 0 & 0 \\
0 & .5 & 0 \\
0 & .5 & 0 \\
0 & 0 & .5 \\
0 & 0 & .5
\end{bmatrix}
\begin{bmatrix}
.3 & 0 \\
0 & .3 \\
.2 & .2
\end{bmatrix}
\begin{bmatrix}
.36 & .36 & .28 & 0 & 0 & 0 \\
0 & 0 & 0 & .28 & .36 & .36
\end{bmatrix}
=
\begin{bmatrix}
.054 & .054 & .042 & 0 & 0 & 0 \\
.054 & .054 & .042 & 0 & 0 & 0 \\
0 & 0 & 0 & .042 & .054 & .054 \\
0 & 0 & 0 & .042 & .054 & .054 \\
.036 & .036 & .028 & .028 & .036 & .036 \\
.036 & .036 & .028 & .028 & .036 & .036
\end{bmatrix}
$$
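
The arithmetic of this example is easy to check. The sketch below multiplies the three factors with NumPy and compares the result with the original term x document matrix. The variable names are illustrative, and the values are the ones shown on the slides.

```python
import numpy as np

# term x term-group (med. terms, cs terms, common terms)
term_groups = np.array([
    [.5, 0, 0], [.5, 0, 0],
    [0, .5, 0], [0, .5, 0],
    [0, 0, .5], [0, 0, .5],
])
# term-group x doc-group
core = np.array([[.3, 0], [0, .3], [.2, .2]])
# doc-group x doc (med. doc, cs doc)
doc_groups = np.array([
    [.36, .36, .28, 0, 0, 0],
    [0, 0, 0, .28, .36, .36],
])

approx = term_groups @ core @ doc_groups

original = np.array([
    [.05, .05, .05, 0, 0, 0],
    [.05, .05, .05, 0, 0, 0],
    [0, 0, 0, .05, .05, .05],
    [0, 0, 0, .05, .05, .05],
    [.04, .04, 0, .04, .04, .04],
    [.04, .04, .04, 0, .04, .04],
])

print(np.round(approx, 3))
print("largest entrywise difference:", np.abs(original - approx).max())
```

Both matrices sum to 1, so the factorization preserves the total mass while smoothing entries within each block.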

Tensor tools - summary
Two main tools
  PARAFAC
  Tucker

Both find row-, column-, tube-groups
  but in PARAFAC the three groups are identical

To solve: Alternating Least Squares
Toolbox: from Tamara Kolda:
http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/

P1: Environmental sensor monitoring

[Figure: four panels plotting value against time (min), from 0 to 10000, for Light, Temperature, Voltage and Humidity]

P1: sensor monitoring

[Figure: 1st factor (scaling factor 250), shown as three panels: value over time (min), value per location, and value per type (Volt, Humid, Temp, Light)]

1st factor consists of the main trends:
  Daily periodicity on time
  Uniform on all locations
  Temp, Light and Volt are positively correlated while negatively correlated with Humid

P1: sensor monitoring

[Figure: 2nd factor (scaling factor 154), shown as three panels: value over time (min), value per location, and value per type (Volt, Humid, Temp, Light)]

2nd factor captures an atypical trend:
  Uniformly across all time
  Concentrating on 3 locations
  Mainly due to voltage

Interpretation: two sensors have low battery, and the other one has high battery.

P3: Social network analysis
Multiway latent semantic indexing (LSI)
  Monitor the change of the community structure over time

[Figure labels: Philip Yu, Michael Stonebraker; Pattern, Query]

P3: Social network analysis (cont.)

Group | Authors | Keywords | Year
DB | michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, object-orient | 1995
DB | surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
DM | jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004

Two groups are correctly identified: Databases and Data mining
People and concepts are drifting over time

P4: Network anomaly detection

[Figure: source x destination traffic matrices for abnormal traffic and normal traffic, and a plot of reconstruction error over time (error vs. hours)]

Reconstruction error gives indication of anomalies.
Prominent difference between normal and abnormal ones is mainly due to the unusual scanning activity (confirmed by the campus admin).
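
The idea of using reconstruction error as an anomaly signal can be sketched on synthetic data. This is a simplified matrix version, assuming NumPy: each hourly source x destination snapshot is flattened to a vector and projected onto a rank-1 subspace learned from early "normal" hours. The slides apply tensor analysis to real traffic; the data, the injected scan at hour 30 and all parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n_src, n_dst, n_hours = 20, 20, 50

# Synthetic normal traffic: one fixed low-rank pattern, rescaled, plus noise.
base = np.outer(rng.random(n_src), rng.random(n_dst))
snapshots = [
    base * (1 + 0.1 * rng.standard_normal())
    + 0.01 * rng.standard_normal((n_src, n_dst))
    for _ in range(n_hours)
]

# Injected anomaly: at hour 30, source 5 contacts every destination (a "scan").
snapshots[30][5, :] += 1.0

# Learn a rank-1 subspace from the first 20 hours, assumed to be normal.
X = np.stack([S.ravel() for S in snapshots[:20]])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:1].T                             # (n_src * n_dst) x 1 basis

def recon_error(S):
    """Norm of the part of the snapshot that the learned subspace cannot explain."""
    x = S.ravel()
    return np.linalg.norm(x - V @ (V.T @ x))

errors = np.array([recon_error(S) for S in snapshots])
print("hour with largest reconstruction error:", int(np.argmax(errors)))
```

Normal hours leave only noise in the residual, while the injected scan does not fit the learned pattern and stands out as a spike in the error.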

P5: Web graph mining
How to order the importance of web pages?
  Kleinberg's algorithm HITS
  PageRank
  Tensor extension on HITS (TOPHITS)

Kleinberg's Hubs and Authorities (the HITS method)
Sparse adjacency matrix and its SVD:

[Figure: the from x to adjacency matrix approximated by a sum of rank-one terms: (hub scores for 1st topic) x (authority scores for 1st topic) + (hub scores for 2nd topic) x (authority scores for 2nd topic) + ...]

Kleinberg, JACM, 1999

HITS Authorities on Sample Data
We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.

1st Principal Factor
  .97 www.ibm.com, .24 www.alphaworks.ibm.com, .08 www-128.ibm.com, .05 www.developer.ibm.com, .02 www.research.ibm.com, .01 www.redbooks.ibm.com, .01 news.com.com

2nd Principal Factor
  .99 www.lehigh.edu, .11 www2.lehigh.edu, .06 www.lehighalumni.com, .06 www.lehighsports.com, .02 www.bethlehem-pa.gov, .02 www.adobe.com, .02 lewisweb.cc.lehigh.edu, .02 www.leo.lehigh.edu, .02 www.distance.lehigh.edu, .02 fp1.cc.lehigh.edu

3rd Principal Factor
  .75 java.sun.com, .38 www.sun.com, .36 developers.sun.com, .24 see.sun.com, .16 www.samag.com, .13 docs.sun.com, .12 blogs.sun.com, .08 sunsolve.sun.com, .08 www.sun-catalogue.com, .08 news.com.com

4th Principal Factor
  .60 www.pueblo.gsa.gov, .45 www.whitehouse.gov, .35 www.irs.gov, .31 travel.state.gov, .22 www.gsa.gov, .20 www.ssa.gov, .16 www.census.gov, .14 www.govbenefits.gov, .13 www.kids.gov, .13 www.usdoj.gov

6th Principal Factor
  .97 mathpost.asu.edu, .18 math.la.asu.edu, .17 www.asu.edu, .04 www.act.org, .03 www.eas.asu.edu, .02 archives.math.utk.edu, .02 www.geom.uiuc.edu, .02 www.fulton.asu.edu, .02 www.amstat.org, .02 www.maa.org

Three-Dimensional View of the Web
Observe that this tensor is very sparse!

Kolda, Bader, Kenny, ICDM05

Topical HITS (TOPHITS)
Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.

[Figure: the from x to x term tensor approximated by a sum of rank-one terms, each combining hub scores, authority scores and term scores for one topic (1st topic, 2nd topic, ...)]

TOPHITS Terms & Authorities on Sample Data
TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.
Tensor: $w_k$ = # unique links using term k

1st Principal Factor
  Terms: .23 JAVA, .18 SUN, .17 PLATFORM, .16 SOLARIS, .16 DEVELOPER, .15 EDITION, .15 DOWNLOAD, .14 INFO, .12 SOFTWARE, .12 NO-READABLE-TEXT
  Authorities: .86 java.sun.com, .38 developers.sun.com, .16 docs.sun.com, .14 see.sun.com, .14 www.sun.com, .09 www.samag.com, .07 developer.sun.com, .06 sunsolve.sun.com, .05 access1.sun.com, .05 iforce.sun.com

2nd Principal Factor
  Terms: .20 NO-READABLE-TEXT, .16 FACULTY, .16 SEARCH, .16 NEWS, .16 LIBRARIES, .16 COMPUTING, .12 LEHIGH
  Authorities: .99 www.lehigh.edu, .06 www2.lehigh.edu, .03 www.lehighalumni.com

3rd Principal Factor
  Terms: .15 NO-READABLE-TEXT, .15 IBM, .12 SERVICES, .12 WEBSPHERE, .12 WEB, .11 DEVELOPERWORKS, .11 LINUX, .11 RESOURCES, .11 TECHNOLOGIES, .10 DOWNLOADS
  Authorities: .97 www.ibm.com, .18 www.alphaworks.ibm.com, .07 www-128.ibm.com, .05 www.developer.ibm.com, .02 www.redbooks.ibm.com, .01 www.research.ibm.com

4th Principal Factor
  Terms: .26 INFORMATION, .24 FEDERAL, .23 CITIZEN, .22 OTHER, .19 CENTER, .19 LANGUAGES, .15 U.S, .15 PUBLICATIONS, .14 CONSUMER, .13 FREE
  Authorities: .87 www.pueblo.gsa.gov, .24 www.irs.gov, .23 www.whitehouse.gov, .19 travel.state.gov, .18 www.gsa.gov, .09 www.consumer.gov, .09 www.kids.gov, .07 www.ssa.gov, .05 www.forms.gov, .04 www.govbenefits.gov

6th Principal Factor
  Terms: .26 PRESIDENT, .25 NO-READABLE-TEXT, .25 BUSH, .25 WELCOME, .17 WHITE, .16 U.S, .15 HOUSE, .13 BUDGET, .13 PRESIDENTS, .11 OFFICE
  Authorities: .87 www.whitehouse.gov, .18 www.irs.gov, .16 travel.state.gov, .10 www.gsa.gov, .08 www.ssa.gov, .05 www.govbenefits.gov, .04 www.census.gov, .04 www.usdoj.gov, .04 www.kids.gov, .02 www.forms.gov

12th Principal Factor
  Terms: .75 OPTIMIZATION, .58 SOFTWARE, .08 DECISION, .07 NEOS, .06 TREE, .05 GUIDE, .05 SEARCH, .05 ENGINE, .05 CONTROL, .05 ILOG
  Authorities: .35 www.palisade.com, .35 www.solver.com, .33 plato.la.asu.edu, .29 www.mat.univie.ac.at, .28 www.ilog.com, .26 www.dashoptimization.com, .26 www.grabitech.com, .25 www-fp.mcs.anl.gov, .22 www.spyderopts.com, .17 www.mosek.com

13th Principal Factor
  Terms: .46 ADOBE, .45 READER, .45 ACROBAT, .30 FREE, .30 NO-READABLE-TEXT, .29 HERE, .29 COPY, .05 DOWNLOAD
  Authorities: .99 www.adobe.com

16th Principal Factor
  Terms: .50 WEATHER, .24 OFFICE, .23 CENTER, .19 NO-READABLE-TEXT, .17 ORGANIZATION, .15 NWS, .15 SEVERE, .15 FIRE, .15 POLICY, .14 CLIMATE
  Authorities: .81 www.weather.gov, .41 www.spc.noaa.gov, .30 lwf.ncdc.noaa.gov, .15 www.cpc.ncep.noaa.gov, .14 www.nhc.noaa.gov, .09 www.prh.noaa.gov, .07 aviationweather.gov, .06 www.nohrsc.nws.gov, .06 www.srh.noaa.gov

19th Principal Factor
  Terms: .22 TAX, .17 TAXES, .15 CHILD, .15 RETIREMENT, .14 BENEFITS, .14 STATE, .14 INCOME, .13 SERVICE, .13 REVENUE, .12 CREDIT
  Authorities: .73 www.irs.gov, .43 travel.state.gov, .22 www.ssa.gov, .08 www.govbenefits.gov, .06 www.usdoj.gov, .03 www.census.gov, .03 www.usmint.gov, .02 www.nws.noaa.gov, .02 www.gsa.gov, .01 www.annualcreditreport.com

Conclusions
Real data are often in high dimensions with multiple aspects (modes)
Matrices and tensors provide elegant theory and algorithms
Several research problems are still open
  skewed distribution, anomaly detection, streaming algorithms, distributed/parallel algorithms, efficient out-of-core processing

References
Slides borrowed from SIGMOD 07 tutorial by Faloutsos, Kolda and Sun.