Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Analysis of Large Graphs: Link Analysis, PageRank
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

http://www.mmds.org

New Topic: Graph Data!


[Figure: course topic map]
  High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
  Graph data: PageRank, SimRank; Community Detection; Spam Detection
  Infinite data: Filtering data streams; Web advertising; Queries on streams
  Machine learning: SVM; Decision Trees; Perceptron, kNN
  Apps: Recommender systems; Association Rules; Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Graph Data: Social Networks

Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa...]

Graph Data: Media Networks

Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]

Graph Data: Information Nets

Citation networks and Maps of science
[Börner et al., 2012]

Graph Data: Communication Nets

[Figure: the Internet, drawn as routers connecting domain1, domain2, and domain3]

Graph Data: Technological Networks

Seven Bridges of Königsberg [Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.

Web as a Graph

Web as a directed graph:
  Nodes: Webpages
  Edges: Hyperlinks

[Figure: example pages connected by hyperlinks: "I teach a class on Networks." / "CS224W: Classes are in the Gates building" / "Computer Science Department at Stanford" / "Stanford University"]

Web as a Directed Graph


Broad Question

How to organize the Web?
First try: Human curated Web directories
  Yahoo, DMOZ, LookSmart
Second try: Web Search
  Information Retrieval investigates: Find relevant docs in a small and trusted set
    Newspaper articles, Patents, etc.
  But: Web is huge, full of untrusted documents, random things, web spam, etc.

Web Search: 2 Challenges

2 challenges of web search:
(1) Web contains many sources of information
  Who to trust?
  Trick: Trustworthy pages may point to each other!
(2) What is the best answer to query "newspaper"?
  No single right answer
  Trick: Pages that actually know about newspapers might all be pointing to many newspapers

Ranking Nodes on the Graph

Not all web pages are equally important
  www.joe-schmoe.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!

Link Analysis Algorithms

We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
  PageRank
  Topic-Specific (Personalized) PageRank
  Web Spam Detection Algorithms

PageRank: The Flow Formulation

Links as Votes

Idea: Links as votes
  Page is more important if it has more links
    In-coming links? Out-going links?
Think of in-links as votes:
  www.stanford.edu has 23,400 in-links
  www.joe-schmoe.com has 1 in-link
Are all in-links equal?
  Links from important pages count more
  Recursive question!

Example: PageRank Scores

[Figure: example graph with the PageRank score shown on each node]
  A: 3.3   B: 38.4   C: 34.3   D: 3.9   E: 8.1   F: 3.9
  Five unlabeled nodes: 1.6 each

Simple Recursive Formulation

Each link's vote is proportional to the importance of its source page
If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes
Page j's own importance is the sum of the votes on its in-links

[Figure: page i (3 out-links) and page k (4 out-links) link to page j; each of j's 3 out-links carries $r_j/3$]

$r_j = r_i/3 + r_k/4$

PageRank: The Flow Model

A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank $r_j$ for page j:

  $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node i

[Figure: "The web in 1839", three pages y, a, m; y links to itself and to a, a links to y and to m, m links to a]

Flow equations:

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

Solving the Flow Equations

Flow equations:

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

3 equations, 3 unknowns, no constants
  No unique solution
  All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
  $r_y + r_a + r_m = 1$
  Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
We need a new formulation!
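
For a graph this small, the flow equations can be solved directly. The sketch below is only an illustration and assumes NumPy is available; the variable names are ours, not from the slides. It rewrites the equations as $(M - I) r = 0$, swaps one redundant equation for the constraint $r_y + r_a + r_m = 1$, and calls a dense solver.

```python
import numpy as np

# Coefficients of the flow equations for pages (y, a, m): row j holds the
# weights of r_y, r_a, r_m on the right-hand side of the equation for r_j.
M = np.array([
    [0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
    [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
    [0.0, 0.5, 0.0],   # r_m = r_a/2
])

# (M - I) r = 0 has infinitely many solutions (any rescaling of r works),
# so replace the last equation with the constraint r_y + r_a + r_m = 1.
lhs = M - np.eye(3)
lhs[2, :] = 1.0
rhs = np.array([0.0, 0.0, 1.0])

r = np.linalg.solve(lhs, rhs)
print(dict(zip("yam", r)))
```

A dense solve like this takes time cubic in the number of pages, which is why it does not carry over to web-size graphs.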

PageRank: Matrix Formulation

Stochastic adjacency matrix $M$
  Let page $i$ have $d_i$ out-links
  If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
  $M$ is a column stochastic matrix
    Columns sum to 1
Rank vector $r$: vector with an entry per page
  $r_i$ is the importance score of page $i$
The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written
  $r = M \cdot r$
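
As a rough sketch of the definition above, the following builds $M$ from an adjacency list. It assumes NumPy, assumes every page has at least one out-link and no repeated links, and uses a hypothetical helper name `column_stochastic`; the y/a/m graph is the example used throughout the slides.

```python
import numpy as np

def column_stochastic(out_links, pages):
    """Return M with M[j, i] = 1/d_i if page i links to page j, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for src, dests in out_links.items():
        for dst in dests:
            # Column = source page, row = destination page.
            M[index[dst], index[src]] = 1.0 / len(dests)
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic(out_links, pages)
print(M)
print(M.sum(axis=0))  # each column should sum to 1
```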

Example

Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in the matrix form: $M \cdot r = r$
Suppose page i links to 3 pages, including j

[Figure: column i of $M$ holds 1/3 in row j, so entry $r_i$ of $r$ contributes $r_i/3$ to entry $r_j$ of $M \cdot r$]

Eigenvector Formulation

The flow equations can be written $r = M \cdot r$
So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  In fact, its first or principal eigenvector, with corresponding eigenvalue 1
  Largest eigenvalue of $M$ is 1 since $M$ is column stochastic (with non-negative entries)
    We know $r$ is unit length and each column of $M$ sums to one, so $\|M \cdot r\|_1 \le 1$
NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if: $A x = \lambda x$
We can now efficiently solve for $r$!
  The method is called Power iteration

Example: Flow Equations & M

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0

$r = M \cdot r$

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$

Power Iteration Method

Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
  Suppose there are N web pages
  Initialize: $r^{(0)} = [1/N, \dots, 1/N]^T$
  Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node i
  Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
    $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
    Can use any other vector norm, e.g., Euclidean
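
The scheme above can be sketched in a few lines. This is a minimal illustration assuming NumPy and a dense matrix; the function name `power_iteration` and the defaults for `eps` and `max_iter` are our own choices, not values from the slides.

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)              # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r                   # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r                             # budget exhausted; may not have converged

# Column-stochastic matrix of the y/a/m example (rows and columns ordered y, a, m).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])
r = power_iteration(M)
print(r)
```

On this example the result should agree with the solution of the flow equations, $(2/5, 2/5, 1/5)$, up to the tolerance.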

PageRank: How to solve?

Power Iteration:
  Set $r_j = 1/N$
  1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
  2: $r = r'$
  Goto 1

Example:

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

  Iteration    0      1      2      3      ...   final
  r_y         1/3    1/3    5/12   9/24    ...   6/15
  r_a         1/3    3/6    1/3    11/24   ...   6/15
  r_m         1/3    1/6    3/12   1/6     ...   3/15

Why Does Power Iteration Work? (1)

Details!

Power iteration: A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)
  $r^{(1)} = M \cdot r^{(0)}$
  $r^{(2)} = M \cdot r^{(1)} = M^2 \cdot r^{(0)}$
  $r^{(3)} = M \cdot r^{(2)} = M^3 \cdot r^{(0)}$
Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \dots, M^k \cdot r^{(0)}, \dots$ approaches the dominant eigenvector of $M$

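Why Does Power Iteration Work? (2)

Details!

Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \dots, M^k \cdot r^{(0)}, \dots$ approaches the dominant eigenvector of $M$
Proof:
  Assume $M$ has n linearly independent eigenvectors, $x_1, x_2, \dots, x_n$, with corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, where $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$
  Vectors $x_1, x_2, \dots, x_n$ form a basis and thus we can write:
    $r^{(0)} = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$
    $M r^{(0)} = c_1 (M x_1) + c_2 (M x_2) + \dots + c_n (M x_n) = c_1 (\lambda_1 x_1) + c_2 (\lambda_2 x_2) + \dots + c_n (\lambda_n x_n)$
  Repeated multiplication on both sides produces
    $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \dots + c_n (\lambda_n^k x_n)$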

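Why Does Power Iteration Work? (3)

Details!

Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \dots, M^k \cdot r^{(0)}, \dots$ approaches the dominant eigenvector of $M$
Proof (continued):
  Repeated multiplication on both sides produces
    $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \dots + c_n (\lambda_n^k x_n)$
    $M^k r^{(0)} = \lambda_1^k \left[ c_1 x_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k x_2 + \dots + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^k x_n \right]$
  Since $|\lambda_1| > |\lambda_2|$, the fractions satisfy $|\lambda_2/\lambda_1|, |\lambda_3/\lambda_1|, \dots < 1$
    and so $\left(\frac{\lambda_i}{\lambda_1}\right)^k \to 0$ as $k \to \infty$ (for all $i = 2, \dots, n$).
  Thus: $M^k r^{(0)} \approx c_1 (\lambda_1^k x_1)$
  Note if $c_1 = 0$ then the method won't converge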

Random Walk Interpretation

Imagine a random web surfer:
  At any time $t$, surfer is on some page $i$
  At time $t+1$, the surfer follows an out-link from $i$ uniformly at random
  Ends up on some page $j$ linked from $i$
  Process repeats indefinitely
Let:
  $p(t)$ be the vector whose $i$-th coordinate is the prob. that the surfer is at page $i$ at time $t$
  So, $p(t)$ is a probability distribution over pages

[Figure: pages $i_1$, $i_2$, $i_3$ link to page $j$, with $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$]

The Stationary Distribution

Where is the surfer at time t+1?
  Follows a link uniformly at random
    $p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state $p(t+1) = M \cdot p(t) = p(t)$
  then $p(t)$ is stationary distribution of a random walk
Our original rank vector $r$ satisfies $r = M \cdot r$
  So, $r$ is a stationary distribution for the random walk
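
One way to see the random-walk interpretation is to simulate a single surfer and count visits. The sketch below uses only the Python standard library; the function name `simulate_surfer`, the seed, and the step count are illustrative assumptions. For the y/a/m example the visit frequencies should land near the stationary distribution $(2/5, 2/5, 1/5)$, though a finite simulation only approximates it.

```python
import random
from collections import Counter

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random out-links and return the visit frequency of each page."""
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    visits = Counter()
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])   # pick one out-link uniformly at random
        visits[page] += 1
    return {p: visits[p] / steps for p in out_links}

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", steps=200_000)
print(freq)
```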

Existence and Uniqueness

A central result from the theory of random walks (a.k.a. Markov processes):
  For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0

PageRank: The Google Formulation

PageRank: Three Questions

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$ or equivalently $r = M r$

Does this converge?
Does it converge to what we want?
Are results reasonable?

Does this converge?

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$

Example (pages a and b link to each other):

  Iteration   0   1   2   3
  r_a         1   0   1   0
  r_b         0   1   0   1

Does it converge to what we want?

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$

Example (page a links to b; b has no out-links):

  Iteration   0   1   2   3
  r_a         1   0   0   0
  r_b         0   1   0   0
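
Both small examples can be replayed numerically. The sketch below assumes NumPy; the matrices `M_cycle` and `M_dead` are our encodings of a two-page cycle (a and b link to each other) and of a graph where a links to b and b has no out-links. The helper `iterate` is hypothetical.

```python
import numpy as np

def iterate(M, r0, steps):
    """Return the list [r(0), r(1), ..., r(steps)] under r(t+1) = M r(t)."""
    history = [np.array(r0, dtype=float)]
    for _ in range(steps):
        history.append(M @ history[-1])
    return history

# Column = source page, row = destination page, order (a, b).
M_cycle = np.array([[0.0, 1.0],
                    [1.0, 0.0]])   # a -> b and b -> a
M_dead = np.array([[0.0, 0.0],
                   [1.0, 0.0]])    # a -> b, b is a dead end (all-zero column)

cycle = iterate(M_cycle, [1, 0], 3)
dead = iterate(M_dead, [1, 0], 3)
print([r.tolist() for r in cycle])  # the score keeps swapping between a and b
print([r.tolist() for r in dead])   # the score moves to b, then disappears
```

The first run never settles, and the second settles on the all-zero vector, which is not a useful ranking.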

PageRank: Problems

2 problems:
(1) Some pages are dead ends (have no out-links)
  Random walk has nowhere to go to
  Such pages cause importance to leak out
(2) Spider traps: (all out-links are within the group)
  Random walker gets stuck in a trap
  And eventually spider traps absorb all importance

[Figure: example graph with a dead end and a spider trap marked]

Problem: Spider Traps

Power Iteration:
  Set $r_j = 1/N$
  And iterate $r_j = \sum_{i \to j} \frac{r_i}{d_i}$

Example (m is a spider trap):

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2 + r_m$

  Iteration    0      1      2      3       ...   final
  r_y         1/3    2/6    3/12   5/24     ...   0
  r_a         1/3    1/6    2/12   3/24     ...   0
  r_m         1/3    3/6    7/12   16/24    ...   1

All the PageRank score gets trapped in node m.

Solution: Teleports!

The Google solution for spider traps: At each time step, the random surfer has two options
  With prob. $\beta$, follow a link at random
  With prob. $1-\beta$, jump to some random page
  Common values for $\beta$ are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps

[Figure: the y/a/m graph before and after adding teleport links]

Problem: Dead Ends

Power Iteration:
  Set $r_j = 1/N$
  And iterate $r_j = \sum_{i \to j} \frac{r_i}{d_i}$

Example (m is a dead end):

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2$

  Iteration    0      1      2      3      ...   final
  r_y         1/3    2/6    3/12   5/24    ...   0
  r_a         1/3    1/6    2/12   3/24    ...   0
  r_m         1/3    1/6    1/12   2/24    ...   0

Here the PageRank leaks out since the matrix is not column stochastic.

Solution: Always Teleport!

Teleports: Follow random teleport links with probability 1.0 from dead-ends
  Adjust matrix accordingly

[Figure: the y/a/m graph with teleport links added from the dead end]
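
"Adjust matrix accordingly" can be read as replacing every all-zero column of $M$ with a uniform column. The sketch below shows that reading, assuming NumPy and a dense matrix; `fix_dead_ends` is a hypothetical helper name, and the example matrix is the dead-end version of the y/a/m graph.

```python
import numpy as np

def fix_dead_ends(M):
    """Return a copy of M where each all-zero column becomes uniform 1/N."""
    fixed = M.copy()
    n = fixed.shape[0]
    dead = np.isclose(fixed.sum(axis=0), 0.0)   # columns of pages with no out-links
    fixed[:, dead] = 1.0 / n                    # teleport with probability 1.0
    return fixed

# y/a/m graph where m is a dead end (rows and columns ordered y, a, m).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
])
M_fixed = fix_dead_ends(M)
print(M_fixed)
print(M_fixed.sum(axis=0))  # every column now sums to 1
```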

Why Teleports Solve the Problem?

Why are dead-ends and spider traps a problem and why do teleports solve the problem?
Spider-traps are not a problem, but with traps PageRank scores are not what we want
  Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead-ends are a problem
  The matrix is not column stochastic so our initial assumptions are not met
  Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go

Solution: Random Teleports

Google's solution that does it all:
  At each step, random surfer has two options:
    With probability $\beta$, follow a link at random
    With probability $1-\beta$, jump to some random page
PageRank equation [Brin-Page, 98]

  $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$, where $d_i$ is the out-degree of node i

This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

The Google Matrix

PageRank equation [Brin-Page, 98]
  $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$
The Google Matrix $A$:
  $A = \beta M + (1-\beta) [1/N]_{N \times N}$
  $[1/N]_{N \times N}$ is the N by N matrix where all entries are 1/N
We have a recursive problem: $r = A \cdot r$
  And the Power method still works!
What is $\beta$?
  In practice $\beta = 0.8, 0.9$ (make 5 steps on avg., jump)

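Random Teleports ($\beta = 0.8$)

$A = \beta M + (1-\beta) [1/N]_{N \times N}$, with rows and columns ordered y, a, m:

$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$

[Figure: the y/a/m graph with teleport links; edge weights are 7/15 and 1/15, with 13/15 on the self-loop of m]

  Iteration    0      1      2      3      ...   final
  r_y         1/3    0.33   0.28   0.26    ...   7/33
  r_a         1/3    0.20   0.20   0.18    ...   5/33
  r_m         1/3    0.47   0.52   0.56    ...   21/33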
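
The numbers in this example can be checked with a short script. This is a sketch assuming NumPy and a dense matrix, which is only practical for tiny graphs; the fixed count of 100 iterations is an arbitrary choice that is more than enough here.

```python
import numpy as np

beta = 0.8
# y/a/m graph where m is a spider trap (rows and columns ordered y, a, m).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
])
N = M.shape[0]

# Google matrix: follow a link with prob. beta, teleport with prob. 1 - beta.
A = beta * M + (1.0 - beta) * np.full((N, N), 1.0 / N)

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r

print(A * 15)   # entries in fifteenths
print(r * 33)   # scores in thirty-thirds
```

Scaling by 15 and 33 makes it easy to compare the output with the fractions on the slide.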

How do we actually compute the PageRank?

Computing PageRank

Key step is matrix-vector multiplication
  $r^{new} = A \cdot r^{old}$
Easy if we have enough main memory to hold $A$, $r^{old}$, $r^{new}$
Say N = 1 billion pages
  We need 4 bytes for each entry (say)
  2 billion entries for vectors, approx 8GB
  Matrix $A$ has $N^2$ entries
    $10^{18}$ is a large number!

$A = \beta M + (1-\beta) [1/N]_{N \times N}$

$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$

Matrix Formulation

Suppose there are N pages
Consider page i, with $d_i$ out-links
We have $M_{ji} = 1/|d_i|$ when $i \to j$ and $M_{ji} = 0$ otherwise
The random teleport is equivalent to:
  Adding a teleport link from i to every other page and setting transition probability to $(1-\beta)/N$
  Reducing the probability of following each out-link from $1/|d_i|$ to $\beta/|d_i|$
  Equivalent: Tax each page a fraction $(1-\beta)$ of its score and redistribute evenly

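Rearranging the Equation

$r = A \cdot r$, where $A_{ji} = \beta M_{ji} + \frac{1-\beta}{N}$

$r_j = \sum_{i=1}^{N} A_{ji} \cdot r_i$
$r_j = \sum_{i=1}^{N} \left[ \beta M_{ji} + \frac{1-\beta}{N} \right] \cdot r_i$
$r_j = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N} \sum_{i=1}^{N} r_i$
$r_j = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N}$, since $\sum_i r_i = 1$

So we get: $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$

Note: Here we assumed $M$ has no dead-ends
$[x]_N$ is a vector of length N with all entries x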

Sparse Matrix Formulation

We just rearranged the PageRank equation
  $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$
  where $[(1-\beta)/N]_N$ is a vector with all N entries $(1-\beta)/N$
$M$ is a sparse matrix! (with no dead-ends)
  10 links per node, approx 10N entries
So in each iteration, we need to:
  Compute $r^{new} = \beta M \cdot r^{old}$
  Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$
    Note if $M$ contains dead-ends then $\sum_j r_j^{new} < 1$ and we also have to renormalize $r^{new}$ so that it sums to 1

PageRank: The Complete Algorithm

Input: Graph $G$ and parameter $\beta$
  Directed graph $G$ (can have spider traps and dead ends)
  Parameter $\beta$
Output: PageRank vector $r^{new}$

Set: $r_j^{old} = 1/N$
repeat until convergence: $\sum_j |r_j^{new} - r_j^{old}| < \varepsilon$
  $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$
    $r'^{new}_j = 0$ if in-degree of $j$ is 0
  Now re-insert the leaked PageRank:
    $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1-S}{N}$, where: $S = \sum_j r'^{new}_j$
  $r^{old} = r^{new}$

If the graph has no dead-ends then the amount of leaked PageRank is $1-\beta$. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing $S$.
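
The algorithm above can be sketched in plain Python with an adjacency list instead of a matrix. Everything here is an illustrative assumption rather than the authors' code: the function name `pagerank`, the dictionary-based graph encoding, and the defaults for `eps` and `max_iter`. A dead end is written as a page with an empty out-link list.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """PageRank with teleports; leaked score (1 - S) is re-inserted evenly."""
    nodes = sorted(set(out_links) | {d for ds in out_links.values() for d in ds})
    n = len(nodes)
    r_old = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # r'_j = sum over in-links of beta * r_i / d_i (stays 0 with no in-links).
        r_new = {v: 0.0 for v in nodes}
        for src, dests in out_links.items():
            for dst in dests:
                r_new[dst] += beta * r_old[src] / len(dests)
        # Re-insert the leaked PageRank: add (1 - S) / N to every page.
        leaked = 1.0 - sum(r_new.values())
        for v in nodes:
            r_new[v] += leaked / n
        if sum(abs(r_new[v] - r_old[v]) for v in nodes) < eps:
            return r_new
        r_old = r_new
    return r_old

trap = pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]})   # m is a spider trap
dead = pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []})      # m is a dead end
print(trap)
print(dead)
```

For the spider-trap graph the result should match the $(7/33, 5/33, 21/33)$ example; for the dead-end graph the scores still sum to 1 because the leaked amount is computed from $S$ rather than assumed to be $1-\beta$.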

Sparse Matrix Encoding

Encode sparse matrix using only nonzero entries
  Space proportional roughly to number of links
  Say 10N, or 4*10*1 billion = 40GB
  Still won't fit in memory, but will fit on disk

  source node | degree | destination nodes
  0           | 3      | 1, 5, 7
  1           | 5      | 17, 64, 113, 117, 245
  2           | 2      | 13, 23

Basic Algorithm: Update Step

Assume enough RAM to fit $r^{new}$ into memory
  Store $r^{old}$ and matrix $M$ on disk
1 step of power-iteration is:
  Initialize all entries of $r^{new} = (1-\beta)/N$
  For each page i (of out-degree $d_i$):
    Read into memory: i, $d_i$, $dest_1, \dots, dest_{d_i}$, $r^{old}(i)$
    For j = 1 ... $d_i$:
      $r^{new}(dest_j)$ += $\beta \, r^{old}(i) / d_i$

  source | degree | destination
  0      | 3      | 1, 5, 6
  1      | 4      | 17, 64, 113, 117
  2      | 2      | 13, 23

[Figure: vectors $r^{new}$ and $r^{old}$, entries 0 to 6, on either side of the table]
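
The update step can be sketched as a single sequential pass over the encoded matrix. In this illustration the "disk file" is an in-memory text stream with one line per source page in the form `source degree dest1 dest2 ...`; that line format, the function name `update_step`, and keeping $r^{old}$ in a Python list are all assumptions made to keep the example self-contained.

```python
import io

def update_step(m_file, r_old, beta):
    """One power-iteration step, streaming the sparse matrix one source page at a time."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n            # initialize every entry to (1 - beta)/N
    for line in m_file:
        fields = line.split()
        src, degree = int(fields[0]), int(fields[1])
        dests = [int(x) for x in fields[2:]]
        for dst in dests:
            r_new[dst] += beta * r_old[src] / degree
    return r_new

# y/a/m spider-trap graph with pages numbered y=0, a=1, m=2.
m_file = io.StringIO("0 2 0 1\n1 2 0 2\n2 1 2\n")
r_old = [1 / 3, 1 / 3, 1 / 3]
r_new = update_step(m_file, r_old, beta=0.8)
print(r_new)
```

Only `r_new` is updated at random positions; the matrix is read front to back, which is what makes the disk-based layout workable.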

Analysis

Assume enough RAM to fit $r^{new}$ into memory
  Store $r^{old}$ and matrix $M$ on disk
In each iteration, we have to:
  Read $r^{old}$ and $M$
  Write $r^{new}$ back to disk
  Cost per iteration of Power method: $= 2|r| + |M|$
Question:
  What if we could not even fit $r^{new}$ in memory?

Block-based Update Algorithm

Break $r^{new}$ into k blocks that fit in memory
Scan $M$ and $r^{old}$ once for each block

  src | degree | destination
  0   | 4      | 0, 1, 3, 5
  1   | 2      | 0, 5
  2   | 2      | 3, 4

[Figure: $r^{new}$ split into blocks (entries 0-1, 2-3, 4-5) on the left of the table, $r^{old}$ with entries 0 to 5 on the right]

Analysis of Block Update

Similar to nested-loop join in databases
  Break $r^{new}$ into k blocks that fit in memory
  Scan $M$ and $r^{old}$ once for each block
Total cost:
  k scans of $M$ and $r^{old}$
  Cost per iteration of Power method: $k(|M| + |r|) + |r| = k|M| + (k+1)|r|$
Can we do better?
  Hint: $M$ is much bigger than $r$ (approx 10-20x), so we must avoid reading it k times per iteration
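
To get a feel for the two cost formulas, the sketch below plugs in rough sizes from the earlier slides: about 40GB for $M$ and about 4GB for one rank vector (N = 1 billion entries at 4 bytes each). The choice k = 4 and the function names are hypothetical.

```python
def basic_cost(m_size, r_size):
    """I/O per iteration when r_new fits in memory: read r_old and M, write r_new."""
    return 2 * r_size + m_size

def block_cost(m_size, r_size, k):
    """I/O per iteration with k blocks: k scans of M and r_old, plus writing r_new."""
    return k * (m_size + r_size) + r_size

m_gb, r_gb, k = 40, 4, 4   # sizes in GB; k = 4 is an arbitrary illustration
print(basic_cost(m_gb, r_gb))      # 48
print(block_cost(m_gb, r_gb, k))   # 180
```

Almost all of the extra cost comes from the k scans of $M$, which is the point of the hint.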

Block-Stripe Update Algorithm

Break $M$ into stripes! Each stripe contains only destination nodes in the corresponding block of $r^{new}$

[Figure: $M$ split into three stripes, one per block of $r^{new}$ (entries 0-1, 2-3, 4-5); each stripe is a table with columns src, degree, destination that lists only the destinations falling in its block; $r^{old}$ has entries 0 to 5]

Block-Stripe Analysis

Break $M$ into stripes
  Each stripe contains only destination nodes in the corresponding block of $r^{new}$
Some additional overhead per stripe
  But it is usually worth it
Cost per iteration of Power method:
  $= |M|(1+\varepsilon) + (k+1)|r|$
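
A minimal sketch of the block-stripe idea follows, in plain Python. The helper names `make_stripes` and `block_stripe_step`, the tuple encoding `(src, degree, dests)`, and the example rows are assumptions for illustration. Each stripe keeps the full out-degree of the source page, since the vote per link is still $\beta \, r^{old}(i)/d_i$ even when only some destinations fall in the stripe.

```python
def make_stripes(rows, n, k):
    """Split rows (src, degree, dests) into k stripes by destination block."""
    block = -(-n // k)                       # block size, rounded up
    stripes = [[] for _ in range(k)]
    for src, degree, dests in rows:
        for b in range(k):
            part = [d for d in dests if d // block == b]
            if part:                         # repeat src and degree in every stripe it touches
                stripes[b].append((src, degree, part))
    return stripes, block

def block_stripe_step(stripes, block, r_old, beta):
    """One update step that holds only one block of r_new in memory at a time."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        lo = b * block
        hi = min(lo + block, n)
        chunk = [(1.0 - beta) / n] * (hi - lo)
        for src, degree, dests in stripe:
            for dst in dests:
                chunk[dst - lo] += beta * r_old[src] / degree
        r_new.extend(chunk)                  # stands in for writing the block to disk
    return r_new

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
stripes, block = make_stripes(rows, n=6, k=3)
r_old = [1 / 6] * 6
r_new = block_stripe_step(stripes, block, r_old, beta=0.8)
print(stripes)
print(r_new)
```

The repeated `src` and `degree` fields are the per-stripe overhead that the $\varepsilon$ in the cost formula accounts for.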

Some Problems with PageRank

Measures generic popularity of a page
  Biased against topic-specific authorities
  Solution: Topic-Specific PageRank (next)
Uses a single measure of importance
  Other models of importance
  Solution: Hubs-and-Authorities
Susceptible to Link spam
  Artificial link topologies created in order to boost PageRank
  Solution: TrustRank
