ch05 Linkanalysis1

Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Analysis of Large Graphs: Link Analysis, PageRank
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
New Topic: Graph Data!

Course map (topic area: methods covered):
High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Graph Data: Social Networks
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa ...]

Graph Data: Media Networks
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]

Graph Data: Information Nets
Citation networks and Maps of science
[Börner et al., 2012]

Graph Data: Communication Nets
Internet
[Figure: a router connecting domain1, domain2, and domain3]

Graph Data: Technological Networks
Seven Bridges of Königsberg [Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.

Web as a Graph
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
[Figure: example web pages as nodes: "I teach a class on Networks."; "CS224W: Classes are in the Gates building"; "Computer Science Department at Stanford"; "Stanford University"]

Web as a Directed Graph
Broad Question
How to organize the Web?
First try: Human curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates: Find relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents, random things, web spam, etc.

Web Search: 2 Challenges
2 challenges of web search:
(1) Web contains many sources of information. Who to "trust"?
Trick: Trustworthy pages may point to each other!
(2) What is the "best" answer to query "newspaper"?
No single right answer
Trick: Pages that actually know about newspapers might all be pointing to many newspapers

Ranking Nodes on the Graph
Not all web pages are equally "important"
www.joe-schmoe.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!

Link Analysis Algorithms
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
PageRank
Topic-Specific (Personalized) PageRank
Web Spam Detection Algorithms

PageRank: The Flow Formulation
Links as Votes
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links equal?
Links from important pages count more
Recursive question!

Example: PageRank Scores
[Figure: example graph with PageRank scores: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9, and five unlabeled nodes at 1.6 each]

Simple Recursive Formulation
Each link's vote is proportional to the importance of its source page
If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes
Page j's own importance is the sum of the votes on its in-links
$r_j = r_i/3 + r_k/4$
[Figure: pages i and k link to j, carrying votes $r_i/3$ and $r_k/4$; j has three out-links, each carrying $r_j/3$]

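PageRank: The Flow Model
A "vote" from an important page is worth more
A page is important if it is pointed to by other important pages
Define a "rank" $r_j$ for page j:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node i
"The web in 1839"
[Figure: three pages y, a, m; y sends y/2 to itself and y/2 to a; a sends a/2 to y and a/2 to m; m sends m to a]
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
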
Solving the Flow Equations
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
We need a new formulation!

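For a graph this small, the flow equations plus the normalization constraint can be solved directly. The following is a minimal sketch using exact fractions. The helper name `solve_linear` is introduced here for illustration, and the redundant third flow equation is swapped for the constraint that the ranks sum to 1.

```python
from fractions import Fraction as F

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot entry becomes 1.
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Unknowns are ordered (r_y, r_a, r_m).
# r_y = r_y/2 + r_a/2   ->   (1/2) r_y - (1/2) r_a          = 0
# r_a = r_y/2 + r_m     ->  -(1/2) r_y +       r_a - r_m    = 0
# constraint            ->         r_y +       r_a + r_m    = 1
a = [[F(1, 2), F(-1, 2), F(0)],
     [F(-1, 2), F(1), F(-1)],
     [F(1), F(1), F(1)]]
b = [F(0), F(0), F(1)]

r_y, r_a, r_m = solve_linear(a, b)
print(r_y, r_a, r_m)
```

The dropped equation $r_m = r_a/2$ still holds for the result, which is what makes it redundant.
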
PageRank: Matrix Formulation
Stochastic adjacency matrix $M$
Let page $i$ have $d_i$ out-links
If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
$M$ is a column stochastic matrix
Columns sum to 1
Rank vector $r$: vector with an entry per page
$r_i$ is the importance score of page $i$
$\sum_i r_i = 1$
The flow equations $r_j = \sum_{i \to j} r_i / d_i$ can be written $r = M \cdot r$

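As an illustration of the definition above, the sketch below builds $M$ from out-link lists for the y, a, m example graph. The function name `stochastic_matrix` and the dictionary input format are assumptions made for this example; it also assumes every page has at least one out-link and no repeated links.

```python
from fractions import Fraction

def stochastic_matrix(out_links, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i when page i links to page j."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    m = [[Fraction(0)] * n for _ in range(n)]
    for src, dests in out_links.items():
        for dst in dests:
            # Column = source page, row = destination page.
            m[index[dst]][index[src]] = Fraction(1, len(dests))
    return m

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, ["y", "a", "m"])

# Each column should sum to 1 because M is column stochastic.
column_sums = [sum(M[j][i] for j in range(3)) for i in range(3)]
print(M)
print(column_sums)
```
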
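Example
Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in the matrix form: $M \cdot r = r$
Suppose page i links to 3 pages, including j
[Figure: in $M \cdot r = r$, column i of M has the entry 1/3 in row j, so $r_i$ contributes $r_i/3$ to $r_j$]
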
Eigenvector Formulation
The flow equations can be written $r = M \cdot r$
So the rank vector r is an eigenvector of the stochastic web matrix M
In fact, its first or principal eigenvector, with corresponding eigenvalue 1
NOTE: x is an eigenvector with the corresponding eigenvalue $\lambda$ if: $A x = \lambda x$
Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)
We know r is unit length and each column of M sums to one, so $|M r|_1 \le 1$
We can now efficiently solve for r!
The method is called Power iteration

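Example: Flow Equations & M
[Figure: graph with pages y, a, m]
Matrix M (columns are source pages, rows are destination pages):
       y     a     m
  y   1/2   1/2    0
  a   1/2    0     1
  m    0    1/2    0
$r = M \cdot r$
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
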
Power Iteration Method
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
$d_i$ ... out-degree of node i
Suppose there are N web pages
Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
Iterate: $r^{(t+1)} = M \cdot r^{(t)}$
Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm, e.g., Euclidean

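The three steps above (initialize, iterate, stop) map onto a short loop. This is a minimal dense-matrix sketch in plain Python; the function name `power_iteration` and the defaults for `eps` and `max_iter` are assumptions chosen for the example, not values from the slides.

```python
def power_iteration(m, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(m)
    r = [1.0 / n] * n                      # Initialize: r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # Iterate: r(t+1) = M r(t), with M stored as a list of rows.
        r_next = [sum(m[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Stop when the L1 norm of the change is below eps.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix of the y, a, m example (columns are sources).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print(r)
```
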
PageRank: How to solve?
Power Iteration:
Set $r_j = 1/N$
1: $r'_j = \sum_{i \to j} r_i / d_i$
2: $r = r'$
Goto 1
Example (graph with pages y, a, m):
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
  Iteration    0      1      2      3      ...   limit
  r_y         1/3    1/3    5/12   9/24    ...   6/15
  r_a         1/3    3/6    1/3    11/24   ...   6/15
  r_m         1/3    1/6    3/12   1/6     ...   3/15

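The first iterations of this example can be reproduced exactly with rational arithmetic. The sketch below is illustrative only; `step` is a helper name introduced here. Note that Python prints fractions in lowest terms, so 9/24 appears as 3/8 and 3/12 as 1/4.

```python
from fractions import Fraction as F

def step(m, r):
    """One power-iteration step r' = M r, with M given as a list of rows."""
    return [sum(m[j][i] * r[i] for i in range(len(r))) for j in range(len(m))]

# Columns are source pages in the order y, a, m.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

r = [F(1, 3)] * 3
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)

for t, (r_y, r_a, r_m) in enumerate(history):
    print(t, r_y, r_a, r_m)

# The limit (6/15, 6/15, 3/15) is unchanged by a further step.
limit = [F(6, 15), F(6, 15), F(3, 15)]
limit_is_fixed = step(M, limit) == limit
```
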
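Why Power Iteration works? (1)
Details!
Power iteration: A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)
$r^{(1)} = M \cdot r^{(0)}$
$r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
$r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$
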
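Why Power Iteration works? (2)
Details!
Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$
Proof:
Assume M has n linearly independent eigenvectors, $x_1, x_2, \ldots, x_n$ with corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, where $\lambda_1 > \lambda_2 > \ldots > \lambda_n$
Vectors $x_1, x_2, \ldots, x_n$ form a basis and thus we can write: $r^{(0)} = c_1 x_1 + c_2 x_2 + \ldots + c_n x_n$
$M r^{(0)} = M (c_1 x_1 + c_2 x_2 + \ldots + c_n x_n) = c_1 (M x_1) + c_2 (M x_2) + \ldots + c_n (M x_n) = c_1 (\lambda_1 x_1) + c_2 (\lambda_2 x_2) + \ldots + c_n (\lambda_n x_n)$
Repeated multiplication on both sides produces $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \ldots + c_n (\lambda_n^k x_n)$
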
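Why Power Iteration works? (3)
Details!
Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$
Proof (continued):
Repeated multiplication on both sides produces $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \ldots + c_n (\lambda_n^k x_n)$
$M^k r^{(0)} = \lambda_1^k \left[ c_1 x_1 + c_2 (\lambda_2/\lambda_1)^k x_2 + \ldots + c_n (\lambda_n/\lambda_1)^k x_n \right]$
Since $\lambda_1 > \lambda_2$ then fractions $\lambda_2/\lambda_1, \lambda_3/\lambda_1, \ldots < 1$ and so $(\lambda_i/\lambda_1)^k \to 0$ as $k \to \infty$ (for all $i = 2 \ldots n$).
Thus: $M^k r^{(0)} \approx c_1 (\lambda_1^k x_1)$
Note if $c_1 = 0$ then the method won't converge
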
Random Walk Interpretation
Imagine a random web surfer:
At any time $t$, surfer is on some page $i$
At time $t+1$, the surfer follows an out-link from $i$ uniformly at random
Ends up on some page $j$ linked from $i$
Process repeats indefinitely
Let: $p(t)$ ... vector whose $i$th coordinate is the prob. that the surfer is at page $i$ at time $t$
So, $p(t)$ is a probability distribution over pages
[Figure: pages $i_1$, $i_2$, $i_3$ link to page j; $r_j = \sum_{i \to j} r_i / d_{out}(i)$]

The Stationary Distribution
Where is the surfer at time t+1?
Follows a link uniformly at random: $p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state $p(t+1) = M \cdot p(t) = p(t)$, then $p(t)$ is stationary distribution of a random walk
Our original rank vector $r$ satisfies $r = M \cdot r$
So, $r$ is a stationary distribution for the random walk
[Figure: pages $i_1$, $i_2$, $i_3$ link to page j]

Existence and Uniqueness
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0

PageRank: The Google Formulation
PageRank: Three Questions
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$ or equivalently $r = M r$
Does this converge?
Does it converge to what we want?
Are results reasonable?

Does this converge?
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
Example: [Figure: two pages a and b that link to each other]
  Iteration   0   1   2   3   ...
  r_a         1   0   1   0
  r_b         0   1   0   1

Does it converge to what we want?
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
Example: [Figure: page a links to page b; b has no out-links]
  Iteration   0   1   2   3   ...
  r_a         1   0   0   0
  r_b         0   1   0   0

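Both two-page examples can be checked by running the update a few times. This is an illustrative sketch; the matrices are read off the two tables (a two-page cycle, and a page whose only link leads to a dead end), and the helper names `step` and `trajectory` are introduced here.

```python
def step(m, r):
    """One update r' = M r, with M given as a list of rows."""
    return [sum(m[j][i] * r[i] for i in range(len(r))) for j in range(len(m))]

def trajectory(m, r0, steps):
    """Return [r(0), r(1), ..., r(steps)]."""
    out = [r0]
    for _ in range(steps):
        out.append(step(m, out[-1]))
    return out

# Columns are source pages in the order a, b.
cycle = [[0, 1],
         [1, 0]]       # a -> b and b -> a
dead_end = [[0, 0],
            [1, 0]]    # a -> b, and b has no out-links

cycle_path = trajectory(cycle, [1, 0], 3)
dead_path = trajectory(dead_end, [1, 0], 3)
print(cycle_path)   # the score keeps swapping between a and b
print(dead_path)    # the score moves to b and then disappears
```
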
PageRank: Problems
2 problems:
(1) Some pages are dead ends (have no out-links)
Random walk has "nowhere" to go to
Such pages cause importance to "leak out"
(2) Spider traps: (all out-links are within the group)
Random walker gets "stuck" in a trap
And eventually spider traps absorb all importance
[Figure: a graph with a dead end and a spider trap marked]

Problem: Spider Traps
Power Iteration:
Set $r_j = 1/N$
And iterate $r_j = \sum_{i \to j} r_i / d_i$
[Figure: graph with pages y, a, m; m is a spider trap]
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2 + r_m$
Example:
  Iteration    0      1      2      3      ...   limit
  r_y         1/3    2/6    3/12   5/24    ...   0
  r_a         1/3    1/6    2/12   3/24    ...   0
  r_m         1/3    3/6    7/12   16/24   ...   1
All the PageRank score gets "trapped" in node m.

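The absorption can be observed numerically. The sketch below reproduces the first three iterations exactly and then continues in floating point; the helper name `step` and the choice of 200 further iterations are assumptions made for the illustration.

```python
from fractions import Fraction as F

def step(m, r):
    """One power-iteration step r' = M r, with M given as a list of rows."""
    return [sum(m[j][i] * r[i] for i in range(len(r))) for j in range(len(m))]

# Columns are sources (y, a, m). Page m links only to itself: a spider trap.
M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]

r = [F(1, 3)] * 3
first_three = []
for _ in range(3):
    r = step(M_trap, r)
    first_three.append(r)

# Continue in floating point to see where the scores end up.
M_float = [[float(x) for x in row] for row in M_trap]
r_float = [float(x) for x in r]
for _ in range(200):
    r_float = step(M_float, r_float)

print(first_three)
print(r_float)
```
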
Solution: Teleports!
The Google solution for spider traps: At each time step, the random surfer has two options
With prob. $\beta$, follow a link at random
With prob. $1-\beta$, jump to some random page
Common values for $\beta$ are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
[Figure: graph with pages y, a, m, shown without and with teleport links]

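Problem: Dead Ends
Power Iteration:
Set $r_j = 1/N$
And iterate $r_j = \sum_{i \to j} r_i / d_i$
[Figure: graph with pages y, a, m; m is a dead end]
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2$
Example:
  Iteration    0      1      2      3      ...   limit
  r_y         1/3    2/6    3/12   5/24    ...   0
  r_a         1/3    1/6    2/12   3/24    ...   0
  r_m         1/3    1/6    1/12   2/24    ...   0
Here the PageRank "leaks" out since the matrix is not column stochastic.
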
Solution: Always Teleport!
Teleports: Follow random teleport links with probability 1.0 from dead-ends
Adjust matrix accordingly
[Figure: example graph with teleport links added from the dead end]

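One way to "adjust the matrix accordingly" is to replace each all-zero column with a uniform column of 1/N entries. The sketch below assumes that reading; `fix_dead_ends` and `step` are names introduced for the example. It compares three iterations with and without the adjustment on the dead-end graph.

```python
from fractions import Fraction as F

def step(m, r):
    """One power-iteration step r' = M r, with M given as a list of rows."""
    return [sum(m[j][i] * r[i] for i in range(len(r))) for j in range(len(m))]

def fix_dead_ends(m):
    """Replace each all-zero column (a dead end) with a uniform 1/N column."""
    n = len(m)
    fixed = [row[:] for row in m]
    for i in range(n):
        if all(m[j][i] == 0 for j in range(n)):
            for j in range(n):
                fixed[j][i] = F(1, n)
    return fixed

# Columns are sources (y, a, m). Page m has no out-links: a dead end.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(0)]]
M_fixed = fix_dead_ends(M)

r_raw = [F(1, 3)] * 3
r_fixed = [F(1, 3)] * 3
for _ in range(3):
    r_raw = step(M, r_raw)
    r_fixed = step(M_fixed, r_fixed)

print(sum(r_raw))     # total score after 3 steps without the fix
print(sum(r_fixed))   # total score after 3 steps with the fix
```

With the adjusted matrix every column sums to 1 again, so the total score is preserved at each step.
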
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a problem and why do teleports solve the problem?
Spider-traps are not a problem, but with traps PageRank scores are not what we want
Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead-ends are a problem
The matrix is not column stochastic so our initial assumptions are not met
Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go

Solution: Random Teleports
Google's solution that does it all:
At each step, random surfer has two options:
With probability $\beta$, follow a link at random
With probability $1-\beta$, jump to some random page
PageRank equation [Brin-Page, 98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$$
$d_i$ ... out-degree of node i
This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

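The Google Matrix
PageRank equation [Brin-Page, 98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$$
The Google Matrix A: $A = \beta M + (1-\beta) [1/N]_{N \times N}$
$[1/N]_{N \times N}$ ... N by N matrix where all entries are 1/N
We have a recursive problem: $r = A \cdot r$
And the Power method still works!
What is $\beta$?
In practice $\beta = 0.8, 0.9$ (make 5 steps on avg., jump)
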
Random Teleports ($\beta = 0.8$)
$A = 0.8 \, M + 0.2 \, [1/N]_{N \times N}$ for the graph with pages y, a, m (m is a spider trap):

  M:                   [1/N]_{NxN}:
  1/2   1/2   0        1/3   1/3   1/3
  1/2    0    0        1/3   1/3   1/3
   0    1/2   1        1/3   1/3   1/3

  A:
  y   7/15   7/15    1/15
  a   7/15   1/15    1/15
  m   1/15   7/15   13/15

  Iteration    0      1      2      3      ...   limit
  r_y         1/3    0.33   0.28   0.26    ...   7/33
  r_a         1/3    0.20   0.20   0.18    ...   5/33
  r_m         1/3    0.47   0.52   0.56    ...   21/33

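The Google matrix for this example can be built and checked with exact fractions. This is a small dense sketch; `google_matrix` and `step` are names introduced here for illustration, and a dense $A$ is only practical for tiny graphs.

```python
from fractions import Fraction as F

def step(m, r):
    """One power-iteration step r' = M r, with M given as a list of rows."""
    return [sum(m[j][i] * r[i] for i in range(len(r))) for j in range(len(m))]

def google_matrix(m, beta):
    """A = beta * M + (1 - beta) * [1/N], computed entry by entry."""
    n = len(m)
    return [[beta * m[j][i] + (1 - beta) / n for i in range(n)] for j in range(n)]

# Columns are sources (y, a, m); m is a spider trap.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]
A = google_matrix(M, F(4, 5))   # beta = 0.8

r1 = step(A, [F(1, 3)] * 3)
r2 = step(A, r1)

# The limit vector is unchanged by a further multiplication with A.
r_limit = [F(7, 33), F(5, 33), F(21, 33)]
is_fixed_point = step(A, r_limit) == r_limit

print(A)
print(r1, r2, is_fixed_point)
```
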
How do we actually compute the PageRank?
Computing PageRank
Key step is matrix-vector multiplication
$r^{new} = A \cdot r^{old}$
Easy if we have enough main memory to hold $A$, $r^{old}$, $r^{new}$
Say N = 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for vectors, approx 8GB
Matrix A has $N^2$ entries
$10^{18}$ is a large number!
$A = \beta M + (1-\beta) [1/N]_{N \times N}$

        1/2  1/2   0           1/3  1/3  1/3       7/15   7/15    1/15
A = 0.8 1/2   0    0   + 0.2   1/3  1/3  1/3   =   7/15   1/15    1/15
         0   1/2   1           1/3  1/3  1/3       1/15   7/15   13/15

Matrix Formulation
Suppose there are N pages
Consider page i, with $d_i$ out-links
We have $M_{ji} = 1/|d_i|$ when $i \to j$ and $M_{ji} = 0$ otherwise
The random teleport is equivalent to:
Adding a teleport link from i to every other page and setting transition probability to $(1-\beta)/N$
Reducing the probability of following each out-link from $1/|d_i|$ to $\beta/|d_i|$
Equivalent: Tax each page a fraction $(1-\beta)$ of its score and redistribute evenly

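Rearranging the Equation
$r = A \cdot r$, where $A_{ji} = \beta M_{ji} + \frac{1-\beta}{N}$
$r_j = \sum_{i=1}^{N} A_{ji} \cdot r_i$
$r_j = \sum_{i=1}^{N} \left[ \beta M_{ji} + \frac{1-\beta}{N} \right] \cdot r_i$
$= \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N} \sum_{i=1}^{N} r_i$
$= \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N}$ since $\sum_i r_i = 1$
So we get: $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$
Note: Here we assumed M has no dead-ends
$[x]_N$ ... a vector of length N with all entries x
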
Sparse Matrix Formulation
We just rearranged the PageRank equation
$$r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$$
where $[(1-\beta)/N]_N$ is a vector with all N entries $(1-\beta)/N$
M is a sparse matrix! (with no dead-ends)
10 links per node, approx 10N entries
So in each iteration, we need to:
Compute $r^{new} = \beta M \cdot r^{old}$
Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$
Note if M contains dead-ends then $\sum_j r_j^{new} < 1$ and we also have to renormalize $r^{new}$ so that it sums to 1

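PageRank: The Complete Algorithm
Input: Graph $G$ and parameter $\beta$
Directed graph $G$ (can have spider traps and dead ends)
Parameter $\beta$
Output: PageRank vector $r^{new}$
Set: $r_j^{old} = 1/N$
repeat until convergence: $\sum_j |r_j^{new} - r_j^{old}| < \varepsilon$
  $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$
  $r'^{new}_j = 0$ if in-degree of $j$ is 0
  Now re-insert the leaked PageRank:
  $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1-S}{N}$ where: $S = \sum_j r'^{new}_j$
  $r^{old} = r^{new}$
If the graph has no dead-ends then the amount of leaked PageRank is $1-\beta$. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
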
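The algorithm above translates fairly directly into code over adjacency lists. The following is a sketch under stated assumptions: nodes are numbered 0 to n-1, `out_links` maps each node to its list of destinations, and the function name `pagerank` and the defaults for `eps` and `max_iter` are chosen for the example.

```python
def pagerank(out_links, n, beta=0.8, eps=1e-12, max_iter=1000):
    """Sparse PageRank with re-insertion of leaked score (handles dead ends)."""
    r_old = [1.0 / n] * n
    for _ in range(max_iter):
        # r'_j = sum over in-links i -> j of beta * r_old[i] / d_i
        r_new = [0.0] * n
        for i, dests in out_links.items():
            if dests:                          # dead ends contribute nothing here
                share = beta * r_old[i] / len(dests)
                for j in dests:
                    r_new[j] += share
        # Re-insert the leaked PageRank: add (1 - S) / N to every entry.
        s = sum(r_new)
        r_new = [x + (1.0 - s) / n for x in r_new]
        if sum(abs(a - b) for a, b in zip(r_new, r_old)) < eps:
            return r_new
        r_old = r_new
    return r_old

# Spider-trap example: y = 0, a = 1, m = 2 (m links only to itself).
r_trap = pagerank({0: [0, 1], 1: [0, 2], 2: [2]}, n=3)

# Dead-end example: m has no out-links.
r_dead = pagerank({0: [0, 1], 1: [0, 2], 2: []}, n=3)

print(r_trap)
print(r_dead)
```

Because the leaked amount is recomputed from S in every iteration, the same loop covers graphs with and without dead ends, and the result always sums to 1.
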
Sparse Matrix Encoding
Encode sparse matrix using only nonzero entries
Space proportional roughly to number of links
Say 10N, or 4*10*1 billion = 40GB
Still won't fit in memory, but will fit on disk

  source node   degree   destination nodes
  0             3        1, 5, 7
  1             5        17, 64, 113, 117, 245
  2             2        13, 23

Basic Algorithm: Update Step
Assume enough RAM to fit $r^{new}$ into memory
Store $r^{old}$ and matrix M on disk
1 step of power-iteration is:
Initialize all entries of $r^{new} = (1-\beta)/N$
For each page i (of out-degree $d_i$):
  Read into memory: i, $d_i$, $dest_1, \ldots, dest_{d_i}$, $r^{old}(i)$
  For j = 1 ... $d_i$
    $r^{new}(dest_j)$ += $\beta \, r^{old}(i) / d_i$

  source   degree   destination
  0        3        1, 5, 6
  1        4        17, 64, 113, 117
  2        2        13, 23

[Figure: vectors $r^{new}$ and $r^{old}$ with entries 0 to 6]

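The update step can be sketched as a single sequential pass over the encoded rows of M. In the example below the on-disk file is simulated with `io.StringIO`, and the line format "source degree dest1 dest2 ..." is a hypothetical text encoding chosen for illustration, as are the names `parse_rows` and `update_step`.

```python
import io

def parse_rows(handle):
    """Yield (source, degree, destinations) from lines 'src degree dest1 dest2 ...'."""
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        yield int(fields[0]), int(fields[1]), [int(x) for x in fields[2:]]

def update_step(rows, r_old, beta):
    """One power-iteration step; only r_new is built up in memory."""
    n = len(r_old)
    r_new = [(1 - beta) / n] * n               # initialize all entries to (1 - beta) / N
    for src, degree, dests in rows:            # one sequential pass over M
        share = beta * r_old[src] / degree
        for dst in dests:
            r_new[dst] += share
    return r_new

# Spider-trap example graph: y = 0, a = 1, m = 2.
encoded = "0 2 0 1\n1 2 0 2\n2 1 2\n"
r_old = [1 / 3, 1 / 3, 1 / 3]
r_new = update_step(parse_rows(io.StringIO(encoded)), r_old, beta=0.8)
print(r_new)
```
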
Analysis
Assume enough RAM to fit $r^{new}$ into memory
Store $r^{old}$ and matrix M on disk
In each iteration, we have to:
Read $r^{old}$ and M
Write $r^{new}$ back to disk
Cost per iteration of Power method: = 2|r| + |M|
Question: What if we could not even fit $r^{new}$ in memory?

Block-based Update Algorithm
Break $r^{new}$ into k blocks that fit in memory
Scan M and $r^{old}$ once for each block

  src   degree   destination
  0     4        0, 1, 3, 5
  1     2        0, 5
  2     2        3, 4

[Figure: $r^{new}$ (entries 0 to 5) split into blocks; $r^{old}$ with entries 0 to 5]

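A minimal sketch of the block-based update follows, using the three rows from the table above. The function name `block_update` and the returned scan counter are additions for illustration. Keeping only one block of the new vector "in memory" at a time is simulated with a small list.

```python
from fractions import Fraction as F

def block_update(rows, r_old, beta, block_size):
    """One update step that holds only one block of r_new in memory at a time."""
    n = len(r_old)
    r_new = []
    scans = 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        block = [(1 - beta) / n] * (end - start)   # the block currently in memory
        scans += 1
        for src, degree, dests in rows:            # full scan of M and r_old
            share = beta * r_old[src] / degree
            for dst in dests:
                if start <= dst < end:             # ignore other blocks' destinations
                    block[dst - start] += share
        r_new.extend(block)                        # stands in for writing the block out
    return r_new, scans

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [F(1, 6)] * 6
beta = F(4, 5)

blocked, scans = block_update(rows, r_old, beta, block_size=2)
single, _ = block_update(rows, r_old, beta, block_size=6)
print(blocked, scans)
```

With 6 entries and blocks of 2, M is scanned 3 times, which is the k|M| term in the cost analysis.
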
Analysis of Block Update
Similar to nested-loop join in databases
Break $r^{new}$ into k blocks that fit in memory
Scan M and $r^{old}$ once for each block
Total cost:
k scans of M and $r^{old}$
Cost per iteration of Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

Block-Stripe Update Algorithm
[Figure: M split into stripes, one per block of $r^{new}$; each stripe lists (src, degree, destination) rows restricted to the destinations in its block; $r^{old}$ has entries 0 to 5]
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of $r^{new}$

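The striping idea can be sketched as a preprocessing pass that splits each row of M by destination block, followed by an update that reads only one stripe per block. The names `make_stripes` and `stripe_update` and the sample rows are assumptions for this illustration; each stripe row keeps the full out-degree of its source so that the shares stay correct.

```python
from fractions import Fraction as F

def make_stripes(rows, n, block_size):
    """Split M into stripes; stripe b keeps only destinations in block b of r_new."""
    num_blocks = (n + block_size - 1) // block_size
    stripes = [[] for _ in range(num_blocks)]
    for src, degree, dests in rows:
        for b in range(num_blocks):
            part = [d for d in dests if d // block_size == b]
            if part:
                # The degree stays the full out-degree of the source page.
                stripes[b].append((src, degree, part))
    return stripes

def stripe_update(stripes, r_old, beta, block_size):
    """One update step that reads each stripe of M exactly once."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        start = b * block_size
        end = min(start + block_size, n)
        block = [(1 - beta) / n] * (end - start)
        for src, degree, dests in stripe:
            share = beta * r_old[src] / degree
            for dst in dests:
                block[dst - start] += share
        r_new.extend(block)
    return r_new

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
stripes = make_stripes(rows, n=6, block_size=2)
r_new = stripe_update(stripes, [F(1, 6)] * 6, F(4, 5), block_size=2)
stripe_rows = sum(len(s) for s in stripes)
print(stripes)
print(r_new)
```

The stripes hold 7 rows in total against 3 rows in the original encoding, since a source whose links span several blocks is repeated; this is the per-stripe overhead mentioned in the analysis.
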
Block-Stripe Analysis
Break M into stripes
Each stripe contains only destination nodes in the corresponding block of $r^{new}$
Some additional overhead per stripe
But it is usually worth it
Cost per iteration of Power method: $= |M|(1+\varepsilon) + (k+1)|r|$

Some Problems with PageRank
Measures generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next)
Uses a single measure of importance
Other models of importance
Solution: Hubs-and-Authorities
Susceptible to Link spam
Artificial link topographies created in order to boost page rank
Solution: TrustRank
