material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Analysis of Large Graphs: Link Analysis, PageRank
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
[Figure: course topic map]
High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection
[Figure: network diagram with labels domain1, domain2, domain3, router, Internet]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
Web as a Graph
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
[Figure: example pages joined by hyperlinks: "I teach a class on Networks." / "CS224W: Classes are in the Gates building" / "Computer Science Department at Stanford" / "Stanford University"]
Broad Question
How
First
Second
Web Search: 2 Challenges
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Trick: Trustworthy pages may point to each other!
(2)
There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!
PageRank: The Flow Formulation
Links as Votes
Idea: Links as votes
Think of in-links as votes:
Are
Example: PageRank Scores
[Figure: example graph annotated with PageRank scores]
A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; five unlabeled pages: 1.6 each
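Simple Recursive Formulation
Each link's vote is proportional to the importance of its source page
If page j with importance $r_j$ has n out-links, each link gets $r_j/n$ votes
Page j's own importance is the sum of the votes on its in-links
[Figure: page j has in-links from i (3 out-links) and k (4 out-links) and three out-links of its own, each carrying $r_j/3$]
$r_j = r_i/3 + r_k/4$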
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank $r_j$ for page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
$d_i$ ... out-degree of node $i$
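[Figure: three-page graph y, a, m. Page y sends y/2 to itself and y/2 to a; page a sends a/2 to y and a/2 to m; page m sends all of its rank to a]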
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo the scale factor
Additional
Solution:
Gaussian
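The three flow equations can be solved directly for this small graph. The sketch below is only an illustration: it assumes the additional constraint is the normalization $r_y + r_a + r_m = 1$ (the slide text is truncated here), and `solve_flow_equations` is a hypothetical helper that runs plain Gauss-Jordan elimination with exact fractions.

```python
from fractions import Fraction as F

def solve_flow_equations():
    # Augmented rows for unknowns (r_y, r_a, r_m). The third flow equation is
    # redundant, so it is replaced by the ASSUMED constraint r_y + r_a + r_m = 1.
    rows = [
        [F(-1, 2), F(1, 2), F(0), F(0)],  # r_y/2 + r_a/2 - r_y = 0
        [F(1, 2), F(-1), F(1), F(0)],     # r_y/2 + r_m - r_a = 0
        [F(1), F(1), F(1), F(1)],         # r_y + r_a + r_m = 1
    ]
    n = len(rows)
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row to 1, then clear this column in all other rows.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

r_y, r_a, r_m = solve_flow_equations()
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination costs on the order of N^3 operations, which is fine for three pages but not for a web-sized graph.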
PageRank: Matrix Formulation
Stochastic adjacency matrix $M$: $M_{ji} = 1/d_i$ if $i \to j$, else $M_{ji} = 0$
Rank vector $r$: one entry $r_i$ per page
The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written $r = M \cdot r$
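Example
Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in the matrix form: $M \cdot r = r$
Suppose page i links to 3 pages, including j
[Figure: in $M \cdot r = r$, column $i$ of $M$ has the entry 1/3 in row $j$, so $r_i$ contributes $r_i/3$ to $r_j$]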
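As a small illustration of the matrix form, the sketch below builds the column-stochastic matrix for the three-page graph y, a, m used in the flow equations. The helper names `column_stochastic` and `mat_vec` are hypothetical, and the adjacency-list input format is an assumption made for the example.

```python
from fractions import Fraction as F

def column_stochastic(out_links, pages):
    """Return M as a list of rows with M[j][i] = 1/d_i if i -> j, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[F(0)] * n for _ in range(n)]
    for src, dests in out_links.items():
        for dst in dests:
            # Page src splits its vote equally among its d_src out-links.
            M[index[dst]][index[src]] = F(1, len(dests))
    return M

def mat_vec(M, r):
    """Compute the product M * r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic(out_links, pages)
for row in M:
    print([str(v) for v in row])
```

Each column of `M` sums to 1, and a vector that satisfies the flow equations is left unchanged by `mat_vec(M, r)`.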
Eigenvector Formulation
The
So
NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if: $Ax = \lambda x$
We
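[Figure: graph with pages y, a, m]
$r = M \cdot r$
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2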
PageRank: How to solve?
Power Iteration:
Set $r_j = 1/N$
1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
2: $r = r'$
Goto 1
Example:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

Iteration   0     1     2     3      ...  limit
ry          1/3   1/3   5/12  9/24   ...  6/15
ra          1/3   3/6   1/3   11/24  ...  6/15
rm          1/3   1/6   3/12  1/6    ...  3/15
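The iteration table can be reproduced with a few lines of code. This is a minimal sketch, not the authors' implementation: `step` and `power_iterate` are hypothetical names, and the L1 stopping threshold `eps` is an assumed convergence test.

```python
from fractions import Fraction as F

# Column-stochastic matrix for the graph y -> {y, a}, a -> {y, m}, m -> {a}.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(1)],
     [F(0), F(1, 2), F(0)]]

def step(M, r):
    """One power-iteration step: r' = M * r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

def power_iterate(M, eps=1e-10, max_iter=1000):
    """Iterate in floating point until the L1 change drops below eps."""
    n = len(M)
    Mf = [[float(v) for v in row] for row in M]
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = step(Mf, r)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Exact fractions for the first three iterations, matching the table columns.
r = [F(1, 3)] * 3
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)
print([str(v) for v in history[3]])  # ['3/8', '11/24', '1/6']
print(power_iterate(M))              # approaches [0.4, 0.4, 0.2]
```

Note that 9/24 prints in lowest terms as 3/8, and the limit 6/15, 6/15, 3/15 equals 0.4, 0.4, 0.2.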
Details!
Power iteration: A method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
Claim:
Thus:
Note if then the method won't converge
Random Walk Interpretation
Imagine
[Figure: pages i1, i2, i3 link to page j]
$$r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$$
Ends up on some page j linked from i
Process repeats indefinitely
Let:
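The Stationary Distribution
Where
[Figure: pages i1, i2, i3 link to page j]
$p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state $p(t+1) = M \cdot p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk
Our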
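The random-walk view can be checked empirically. The sketch below simulates a single surfer on the three-page graph y, a, m and counts visits; `simulate_surfer`, the fixed seed, and the step count are assumptions chosen for illustration, not part of the slides.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow a uniformly random out-link at every step; return visit frequencies."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", steps=200_000)
print(freq)
```

For this graph the long-run visit frequencies should land close to the flow-equation solution 0.4, 0.4, 0.2, which is what a stationary distribution predicts.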
Existence and Uniqueness
A
PageRank: The Google Formulation
PageRank: Three Questions
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i} \quad \text{or equivalently} \quad r = Mr$$
Does this converge?
Does it converge to what we want?
Are results reasonable?
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
Example:
Iteration   0   1   2   3
ra          1   0   1   0
rb          0   1   0   1
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
Example:
Iteration   0   1   2   3
ra          1   0   0   0
rb          0   1   0   0
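Both two-page examples can be reproduced with the same update rule. The graphs are not shown in the extracted text, so the sketch assumes they are a two-page cycle (a and b link to each other) and a chain into a dead end (a links to b, b has no out-links); these assumed graphs do reproduce the tabulated values. `iterate` is a hypothetical helper.

```python
def iterate(out_links, r, steps):
    """Apply r_j <- sum over i -> j of r_i / d_i and record every iterate."""
    history = [dict(r)]
    for _ in range(steps):
        nxt = {p: 0.0 for p in out_links}
        for src, dests in out_links.items():
            for dst in dests:
                nxt[dst] += r[src] / len(dests)
        r = nxt
        history.append(dict(r))
    return history

# ASSUMED graphs: a <-> b (periodic), and a -> b with b a dead end.
cycle = iterate({"a": ["b"], "b": ["a"]}, {"a": 1.0, "b": 0.0}, 3)
dead_end = iterate({"a": ["b"], "b": []}, {"a": 1.0, "b": 0.0}, 3)

print([h["a"] for h in cycle], [h["b"] for h in cycle])
print([h["a"] for h in dead_end], [h["b"] for h in dead_end])
```

In the first case the scores swap forever and never settle; in the second, all rank leaks out once it reaches the page without out-links.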
PageRank: Problems
2 problems:
(1) Some pages are dead ends (have no out-links)
(2) Spider traps: (all out-links are within the group)
Random walker gets stuck in a trap
And eventually spider traps absorb all importance
[Figure: example graph with a dead end and a spider trap labeled]
Power Iteration:
Set $r_j = 1/N$
And iterate
[Figure: graph with pages y, a, m]
m is a spider trap
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
Example:
Iteration   0     1     2     3      ...  limit
ry          1/3   2/6   3/12  5/24   ...  0
ra          1/3   1/6   2/12  3/24   ...  0
rm          1/3   3/6   7/12  16/24  ...  1
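A short experiment shows the trap absorbing the rank. The sketch below encodes the spider-trap flow equations as a column-stochastic matrix and iterates with exact fractions; the matrix layout and the choice of 50 iterations are assumptions made for the illustration.

```python
from fractions import Fraction as F

# Column-stochastic matrix for y -> {y, a}, a -> {y, m}, m -> {m} (m is the trap).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(1)]]

def step(M, r):
    """One power-iteration step: r' = M * r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

r = [F(1, 3)] * 3
history = [r]
for _ in range(50):
    r = step(M, r)
    history.append(r)

print([str(v) for v in history[3]])      # ['5/24', '1/8', '2/3']
print([float(v) for v in history[-1]])   # nearly all rank sits on m
```

Total rank is conserved (each iterate still sums to 1), but it drains from y and a into m.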
Solution: Teleports!
The
Surfer will teleport out of the spider trap within a few time steps
[Figure: graph with pages y, a, m]
Power Iteration:
Set $r_j = 1/N$
And iterate
[Figure: graph with pages y, a, m]
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Example:
Iteration   0     1     2     3      ...  limit
ry          1/3   2/6   3/12  5/24   ...  0
ra          1/3   1/6   2/12  3/24   ...  0
rm          1/3   1/6   1/12  2/24   ...  0
Solution: Always Teleport!
Teleports:
[Figure: graph with pages y, a, m]
are a problem
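Solution: Random Teleports
Google's PageRank:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$$
$\beta$ ... probability that the surfer follows a link rather than teleporting
$d_i$ ... out-degree of node $i$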
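Google Matrix A:
$$A = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N}$$
$\beta$ ... probability of following a link (with probability $1 - \beta$ the surfer teleports)
$[1/N]_{N \times N}$ ... N by N matrix where all entries are 1/N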
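Random Teleports ($\beta = 0.8$)
$$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$
[Figure: graph with pages y, a, m; edges labeled with the entries of A]

Iteration   0     1      2      3      ...  limit
y           1/3   0.33   0.28   0.26   ...  7/33
a           1/3   0.20   0.20   0.18   ...  5/33
m           1/3   0.47   0.52   0.56   ...  21/33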
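The example matrix and its limit can be verified numerically. This is a sketch under the assumption that the Google matrix is $A = \beta M + (1-\beta)[1/N]_{N \times N}$ with $\beta = 0.8$ and that $M$ is the spider-trap matrix from the earlier example; `google_matrix` and `step` are hypothetical helper names.

```python
from fractions import Fraction as F

def google_matrix(M, beta):
    """A = beta * M + (1 - beta) * [1/N], computed entry by entry."""
    n = len(M)
    return [[beta * m + (1 - beta) * F(1, n) for m in row] for row in M]

def step(A, r):
    """One power-iteration step: r' = A * r."""
    return [sum(a * x for a, x in zip(row, r)) for row in A]

# Spider-trap graph: y -> {y, a}, a -> {y, m}, m -> {m}.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(1)]]
A = google_matrix(M, F(4, 5))

# Iterate in floating point from the uniform vector.
A_float = [[float(v) for v in row] for row in A]
r = [1 / 3] * 3
for _ in range(100):
    r = step(A_float, r)
print([str(v) for v in A[0]])  # ['7/15', '7/15', '1/15']
print(r)                       # approaches [7/33, 5/33, 21/33]
```

With teleports, m still scores highest, but it no longer absorbs all of the rank.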
How do we actually compute the PageRank?
Key step is matrix-vector multiplication
$r^{new} = A \cdot r^{old}$
Easy
Matrix Formulation
Suppose there are N pages
Consider page i, with di out-links
We have $M_{ji} = 1/|d_i|$ when $i \to j$ and $M_{ji} = 0$ otherwise
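Rearranging the Equation
$r = A \cdot r$, where $A_{ji} = \beta M_{ji} + \frac{1-\beta}{N}$
$r_j = \sum_{i=1}^{N} A_{ji} \cdot r_i$
$r_j = \sum_{i=1}^{N} \left[ \beta M_{ji} + \frac{1-\beta}{N} \right] \cdot r_i$
$\quad = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N} \sum_{i=1}^{N} r_i$
$\quad = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N}$ since $\sum_i r_i = 1$
So we get: $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$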
Sparse Matrix Formulation
We
$M$ is a sparse matrix! (with no dead-ends)
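The practical payoff of the sparse form is that one step only touches the links that actually exist. The sketch below compares a dense step using the full matrix $A_{ji} = \beta M_{ji} + (1-\beta)/N$ with a sparse step $\beta M \cdot r + (1-\beta)/N$ on a graph with no dead ends. Both function names and the adjacency-list representation are assumptions for illustration.

```python
def dense_step(out_links, r, beta):
    """r_new = A r with A_ji = beta*M_ji + (1-beta)/N; visits all N^2 entries."""
    n = len(r)
    r_new = [0.0] * n
    for j in range(n):
        for i in range(n):
            m_ji = 1.0 / len(out_links[i]) if j in out_links[i] else 0.0
            r_new[j] += (beta * m_ji + (1 - beta) / n) * r[i]
    return r_new

def sparse_step(out_links, r, beta):
    """r_new = beta*M r + (1-beta)/N; visits only the nonzero entries of M."""
    n = len(r)
    r_new = [(1 - beta) / n] * n       # constant teleport term
    for i, dests in enumerate(out_links):
        for j in dests:
            r_new[j] += beta * r[i] / len(dests)
    return r_new

# Pages 0=y, 1=a, 2=m with y -> {y, a}, a -> {y, m}, m -> {m}; no dead ends.
out_links = [[0, 1], [0, 2], [2]]
r = [1 / 3, 1 / 3, 1 / 3]
print(dense_step(out_links, r, 0.8))
print(sparse_step(out_links, r, 0.8))
```

The two results agree only while the entries of `r` sum to 1 and every page has at least one out-link.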
PageRank vector
Set: $r_j^{old} = \frac{1}{N}$
repeat until convergence: $\sum_j |r_j^{new} - r_j^{old}| < \varepsilon$
  $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$
  $r'^{new}_j = 0$ if in-degree of $j$ is 0
  Now re-insert the leaked PageRank:
  $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1-S}{N}$ where: $S = \sum_j r'^{new}_j$
  $r^{old} = r^{new}$
If the graph has no dead-ends then the amount of leaked PageRank is $1-\beta$. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
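The loop above translates almost line for line into code. The following is a minimal sketch rather than a reference implementation: `pagerank`, its default `beta=0.8`, the L1 threshold `eps`, and the list-of-lists graph format are all assumptions made for the example.

```python
def pagerank(out_links, beta=0.8, eps=1e-10, max_iter=1000):
    """out_links[i] lists the pages that page i links to (empty for a dead end)."""
    n = len(out_links)
    r_old = [1.0 / n] * n
    for _ in range(max_iter):
        # Distribute beta * r_i / d_i along every link i -> j.
        r_new = [0.0] * n
        for i, dests in enumerate(out_links):
            for j in dests:
                r_new[j] += beta * r_old[i] / len(dests)
        # Re-insert the leaked PageRank (teleports plus dead ends) evenly.
        s = sum(r_new)
        r_new = [x + (1.0 - s) / n for x in r_new]
        if sum(abs(a - b) for a, b in zip(r_new, r_old)) < eps:
            return r_new
        r_old = r_new
    return r_old

print(pagerank([[0, 1], [0, 2], [2]]))   # spider trap at page 2
print(pagerank([[0, 1], [0, 2], []]))    # page 2 is a dead end
```

Because the leaked mass is measured through `s` instead of assumed to be `1 - beta`, the scores still sum to 1 when dead ends are present.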
[Figure: sparse encoding of M, one row per source node; vectors rnew and rold indexed 0 to 6]
source  degree  destination
0       3       1, 5, 6
1       4       17, 64, 113, 117
2       2       13, 23
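With M stored as (source, degree, destinations) rows, one update step is a single sequential pass over those rows. The sketch below assumes r_old and r_new both fit in memory and that the graph has no dead ends, so r_new can start at $(1-\beta)/N$. The three-page `rows` list is a hypothetical toy input, not the table from the slide.

```python
def update_step(encoded_rows, r_old, beta):
    """One pass over the sparse encoding: r_new[dest] += beta * r_old[src] / degree."""
    n = len(r_old)
    r_new = [(1 - beta) / n] * n          # teleport share (assumes no dead ends)
    for src, degree, dests in encoded_rows:
        share = beta * r_old[src] / degree
        for dst in dests:
            r_new[dst] += share
    return r_new

# Hypothetical toy graph: 0 -> {0, 1}, 1 -> {0, 2}, 2 -> {2}.
rows = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r = update_step(rows, [1 / 3] * 3, 0.8)
print(r)
```

Reads of `rows` and `r_old` are sequential, while writes to `r_new` jump around, which is why `r_new` is the vector that needs to be memory-resident.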
Analysis
Assume
In
Block-based Update Algorithm
[Figure: rnew split into blocks {0, 1}, {2, 3}, {4, 5}; rold indexed 0 to 5; M stored as the table below]
src  degree  destination
0    4       0, 1, 3, 5
1    2       0, 5
2    2       3, 4
Similar to nested-loop join in databases
Break rnew into k blocks that fit in memory
Scan M and rold once for each block
Total cost:
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
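The block-based scheme can be sketched as follows. Only one block of rnew is held at a time, and M is scanned once per block, which is the cost the hint warns about. `block_update` is a hypothetical name; the rows are taken from the three-row table in the figure, which shows only part of a graph, so pages 3 to 5 have no listed out-links and the result does not sum to 1.

```python
def block_update(encoded_rows, r_old, beta, block_size):
    """Compute r_new block by block; returns (r_new, number of scans of M)."""
    n = len(r_old)
    r_new, scans = [], 0
    for lo in range(0, n, block_size):
        hi = min(lo + block_size, n)
        block = [(1 - beta) / n] * (hi - lo)    # the only part of r_new in memory
        for src, degree, dests in encoded_rows:  # full scan of M for every block
            for dst in dests:
                if lo <= dst < hi:
                    block[dst - lo] += beta * r_old[src] / degree
        scans += 1
        r_new.extend(block)
    return r_new, scans

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [1 / 6] * 6
r_blocked, scans = block_update(rows, r_old, 0.8, block_size=2)
print(scans, r_blocked)
```

With k blocks, M is read k times per iteration even though each scan uses only the destinations that fall in the current block.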
Block-Stripe Update Algorithm
[Figure: M broken into stripes, one per block of rnew; each stripe is a table of src, degree, destination that lists only the destinations falling in its block; rold indexed 0 to 5]
Block-Stripe Analysis
Break M into stripes
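Striping can be sketched by splitting each encoded row by destination block while keeping the full out-degree. The helper names `make_stripes` and `stripe_update` are hypothetical, and the input rows are an assumed three-row example over six pages (some pages have no listed out-links, so the scores do not sum to 1).

```python
def make_stripes(encoded_rows, n, block_size):
    """Stripe b keeps, for every source, only the destinations inside block b."""
    stripes = [[] for _ in range((n + block_size - 1) // block_size)]
    for src, degree, dests in encoded_rows:
        by_block = {}
        for dst in dests:
            by_block.setdefault(dst // block_size, []).append(dst)
        for b, ds in by_block.items():
            stripes[b].append((src, degree, ds))  # degree stays the full out-degree
    return stripes

def stripe_update(stripes, r_old, beta, block_size):
    """Each block of r_new reads only its own stripe of M."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        lo = b * block_size
        block = [(1 - beta) / n] * (min(lo + block_size, n) - lo)
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - lo] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [1 / 6] * 6
stripes = make_stripes(rows, n=6, block_size=2)
r_new = stripe_update(stripes, r_old, 0.8, block_size=2)
print(stripes[0])  # [(0, 4, [0, 1]), (1, 2, [0])]
```

Each stripe repeats the source and degree fields, which is the extra per-stripe overhead, but every entry of M is now read once per iteration instead of once per block.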
generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next)
Uses
to Link spam