DeepWalk: Online Learning of Social Representations
Bryan Perozzi
Rami Al-Rfou
Steven Skiena
ABSTRACT
We present DeepWalk, a novel approach for learning latent
representations of vertices in a network. These latent representations encode social relations in a continuous vector
space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling
and unsupervised feature learning (or deep learning) from
sequences of words to graphs.
DeepWalk uses local information obtained from truncated
random walks to learn latent representations by treating
walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network
classification tasks for social networks such as BlogCatalog,
Flickr, and YouTube. Our results show that DeepWalk
outperforms challenging baselines which are allowed a global
view of the network, especially in the presence of missing
information. DeepWalk's representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's
representations are able to outperform all baseline methods
while using 60% less training data.
DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially
parallelizable. These qualities make it suitable for a broad
class of real-world applications such as network classification and anomaly detection.
Keywords
social networks; deep learning; latent representations; learning with partial labels; network classification; online learning
Figure 1: Our proposed method learns a latent space representation of social interactions in R^d. The learned representation encodes community structure so it can be easily exploited by standard classification methods. Here, our method is used on Zachary's Karate network [44] to generate a latent representation in R^2. Note the correspondence between community structure in the input graph and the embedding. Vertex colors represent a modularity-based clustering of the input graph.
1. INTRODUCTION
2. PROBLEM DEFINITION
3. LEARNING SOCIAL REPRESENTATIONS
[Figure 2: Frequency of vertex occurrence in short random walks. Left panel: # of vertices vs. vertex visitation count; right panel: # of words vs. word mention count, both on logarithmic axes.]
3.1 Random Walks
3.2 Language Modeling
4. METHOD
4.1 Overview
4.2 Algorithm: DeepWalk

Algorithm 1 DeepWalk(G, w, d, γ, t)
Input: graph G(V, E)
    window size w
    embedding size d
    walks per vertex γ
    walk length t
Output: matrix of vertex representations Φ ∈ R^{|V|×d}
1: Initialization: Sample Φ from U^{|V|×d}
2: Build a binary Tree T from V
3: for i = 0 to γ do
4:     O = Shuffle(V)
5:     for each v_i ∈ O do
6:         W_{v_i} = RandomWalk(G, v_i, t)
7:         SkipGram(Φ, W_{v_i}, w)
8:     end for
9: end for
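For concreteness, a minimal Python sketch of this outer loop follows, assuming networkx for the graph and gensim's Word2Vec as the SkipGram learner with hierarchical softmax; it is an illustrative sketch, not the reference implementation, and the parameter defaults are ours.

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, t):
    """Truncated random walk of length t rooted at start (RandomWalk in Algorithm 1)."""
    walk = [start]
    while len(walk) < t:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]  # gensim expects string tokens

def deepwalk(G, w=10, d=128, gamma=80, t=40):
    """Outer loop of Algorithm 1: gamma shuffled passes over V, one walk per vertex."""
    walks = []
    for _ in range(gamma):
        nodes = list(G.nodes())
        random.shuffle(nodes)                    # O = Shuffle(V)
        for v in nodes:
            walks.append(random_walk(G, v, t))   # W_v = RandomWalk(G, v, t)
    # SkipGram over all walks, with hierarchical softmax (hs=1, negative=0)
    model = Word2Vec(walks, vector_size=d, window=w,
                     sg=1, hs=1, negative=0, min_count=0, workers=4)
    return model.wv                              # vertex id (str) -> vector in R^d

# Usage: embeddings = deepwalk(nx.karate_club_graph())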
4.2.1 SkipGram
SkipGram is a language model that maximizes the co-occurrence probability among the words that appear within a window, w, in a sentence. It approximates the conditional probability in Equation 3 using an independence assumption, as follows:

$$\Pr\left(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\right) = \prod_{\substack{j=i-w \\ j \neq i}}^{i+w} \Pr\left(v_j \mid \Phi(v_i)\right) \qquad (4)$$
Algorithm 2 iterates over all possible collocations in a random walk that appear within the window w (lines 1-2). For each, we map each vertex v_j to its current representation vector Φ(v_j) ∈ R^d (see Figure 3b). Given the representation of v_j, we would like to maximize the probability of its neighbors in the walk (line 3). We can learn such a posterior distribution using several choices of classifiers. For example, modeling the previous problem using logistic regression would result in a huge number of labels (equal to |V|), which could be in the millions or billions. Such models require vast computational resources, which could span a whole cluster of computers [4]. To avoid this necessity and speed up training, we instead use the Hierarchical Softmax [30, 31] to approximate the probability distribution.
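Algorithm 2 itself is not reproduced here; as a rough illustration of the collocation pass its lines 1-2 perform, a small Python sketch (the name context_pairs is ours):

def context_pairs(walk, w):
    """Yield (v_i, v_j) collocations: all ordered pairs within a window of size w."""
    for i, center in enumerate(walk):
        for j in range(max(0, i - w), min(len(walk), i + w + 1)):
            if j != i:
                yield center, walk[j]  # maximize Pr(walk[j] | Phi(center))

# e.g. list(context_pairs(['v1', 'v2', 'v3', 'v4'], w=1))
# -> [('v1','v2'), ('v2','v1'), ('v2','v3'), ('v3','v2'), ('v3','v4'), ('v4','v3')]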
4.2.2 Hierarchical Softmax
If the path from the root of the binary tree to vertex u_k is identified by a sequence of tree nodes (b_0, b_1, ..., b_{⌈log |V|⌉}), with b_0 the root and b_{⌈log |V|⌉} = u_k, then

$$\Pr\left(u_k \mid \Phi(v_j)\right) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr\left(b_l \mid \Phi(v_j)\right) \qquad (5)$$

where Pr(b_l | Φ(v_j)) is modeled by a binary classifier assigned to the parent of node b_l:

$$\Pr\left(b_l \mid \Phi(v_j)\right) = \frac{1}{1 + e^{-\Phi(v_j) \cdot \Psi(b_l)}} \qquad (6)$$

where Ψ(b_l) ∈ R^d is the representation assigned to tree node b_l's parent.
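To make Equations 5 and 6 concrete, a minimal NumPy sketch of the path probability computation; the names path_psi (the Ψ(b_l) vectors) and path_codes (the left/right branch taken at each node) are our own illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(phi_v, path_psi, path_codes):
    """Pr(u_k | Phi(v_j)) under hierarchical softmax (Equations 5-6).

    phi_v      : (d,)   embedding Phi(v_j) of the input vertex
    path_psi   : (L, d) vectors Psi(b_l) for the nodes on the root-to-u_k path
    path_codes : (L,)   0/1 branch taken at each node on the way to leaf u_k
    """
    codes = np.asarray(path_codes)
    p_right = sigmoid(path_psi @ phi_v)                   # Equation 6 at every node
    per_node = np.where(codes == 1, p_right, 1.0 - p_right)
    return float(np.prod(per_node))                       # Equation 5: product over the path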
Figure 3: Overview of DeepWalk. We slide a window of length 2w + 1 over the random walk W_{v_4}, mapping the central vertex v_1 to its representation Φ(v_1). Hierarchical Softmax factors out Pr(v_3 | Φ(v_1)) and Pr(v_5 | Φ(v_1)) over sequences of probability distributions corresponding to the paths starting at the root and ending at v_3 and v_5. The representation Φ is updated to maximize the probability of v_1 co-occurring with its context {v_3, v_5}.
4.2.3 Optimization

4.3 Parallelizability

[Figure 4: Effects of parallelizing DeepWalk on BlogCatalog and Flickr. Both panels plot against # of workers; panel (b) Performance shows the change in Micro-F1 staying within ±0.02.]

4.4 Algorithm Variants

4.4.1 Streaming
One interesting variant of this method is a streaming approach, which could be implemented without knowledge of the entire graph. In this variant, small walks from the graph are passed directly to the representation learning code, and the model is updated directly. Some modifications to the learning process will also be necessary. First, using a decaying learning rate may no longer be desirable, as it assumes knowledge of the total corpus size. Instead, we can initialize the learning rate to a small constant value.
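A sketch of what this streaming variant might look like with gensim's incremental API; walk_stream() is a hypothetical generator yielding batches of walks, and the constant learning rate comes from setting min_alpha equal to alpha.

from gensim.models import Word2Vec

model = Word2Vec(vector_size=128, window=10, sg=1, hs=1, negative=0, min_count=0,
                 alpha=0.025, min_alpha=0.025)  # constant, non-decaying learning rate

first_batch = True
for batch in walk_stream():                          # hypothetical source of short walks
    model.build_vocab(batch, update=not first_batch) # admit newly observed vertices
    model.train(batch, total_examples=len(batch), epochs=1)
    first_batch = False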
4.4.2 Non-random walks
Some graphs are created as a by-product of agents interacting with a sequence of elements (e.g., a user's navigation of pages on a website). When a graph is created by such a
stream of non-random walks, we can use this process to feed
the modeling phase directly. Graphs sampled in this way will
not only capture information related to network structure,
but also to the frequency at which paths are traversed.
In our view, this variant also encompasses language modeling. Sentences can be viewed as purposed walks through
an appropriately designed language network, and language
models like SkipGram are designed to capture this behavior.
This approach can be combined with the streaming variant
(Section 4.4.1) to train features on a continually evolving
network without ever explicitly constructing the entire graph.
Maintaining representations with this technique could enable
web-scale classification without the hassles of dealing with a
web-scale graph.
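As a toy illustration, observed navigation sessions can be fed to the modeling phase verbatim in place of generated walks; the page paths below are invented.

from gensim.models import Word2Vec

# Each session is one non-random "walk" through the page graph, used as-is.
sessions = [
    ["/home", "/products", "/products/widgets", "/cart"],
    ["/home", "/blog", "/blog/deepwalk", "/home"],
]
model = Word2Vec(sessions, vector_size=64, window=2, sg=1, hs=1, negative=0, min_count=0)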
Table 1: Graphs used in our experiments.

Name         |V|        |E|        |Y|   Labels
BlogCatalog  10,312     333,983    39    Interests
Flickr       80,513     5,899,882  195   Groups
YouTube      1,138,499  2,990,443  47    Groups
5. EXPERIMENTAL DESIGN
5.1 Datasets
5.2 Baseline Methods
http://bit.ly/deepwalk
6. EXPERIMENTS
6.1 Multi-Label Classification
6.1.1 BlogCatalog
6.1.2 Flickr
6.1.3 YouTube
Table 2: Multi-label classification results in BlogCatalog.

% Labeled Nodes        10%    20%    30%    40%    50%    60%    70%    80%    90%
Micro-F1(%)
  DeepWalk             36.00  38.20  39.60  40.30  41.00  41.30  41.50  41.50  42.00
  SpectralClustering   31.06  34.95  37.27  38.93  39.97  40.99  41.66  42.42  42.62
  EdgeCluster          27.94  30.76  31.85  32.99  34.12  35.00  34.63  35.99  36.29
  Modularity           27.35  30.74  31.77  32.97  34.09  36.13  36.08  37.23  38.18
  wvRN                 19.51  24.34  25.62  28.82  30.37  31.81  32.19  33.33  34.28
  Majority             16.51  16.66  16.61  16.70  16.91  16.99  16.92  16.49  17.26
Macro-F1(%)
  DeepWalk             21.30  23.80  25.30  26.30  27.30  27.60  27.90  28.20  28.90
  SpectralClustering   19.14  23.57  25.97  27.46  28.31  29.46  30.13  31.38  31.78
  EdgeCluster          16.16  19.16  20.48  22.00  23.00  23.64  23.82  24.61  24.92
  Modularity           17.36  20.00  20.80  21.85  22.65  23.41  23.89  24.20  24.97
  wvRN                  6.25  10.13  11.64  14.24  15.86  17.18  17.98  18.86  19.57
  Majority              2.52   2.55   2.52   2.58   2.58   2.63   2.61   2.48   2.62
Table 3: Multi-label classification results in Flickr.

% Labeled Nodes        1%     2%     3%     4%     5%     6%     7%     8%     9%     10%
Micro-F1(%)
  DeepWalk             32.4   34.6   35.9   36.7   37.2   37.7   38.1   38.3   38.5   38.7
  SpectralClustering   27.43  30.11  31.63  32.69  33.31  33.95  34.46  34.81  35.14  35.41
  EdgeCluster          25.75  28.53  29.14  30.31  30.85  31.53  31.75  31.76  32.19  32.84
  Modularity           22.75  25.29  27.3   27.6   28.05  29.33  29.43  28.89  29.17  29.2
  wvRN                 17.7   14.43  15.72  20.97  19.83  19.42  19.22  21.25  22.51  22.73
  Majority             16.34  16.31  16.34  16.46  16.65  16.44  16.38  16.62  16.67  16.71
Macro-F1(%)
  DeepWalk             14.0   17.3   19.6   21.1   22.1   22.9   23.6   24.1   24.6   25.0
  SpectralClustering   13.84  17.49  19.44  20.75  21.60  22.36  23.01  23.36  23.82  24.05
  EdgeCluster          10.52  14.10  15.91  16.72  18.01  18.54  19.54  20.18  20.78  20.85
  Modularity           10.21  13.37  15.24  15.11  16.14  16.64  17.02  17.1   17.14  17.12
  wvRN                  1.53   2.46   2.91   3.47   4.95   5.56   5.82   6.59   8.00   7.26
  Majority              0.45   0.44   0.45   0.46   0.47   0.44   0.45   0.47   0.47   0.47
Table 4: Multi-label classification results in YouTube. A dash indicates a method that could not be run on a graph of this size.

% Labeled Nodes        1%     2%     3%     4%     5%     6%     7%     8%     9%     10%
Micro-F1(%)
  DeepWalk             37.95  39.28  40.08  40.78  41.32  41.72  42.12  42.48  42.78  43.05
  SpectralClustering   —      —      —      —      —      —      —      —      —      —
  EdgeCluster          23.90  31.68  35.53  36.76  37.81  38.63  38.94  39.46  39.92  40.07
  Modularity           —      —      —      —      —      —      —      —      —      —
  wvRN                 26.79  29.18  33.1   32.88  35.76  37.38  38.21  37.75  38.68  39.42
  Majority             24.90  24.84  25.25  25.23  25.22  25.33  25.31  25.34  25.38  25.38
Macro-F1(%)
  DeepWalk             29.22  31.83  33.06  33.90  34.35  34.66  34.96  35.22  35.42  35.67
  SpectralClustering   —      —      —      —      —      —      —      —      —      —
  EdgeCluster          19.48  25.01  28.15  29.17  29.82  30.65  30.75  31.23  31.45  31.54
  Modularity           —      —      —      —      —      —      —      —      —      —
  wvRN                 13.15  15.78  19.66  20.9   23.31  25.43  27.08  26.48  28.33  28.89
  Majority              6.12   5.86   6.21   6.1    6.07   6.19   6.17   6.16   6.18   6.19
[Figure 5: Parameter sensitivity study. Panels plot Micro-F1 on Flickr and BlogCatalog while varying the number of latent dimensions d ∈ {16, 32, 64, 128, 256} and the number of walks per vertex γ ∈ {1, 3, 10, 30, 50, 90}, for several training ratios (Flickr: 0.01-0.09; BlogCatalog: 0.1-0.9). Panel titles include (a1) Flickr, γ = 30 and (a3) BlogCatalog, γ = 30.]
6.2 Parameter Sensitivity
In order to evaluate how changes to the parameterization of DeepWalk affect its performance on classification tasks, we conducted experiments on two multi-label classification tasks (Flickr and BlogCatalog). In the interest of brevity, we have fixed the window size and the walk length to emphasize local structure (w = 10, t = 40). We then vary the number of latent dimensions (d), the number of walks started per vertex (γ), and the amount of training data available (T_R) to determine their impact on the network classification performance.
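A sketch of such a sweep, reusing the deepwalk() sketch from Section 4.2; evaluate_micro_f1 stands in for the one-vs-rest classification protocol and is an assumed helper, not shown.

for d in [16, 32, 64, 128, 256]:
    for gamma in [1, 3, 10, 30, 50, 90]:
        emb = deepwalk(G, w=10, d=d, gamma=gamma, t=40)
        score = evaluate_micro_f1(emb, labels, train_ratio=0.05)  # assumed helper
        print(f"d={d:3d}  gamma={gamma:2d}  Micro-F1={score:.3f}")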
6.2.1 Effect of Dimensionality
6.2.2 Effect of Sampling Frequency
7. RELATED WORK
7.1 Relational Learning
Relational classification (or collective classification) methods [15, 25, 32] use links between data items as part of the classification process. Exact inference in the collective classification problem is NP-hard, and solutions have focused on the use of approximate inference algorithms which may not be guaranteed to converge [37].
The relational classification algorithms most relevant to our work incorporate community information by learning clusters [33], by adding edges between nearby nodes [14], by using PageRank [24], or by extending relational classification to take additional features into account [43]. Our work takes a substantially different approach. Instead of a new approximate inference algorithm, we propose a procedure which learns representations of network structure which can then be used by existing inference procedures (including iterative ones).
A number of techniques for generating features from graphs have also been proposed [13, 17, 39-41]. In contrast to these methods, we frame the feature creation procedure as a representation learning problem.
Graph Kernels [42] have been proposed as a way to use relational data as part of the classification process, but they are quite slow unless approximated [20]. Our approach is complementary; instead of encoding the structure as part of a kernel function, we learn a representation that can be used directly as features for any classification method.
7.2 Unsupervised Feature Learning
8. CONCLUSIONS
Acknowledgements
The authors thank the reviewers for their helpful comments. This
research was partially supported by NSF Grants DBI-1060572 and
IIS-1017181, and a Google Faculty Research Award.
9. REFERENCES