
Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs


Ethan R. Elenberg, Karthikeyan Shanmugam,
Michael Borokhovich, Alexandros G. Dimakis
University of Texas, Austin, USA

August 12, 2015

E. R. Elenberg

Beyond Triangles

1/20

Introduction

Perform analytics on large graphs


- World Wide Web, social networks, bioinformatics
More descriptive than triangle count, clustering coefficient
Scalable, distributed algorithms


3-profile

Count the induced subgraphs formed by selecting all triples of vertices

[Figure: the four possible induced subgraphs on three vertices: H0 (empty), H1 (single edge), H2 (wedge), H3 (triangle)]

Definition
Let $n_i$ be the number of $H_i$'s in a graph G. The vector
$n(G) = [n_0, n_1, n_2, n_3]$ is called the 3-profile of G.

- Always sums to $\binom{|V|}{3}$, the total number of 3-subgraphs
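
The definition can be checked on small graphs with a brute-force count. The following is a minimal sketch, not the distributed method from the talk; the function name `three_profile` and the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def three_profile(num_vertices, edges):
    """Count induced 3-vertex subgraphs by number of edges (H0..H3)."""
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for triple in combinations(range(num_vertices), 3):
        # Number of edges among the three vertices selects H0, H1, H2, or H3.
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        profile[k] += 1
    return profile

k4_edges = list(combinations(range(4), 2))       # 4-clique
c5_edges = [(i, (i + 1) % 5) for i in range(5)]  # 5-cycle

print(three_profile(4, k4_edges))  # [0, 0, 0, 4]
print(three_profile(5, c5_edges))  # [0, 5, 5, 0]
```

The loop visits every triple, so this is only practical for small graphs; it is meant as a reference against which faster methods can be compared.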


Examples

4-clique: $n(K_4) = [0, 0, 0, 4]$ (all 4 triples are triangles)


Examples

5-cycle: $n(C_5) = [0, 5, 5, 0]$ (no empty triples, 5 single-edge triples, 5 wedges, no triangles)


Related Terms

For each $v \in V$:
Definition
The local 3-profile counts how many times v participates in each
$H_i$ with 2 other vertices.
Definition
The ego 3-profile is the 3-profile of the ego graph $N(v)$.
- Graph induced by the set of neighbors $\Gamma(v)$
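
Both definitions can be illustrated with a brute-force sketch on small graphs. The helper names below (`build_adjacency`, `local_three_profile`, `ego_three_profile`) are hypothetical, and the ego graph is taken to be the graph induced by the neighbors of v only, as stated above.

```python
from itertools import combinations

def build_adjacency(num_vertices, edges):
    """Return a dict mapping each vertex to its set of neighbors."""
    adj = {v: set() for v in range(num_vertices)}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj

def local_three_profile(adj, v):
    """Count the triples containing v by the number of edges they induce."""
    profile = [0, 0, 0, 0]
    others = [u for u in adj if u != v]
    for a, b in combinations(others, 2):
        k = (a in adj[v]) + (b in adj[v]) + (b in adj[a])
        profile[k] += 1
    return profile

def ego_three_profile(adj, v):
    """3-profile of the graph induced by the neighbors of v."""
    profile = [0, 0, 0, 0]
    for a, b, c in combinations(sorted(adj[v]), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        profile[k] += 1
    return profile

k4 = build_adjacency(4, list(combinations(range(4), 2)))
c5 = build_adjacency(5, [(i, (i + 1) % 5) for i in range(5)])

print(local_three_profile(k4, 0), ego_three_profile(k4, 0))
print(local_three_profile(c5, 0), ego_three_profile(c5, 0))
```

In the 5-cycle every vertex has only two neighbors, so its ego graph contains no triples at all.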


Motivation

Global 3-profile concisely describes local connectivity


- Molecule classification
Local and ego 3-profiles are feature vectors for each vertex
- Spam detection
- Generative models


Introduction

Problem: Compute (or approximate) 3-profile quantities for a large graph

Approach: Edge sub-sampling and distributed implementation


Contributions

Derive a 3-profile sparsifier with provable guarantees

Design distributed, graph engine algorithms to calculate local and ego 3-profiles

Evaluate performance on real-world datasets


Related Work

Well studied across several communities:


Graph sub-sampling

[Kim, Vu 00] [Tsourakakis, et al. 08-11] [Ahmed, et al. 14]

Large-scale triangle counting

[Satish, et al. 14] [Shank 07] [Suri, Vassilvitskii 11]

Subgraph counting

[Alon, et al. 97] [Kloks, et al. 00] [Kowaluk, et al. 13]

Graphlets

[Przulj 07] [Shervashidze, et al. 09]


Outline

Introduction

3-profile Sparsifier
Edge Sub-sampling Process
Concentration Bound

3-PROF Algorithm

Experiments

Conclusions


Edge Sub-sampling Process

Sub-sample each edge in the graph independently with probability p

Relate the original and sub-sampled graphs via a 1-step Markov chain
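
The sub-sampling step can be sketched in a few lines. This is an illustrative, single-machine version; the function name `subsample_edges` and the `seed` parameter are assumptions, not part of the talk.

```python
import random

def subsample_edges(edges, p, seed=None):
    """Keep each edge independently with probability p."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so p = 1 keeps every edge
    # and p = 0 keeps none.
    return [e for e in edges if rng.random() < p]

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
sampled = subsample_edges(edges, p=0.5, seed=7)
print(len(sampled), "of", len(edges), "edges kept")
```

Each edge is decided independently of the others, which is what lets the fate of every 3-vertex subgraph be described by a one-step transition.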


Edge Sub-sampling Process

[Figure: 1-step Markov chain from each original subgraph to each sub-sampled subgraph, with transition probabilities:
- triangle to triangle $p^3$, to wedge $3p^2(1-p)$, to edge $3p(1-p)^2$, to empty $(1-p)^3$
- wedge to wedge $p^2$, to edge $2p(1-p)$, to empty $(1-p)^2$
- edge to edge $p$, to empty $1-p$
- empty to empty $1$]

$$
\text{Estimator} =
\begin{bmatrix}
1 & 1-p & (1-p)^2 & (1-p)^3 \\
0 & p & 2p(1-p) & 3p(1-p)^2 \\
0 & 0 & p^2 & 3p^2(1-p) \\
0 & 0 & 0 & p^3
\end{bmatrix}^{-1}
\cdot \text{Sub-sampled}
$$
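
Because each edge survives independently with probability p, the expected sub-sampled 3-profile equals the transition matrix applied to the original 3-profile, so an estimate is obtained by solving that triangular system. The sketch below is a minimal pure-Python illustration under that reading; the names `transition_matrix` and `estimate_profile` are hypothetical, and p must be greater than 0.

```python
from math import comb

def transition_matrix(p):
    """M[j][i]: probability that an original H_i (i edges) keeps exactly j edges."""
    q = 1.0 - p
    return [[comb(i, j) * p**j * q**(i - j) if j <= i else 0.0
             for i in range(4)]
            for j in range(4)]

def estimate_profile(sub_profile, p):
    """Solve M x = sub_profile by back substitution (M is upper triangular)."""
    M = transition_matrix(p)
    x = [0.0] * 4
    for j in range(3, -1, -1):
        tail = sum(M[j][i] * x[i] for i in range(j + 1, 4))
        x[j] = (sub_profile[j] - tail) / M[j][j]  # M[j][j] = p**j, needs p > 0
    return x

# Round trip on the 5-cycle profile: take the expected sub-sampled profile,
# then recover the original counts from it.
original = [0, 5, 5, 0]
M = transition_matrix(0.5)
expected_sub = [sum(M[j][i] * original[i] for i in range(4)) for j in range(4)]
print(estimate_profile(expected_sub, 0.5))
```

On a single random sub-sample the recovered counts fluctuate around the true values; the round trip above only confirms that the estimator inverts the expected transition.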


Main Result
Theorem (3-profile sparsifiers)
For all $(\epsilon, p)$-balanced graphs, the $\ell_\infty$-norm of the 3-profile
sparsifier error is bounded by $\epsilon \binom{|V|}{3}$ with high probability.
Definition
A graph is $(\epsilon, p)$-balanced if the majority of triangles, wedges,
or single-edges do not depend on one common edge.
Proof Sketch:
- Apply multivariate polynomial concentration inequalities [Kim,
Vu 00] to each estimator
$f(G, p) = e_1 e_2 e_4 + e_4 e_5 e_6 + \ldots$



3-PROF
Vertex program in the Gather-Apply-Scatter framework

1. For each vertex v: Gather and Apply vertex IDs to store $\Gamma(v)$

2. For each edge va: Scatter

$n_{3,va} = |\Gamma(v) \cap \Gamma(a)|, \quad n^c_{2,va} = |\Gamma(v)| - |\Gamma(v) \cap \Gamma(a)| - 1, \quad \ldots$

3. For each vertex v: Gather and Apply

$n_{3,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n_{3,va}, \quad n^c_{2,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n^c_{2,va}, \quad \ldots$
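
The three phases can be mimicked sequentially to see what each one computes. This is a single-machine Python sketch, not GraphLab PowerGraph code; it covers only the triangle count and the count of wedges centered at v, the two quantities spelled out on the slide, and the function name `edge_pivot_counts` is hypothetical.

```python
def edge_pivot_counts(num_vertices, edges):
    """Per-vertex triangle count and count of wedges centered at the vertex."""
    # Phase 1 (Gather/Apply): every vertex stores its neighbor set.
    gamma = {v: set() for v in range(num_vertices)}
    for u, w in edges:
        gamma[u].add(w)
        gamma[w].add(u)

    # Phase 2 (Scatter): per-edge quantities, computed from both endpoints.
    n3_edge, n2c_edge = {}, {}
    for v in gamma:
        for a in gamma[v]:
            common = len(gamma[v] & gamma[a])
            n3_edge[(v, a)] = common                        # triangles on edge va
            n2c_edge[(v, a)] = len(gamma[v]) - common - 1   # wedges at v using va

    # Phase 3 (Gather/Apply): sum over incident edges. Each triangle and each
    # wedge centered at v is seen from two of v's edges, hence the halving.
    n3 = {v: sum(n3_edge[(v, a)] for a in gamma[v]) // 2 for v in gamma}
    n2c = {v: sum(n2c_edge[(v, a)] for a in gamma[v]) // 2 for v in gamma}
    return n3, n2c

c5_edges = [(i, (i + 1) % 5) for i in range(5)]
star_edges = [(0, 1), (0, 2), (0, 3)]
print(edge_pivot_counts(5, c5_edges))
print(edge_pivot_counts(4, star_edges))
```

Only neighbor-set intersections are needed per edge, which is the same primitive that triangle counting uses.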



Implementation
GraphLab PowerGraph v2.2
Multicore server
- 256 GB RAM, 72 logical cores
EC2 cluster (Amazon Web Services)
- 20 c3.8xlarge, 60 GB RAM, 32 logical cores each

Datasets

Name          Vertices      Edges (undirected)
Twitter       41,652,230    1,202,513,046
PLD           39,497,204      582,567,291
LiveJournal    4,846,609       42,851,237
Wikipedia      3,515,067       42,375,912
DBLP             317,080        1,049,866

Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: PLD, Accuracy, 3-profiles. Accuracy [exact/approx] (axis from 0.985 to 1.015) of the triangles, wedges, edge, and empty counts for p = 0.7, 0.4, 0.1, 0.01]

Results: Multicore, 3 runs

Compare 3-PROF to GraphLab's default triangle count

[Figure: Twitter and PLD, Multicore (p=1). Running time [sec] (axis from 100 to 600) of 3-prof and Trian on Twitter and PLD]

Results: AWS, 5 runs

Compare EGO-PAR to naive, serial algorithm (EGO-SER)

[Figure: LiveJournal, AWS c3.8xlarge. Running time [sec] (log scale) of Ego-ser and Ego-par on 12 nodes for 100, 1K, and 10K egos, with annotations ">1000 sec" and ">10000 sec"]

Results: AWS, 5 runs

[Figure: LiveJournal, AWS c3.8xlarge. Running time [sec] (axis from 10 to 14) of Ego-par on 12, 16, and 20 nodes for 10K egos]


Summary

Edge sub-sampling produces fast, accurate 3-profile estimates

3-profile counting consumes roughly the same resources as triangle counting

Distributed algorithms scale well over large data and large computing clusters
github.com/eelenberg/3-profiles


(Backup) Edge Pivot Equations

$\sum_{a \in \Gamma(v)} \binom{n_{3,va}}{2} = F_2(v) + 3F_3(v)$

(Backup) Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: Twitter, Accuracy, 3-profiles. Accuracy [exact/approx] (axis from 0.996 to 1.004) of the triangles, wedges, edge, and empty counts for p = 0.7, 0.5, 0.3, 0.1]

(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure, two panels: LiveJournal Running Time and PLD Running Time, AWS c3.8xlarge. Running time [sec] of 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes]

(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure, two panels: LiveJournal Network Usage and PLD Network Usage, AWS c3.8xlarge. Network sent [bytes] of 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes]

(Backup) Results: AWS, 5 runs

[Figure, two panels: EGO-SER and EGO-PAR on LiveJournal, AWS c3.8xlarge. Running time [sec] on 12, 16, and 20 nodes for 100 egos]
