On Sampling from Massive Graph Streams

Joint work with:
Nick Duffield – Texas A&M University
Ted Willke – Intel Labs
Ryan Rossi – PARC research
VLDB’17, Germany
August 31st, 2017
Nesreen K. Ahmed
Research Scientist, Intel Labs

Social network
Human Disease Network
[Barabasi 2007]
Food Web [2007]
Terrorist Network
[Krebs 2002]
Internet (AS) [2005]
Gene Regulatory Network
[Decourty 2008]
Protein Interactions
[breast cancer]
Political blogs
Power grid

Graph Mining: Social Network, Internet (AS), Biological, Political Blogs

Studying and analyzing complex networks
is a challenging and computationally intensive task
Ø Today’s networks are dynamic/streaming over time
-‐ e.g., Twitter streams, email communications
Ø Today’s networks are massive in size
-‐ e.g., online social networks have billions of users

Due to these challenges, we usually need to sample
Statistical sampling: Graph G → Sample S (e.g., uniform random sampling)

Given a large graph G represented as a stream of edges e1, e2, e3, …
We show how to efficiently sample from G using limited memory
and calculate unbiased estimates of various graph properties

Frequent connected subsets of edges
No. Triangles, No. Wedges, Transitivity

§ Random Sampling
• Uniform random sampling – [Tsourakakis et al. KDD’09]
— Graph Sparsification with probability p
— Chance of sampling a subgraph (e.g., triangle) is very low
— Estimates suffer from high variance
• Wedge Sampling – [Seshadhri et al. SDM’13]
— Sample vertices, then sample pairs of incident edges (wedges)
— Output the estimate of the closed wedges (triangles)

These methods assume access to the full graph
Not a good fit for massive streaming graphs
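
To make the sparsification idea concrete, here is a minimal Python sketch: keep each edge independently with probability p, count triangles in the sparsified graph, and scale by 1/p^3, since a triangle survives only if all three of its edges do. The function names and the in-memory edge list are assumptions made for illustration, not code from the cited papers. The sketch needs the whole edge list up front, which is exactly the limitation noted above.

```python
import random

def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges, hence // 3.
    return sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3

def sparsified_triangle_estimate(edges, p, rng):
    """Keep each edge with probability p, then rescale the count by 1 / p**3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

With small p, few triangles keep all three edges, so the estimate swings widely between runs; this is the high variance mentioned in the slide.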

§ Assume specific order of the stream – [Buriol et al. 2006]
• Incidence stream model – vertex neighbors arrive together in the stream
§ Use multiple passes over the stream – [Becchetti et al. KDD’08]
§ Single-pass algorithms for arbitrary-ordered graph streams

§ Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles – [Jha et al. KDD’13]
— Sample edges using reservoir sampling, then sample pairs of incident
edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling – [Pavan et al. VLDB’13]
— Sampling vectors of wedge estimators, scan the stream for closed wedges
(triangles)
• TRIEST – [De Stefani et al. KDD’16]
— Uses standard reservoir sampling to maintain the edge sample
• MASCOT – [Lim et al. KDD’15]
— Independent edge sampling with probability p
• Graph Sample & Hold – [Ahmed et al. KDD’14]
— Conditionally independent edge sampling
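
Several of the methods above rely on reservoir sampling to keep a fixed-size uniform sample of the edge stream. As a reference point, here is a sketch of the textbook version (Algorithm R). It is not the code of any cited paper, and the name `reservoir_sample` and its parameters are chosen here for illustration.

```python
import random

def reservoir_sample(stream, m, rng):
    """Keep a uniform random sample of m items from a stream of unknown length."""
    sample = []
    for t, edge in enumerate(stream, start=1):
        if t <= m:
            sample.append(edge)       # fill the reservoir first
        else:
            j = rng.randrange(t)      # uniform integer in [0, t - 1]
            if j < m:                 # happens with probability m / t
                sample[j] = edge      # overwrite a uniformly chosen slot
    return sample
```

Every edge ends up in the sample with the same probability, regardless of how many triangles or wedges it participates in.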

Summary of previous work
Sampling designs for specific graph properties (triangles)
Not generally applicable to other properties
Uniform-based Sampling
Obtain variable-size sample
We propose a generic unbiased sampling framework: Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable for general graph properties
• Use topological information that we wish to estimate as auxiliary variables
• Variance-optimal sampling (cost optimization approach)

Graph Priority Sampling Framework
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
GPS(m)
Output: sampled edge stream $\hat{K}$
Stored state $m = O(|\hat{K}|)$

GPS(m): for each edge $k$ in the edge stream $k_1, k_2, \ldots, k, \ldots$
Generate a random number $u(k) \sim \mathrm{Uni}(0, 1]$
Compute edge weight $w(k) = W(k, \hat{K})$
Compute edge priority $r(k) = w(k)/u(k)$
Add the edge to the sample $\hat{K} = \hat{K} \cup \{k\}$

GPS(m), continued: for each edge $k$ in the edge stream $k_1, k_2, \ldots, k, \ldots$
Find edge with lowest priority $k^* = \arg\min_{k' \in \hat{K}} r(k')$
Update sample threshold $z^* = \max\{z^*, r(k^*)\}$
Remove lowest priority edge $\hat{K} = \hat{K} \setminus \{k^*\}$
Use a priority queue with O(log m) updates
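
The per-edge steps of GPS(m) can be sketched in Python with the standard `heapq` module as the priority queue. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `gps_sample`, the `weight_fn(edge, sample)` callback, and the dictionary holding the reservoir are hypothetical. It also assumes each edge appears at most once in the stream and that eviction starts only once the reservoir holds more than m edges.

```python
import heapq
import random

def gps_sample(stream, m, weight_fn, rng):
    """One pass of priority sampling; returns ({edge: weight}, threshold z*)."""
    heap = []        # min-heap of (priority, arrival index, edge)
    sample = {}      # reservoir K-hat: edge -> weight w(k)
    z_star = 0.0     # running threshold z*
    for idx, edge in enumerate(stream):
        u = 1.0 - rng.random()               # uniform on (0, 1]
        w = weight_fn(edge, sample)          # w(k) = W(k, K-hat)
        heapq.heappush(heap, (w / u, idx, edge))   # r(k) = w(k) / u(k)
        sample[edge] = w
        if len(sample) > m:                  # assumed capacity check
            r_min, _, evicted = heapq.heappop(heap)  # lowest-priority edge k*
            z_star = max(z_star, r_min)      # z* = max{z*, r(k*)}
            del sample[evicted]
    return sample, z_star
```

For example, `gps_sample(edges, 1000, lambda e, s: 1.0, random.Random(0))` runs the sampler with uniform weights. The arrival index in each heap entry only breaks ties, so edges themselves are never compared.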

§ We use edge weights to express the role of the arriving
edge in the sampled graph
$w(k) = W(k, \hat{K})$
• e.g., no. subgraphs completed by the arriving edge, and/or other
auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority

Edge Estimation
For each edge $i$,
we construct a sequence of edge estimators $\hat{S}_{i,t}$
$\hat{S}_{i,t} = I(i \in \hat{K}_t) / \min\{1, w_i/z^*\}$
where $\hat{S}_{i,t}$ are unbiased estimators of the corresponding edge indicators $S_{i,t}$
and $\hat{K}_t$ is the sample at time $t$
We achieve unbiasedness by
establishing that the sequence is a Martingale (Theorem 1)
$E[\hat{S}_{i,t}] = S_{i,t}$

Subgraph Estimation
For each subgraph $J \subset [t]$,
we define the sequence of subgraph estimators as
$\hat{S}_{J,t} = \prod_{i \in J} \hat{S}_{i,t}$
We prove the sequence is a Martingale (Theorem 2)
$E[\hat{S}_{J,t}] = S_{J,t}$
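
A small Python sketch of the edge and subgraph estimators may help. A sampled edge contributes the inverse of min{1, w/z*}, an unsampled edge contributes 0, and a subgraph estimator is the product over its edges. The function names and the `sample` dictionary (edge to weight) are hypothetical, and treating z* = 0 as inclusion probability 1 (nothing evicted yet) is an assumption of this sketch.

```python
def inclusion_prob(w, z_star):
    """min{1, w / z*}; taken to be 1 while no edge has been evicted (z* = 0)."""
    return 1.0 if z_star <= 0 else min(1.0, w / z_star)

def edge_estimator(edge, sample, z_star):
    """I(edge in sample) / min{1, w / z*}."""
    if edge not in sample:
        return 0.0
    return 1.0 / inclusion_prob(sample[edge], z_star)

def subgraph_estimator(edges, sample, z_star):
    """Product of the edge estimators over the edges of the subgraph."""
    estimate = 1.0
    for edge in edges:
        estimate *= edge_estimator(edge, sample, z_star)
    return estimate
```

For instance, with `sample = {(1, 2): 2.0, (2, 3): 8.0}` and `z_star = 4.0`, edge (1, 2) has estimator 1/0.5 = 2 and edge (2, 3) has estimator 1, so the wedge made of both edges is estimated as 2.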

Subgraph Counting
For any set $\mathcal{J}$ of subgraphs of $G$,
$\hat{N}_t(\mathcal{J}) = \sum_{J \in \mathcal{J} : J \subset K_t} \hat{S}_{J,t}$
is an unbiased estimator of $N_t(\mathcal{J}) = |\mathcal{J}_t|$
(Theorem 2)
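
As an illustration of the counting estimator for the case where the set of subgraphs is all triangles, the sketch below sums the product estimator over every triangle whose three edges are in the sample. It assumes, for this example only, that the sample is a dictionary mapping edges stored as sorted pairs (u, v) with u < v to their weights; the name `estimate_triangles` is hypothetical.

```python
def estimate_triangles(sample, z_star):
    """Sum of 1 / (p1 * p2 * p3) over triangles fully contained in the sample."""
    def inverse_prob(a, b):
        w = sample[(a, b) if a < b else (b, a)]
        return 1.0 if z_star <= 0 else 1.0 / min(1.0, w / z_star)

    adj = {}
    for u, v in sample:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0.0
    for u, v in sample:
        for x in adj[u] & adj[v]:
            if x > v:  # with u < v < x each triangle is visited exactly once
                total += inverse_prob(u, v) * inverse_prob(u, x) * inverse_prob(v, x)
    return total
```

Triangles with a missing edge contribute nothing, which matches the indicator in the edge estimator.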

§ We provide a cost minimization approach
• inspired by IPPS sampling [Duffield et al. 2005]
§ By minimizing the conditional variance of the increment
incurred by the arriving edge in $N_t(\mathcal{J})$
How should the ranks $r_{i,t}$ be distributed in order to minimize
the variance of the unbiased estimator of $N_t(\mathcal{J})$?

§ Post-stream Estimation
• enables retrospective subgraph queries
• after any number t of edge arrivals have taken place, we can
compute an unbiased estimator for any subgraph
§ In-stream Estimation
• we can take “snapshots” of estimates of specific sampled subgraphs
at arbitrary times during the stream
• Still unbiased!
• Lightweight online/incremental update of unbiased estimates of
subgraph counts
• Same sampling procedure
• Uses a stopped Martingale

GPS(m) with in-stream estimation: for each edge $k$ in the edge stream $k_1, k_2, \ldots, k, \ldots$
Compute edge priority $r(k) = w(k)/u(k)$
Update the sample
Update unbiased estimates
of subgraph counts
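
A hedged sketch of how the in-stream loop could look for triangle counts. When an edge arrives, every triangle it closes with two sampled edges is added to a running total, weighted by the inverse inclusion probabilities of those two edges at that moment; the sample is then updated as usual. The snapshot rule used here (the arriving edge counts as 1, the two sampled edges as 1/min{1, w/z*}) is this sketch's reading of the snapshot idea and should be treated as an assumption. The name `in_stream_triangles` is hypothetical, and the stream is assumed to have no self-loops or repeated edges.

```python
import heapq
import random

def in_stream_triangles(stream, m, rng):
    """Single pass: update a triangle-count estimate, then update the sample."""
    heap, weight, adj = [], {}, {}   # priority queue, edge -> weight, adjacency
    z_star, estimate = 0.0, 0.0

    def prob(a, b):
        w = weight[(a, b) if a < b else (b, a)]
        return 1.0 if z_star <= 0 else min(1.0, w / z_star)

    for idx, (u, v) in enumerate(stream):
        u, v = min(u, v), max(u, v)
        common = adj.get(u, set()) & adj.get(v, set())
        for x in common:             # snapshot each triangle closed by (u, v)
            estimate += 1.0 / (prob(u, x) * prob(v, x))
        w = 9.0 * len(common) + 1.0  # triangle-based weight
        r = w / (1.0 - rng.random())  # priority with u(k) in (0, 1]
        heapq.heappush(heap, (r, idx, (u, v)))
        weight[(u, v)] = w
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        if len(weight) > m:          # evict the lowest-priority edge
            r_min, _, (a, b) = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            del weight[(a, b)]
            adj[a].discard(b)
            adj[b].discard(a)
    return estimate
```

When m is at least the stream length nothing is evicted and the estimate equals the exact triangle count, which gives a simple sanity check.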

In-stream Estimation
We define a snapshot as an edge subset $J$, with a family of
stopping times $T$ such that $T = \{T_j : j \in J\}$
$\hat{S}^{T}_{J,t} = \prod_{j \in J} \hat{S}^{T_j}_{j,t} = \prod_{j \in J} \hat{S}_{j,\min\{T_j, t\}}$
We prove the sequence is a stopped Martingale (Theorem 4)
$E[\hat{S}^{T}_{J,t}] = S_{J,t}$

§ We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
§ Used a large set of graphs from a variety of domains (social, web,
tech, etc.); data is available on http://networkrepository.com/
— Up to 49B edges
§ Weight function
$W(k, \hat{K}) = 9 \cdot \hat{\triangle}(k) + 1$
where $\hat{\triangle}(k)$ is the number of triangles
completed by edge $k$ whose other edges are in $\hat{K}$
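
The weight function can be computed from the adjacency sets of the sampled graph: the number of triangles an arriving edge (u, v) completes equals the number of neighbors that u and v share in the sample. The sketch below assumes a hypothetical `adj` dictionary mapping each node to the set of its sampled neighbors; the defaults 9 and 1 are taken from the slide.

```python
def triangle_weight(edge, adj, scale=9, base=1):
    """W(k, K-hat) = scale * (triangles completed by k in the sample) + base."""
    u, v = edge
    completed = len(adj.get(u, set()) & adj.get(v, set()))
    return scale * completed + base
```

The constant 1 keeps the weight positive for edges that complete no triangle, so such edges can still be sampled.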

- GPS accurately estimates various properties simultaneously
- Consistent performance across graphs from various domains
- A key advantage of GPS in-stream estimation is its smaller variance and tighter confidence bounds

Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with <0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals

[Figure: soc-twitter-2010, three panels (Global Clustering Coeff, Triangle Count, Wedge Count); x-axis: Sample Size |K|, from 10^4 to 10^5; y-axis: estimated/actual ratio; legend: Actual, Estimated/Actual, Confidence Upper & Lower Bounds]
Sample Size = 40K edges
Accurate estimates for large Twitter graph ~ 265M edges, and 17.2B triangles

[Figure: soc-orkut, three panels (Global Clustering Coeff, Triangle Count, Wedge Count); x-axis: Sample Size |K|, from 10^4 to 10^6; y-axis: estimated/actual ratio; legend: Actual, Estimated/Actual, Confidence Upper & Lower Bounds]
Sample Size = 40K edges
Accurate estimates for large social network Orkut ~ 120M edges, and 630M triangles

[Figure: soc-orkut, two panels (Triangles at time t, Clustering Coeff. at time t); x-axis: Stream Size at time t (|K_t|), from 0 to 12 x 10^7; legend: Actual, Estimate, Upper Bound, Lower Bound]
GPS in-stream estimates over time
Sample size = 80K edges

[Figure: scatter plot with both axes spanning 0.994 to 1.006; graphs shown: ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google]
GPS In-stream Estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts
simultaneously with a single sample

We observe accurate results with no significant difference in error between
the ordering schemes

§ We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for Friendster social network
with sample size=1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
- this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and
6.2x higher using uniform weights compared to triangle-based weights.

§ A sample is representative if graph properties of interest can be
estimated with a known degree of accuracy
§ We proposed a generic framework, Graph Priority Sampling (GPS)
- GPS is an efficient single-pass streaming framework
- GPS selects a representative sample and computes unbiased estimates of
counts of connected subsets of edges (e.g., triangles, wedges, …)
- Theoretical properties of GPS are supported by empirical analysis
§ GPS admits generalizations by allowing the sampling process to
depend on the stored state and/or auxiliary
variables
§ GPS is a variance-minimizing sampling approach
§ GPS has a relative estimation error < 1%
