Joint work with:
Nick Duffield - Texas A&M University
Ted Willke - Intel Labs
Ryan Rossi - PARC research
VLDB'17, Germany
August 31st, 2017
Nesreen K. Ahmed
Research Scientist, Intel Labs
Social network
Human Disease Network [Barabasi 2007]
Food Web [2007]
Terrorist Network [Krebs 2002]
Internet (AS) [2005]
Gene Regulatory Network [Decourty 2008]
Protein Interactions [breast cancer]
Political blogs
Power grid
Graph Mining: Social Network, Internet (AS), Biological, Political Blogs
Studying and analyzing complex networks is a challenging and computationally intensive task
Ø Today’s networks are dynamic/streaming over time
- e.g., Twitter streams, email communications
Ø Today’s networks are massive in size
- e.g., online social networks have billions of users
Due to these challenges, we usually need to sample
Statistical sampling: Graph G -> Sample S (e.g., uniform random sampling)
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
No. Triangles
No. Wedges
Frequent connected subsets of edges
Transitivity
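For reference, the properties named on this slide can be computed exactly when a graph is small enough to fit in memory. The sketch below is only an illustration: the function name `graph_properties` and the edge-list input format are assumptions, and transitivity is taken in its usual sense of 3 x triangles / wedges.

```python
def graph_properties(edges):
    """Exact triangle count, wedge count and transitivity of a small simple graph.

    `edges` is assumed to be a list of distinct undirected edges (a, b).
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # A wedge is a pair of edges sharing a vertex: C(deg, 2) per vertex.
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Each triangle is seen once from each of its three edges.
    triangles = sum(len(adj[a] & adj[b]) for a, b in edges) // 3
    transitivity = 3 * triangles / wedges if wedges else 0.0
    return triangles, wedges, transitivity


k4 = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)]
print(graph_properties(k4))
```

This exact computation needs the whole graph in memory, which is the cost that sampling is meant to avoid.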
§ Random Sampling
• Uniform random sampling - [Tsourakakis et al. KDD’09]
- Graph sparsification with probability p
- Chance of sampling a subgraph (e.g., triangle) is very low
- Estimates suffer from high variance
• Wedge Sampling - [Seshadhri et al. SDM’13]
- Sample vertices, then sample pairs of incident edges (wedges)
- Output the estimate of the closed wedges (triangles)
These methods assume we have access to the full graph
Not a good fit for massive streaming graphs
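To make the high-variance point concrete, here is a minimal sketch of triangle estimation by uniform edge sparsification. Each edge is kept independently with probability p, so a triangle survives with probability p^3 and the sampled count is scaled by 1/p^3. The function names and the in-memory edge list are assumptions made for illustration, not code from the cited work.

```python
import random


def count_triangles(edges):
    """Exact triangle count of a small simple graph given as distinct edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Every triangle is counted once per edge, hence the division by 3.
    return sum(len(adj[a] & adj[b]) for a, b in edges) // 3


def sparsified_triangle_estimate(edges, p, seed=0):
    """Keep each edge with probability p, then rescale the sampled count."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    # A triangle survives only if all three of its edges are kept.
    return count_triangles(kept) / p ** 3
```

With small p, very few triangles survive, so each surviving triangle contributes a large 1/p^3 to the estimate; this is one way to see why the variance is high.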
§ Assume specific order of the stream - [Buriol et al. 2006]
• Incidence stream model - vertex neighbors arrive together in the stream
§ Use multiple passes over the stream - [Becchetti et al. KDD’08]
§ Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles - [Jha et al. KDD’13]
- Sample edges using reservoir sampling, then sample pairs of incident edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling - [Pavan et al. VLDB’13]
- Sampling vectors of wedge estimators, scan the stream for closed wedges (triangles)
• TRIEST - [De Stefani et al. KDD’16]
- Uses standard reservoir sampling to maintain the edge sample
• MASCOT - [Lim et al. KDD’15]
- Independent edge sampling with probability p
• Graph Sample & Hold - [Ahmed et al. KDD’14]
- Conditionally independent edge sampling
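Several of the methods above rely on standard reservoir sampling to keep a fixed-size uniform sample of the stream. The following is a minimal sketch of the textbook procedure only; it is not the code of any of the cited systems, and the name `reservoir_sample` is an assumption.

```python
import random


def reservoir_sample(stream, m, seed=0):
    """Keep a uniform random sample of at most m items from a one-pass stream."""
    rng = random.Random(seed)
    reservoir = []
    for t, edge in enumerate(stream, start=1):
        if len(reservoir) < m:
            # The first m items are always stored.
            reservoir.append(edge)
        elif rng.random() < m / t:
            # Item t replaces a uniformly chosen stored item with probability m/t.
            reservoir[rng.randrange(m)] = edge
    return reservoir
```

Every edge ends up in the reservoir with the same probability regardless of its role in the graph, which is the uniform behaviour that a weight-sensitive scheme is meant to improve on.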
Summary of previous work
• Sampling designs for specific graph properties (triangles): not generally applicable to other properties
• Uniform-based sampling: obtain variable-size sample
We propose a generic unbiased sampling framework: Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable to general graph properties
• Uses topological information that we wish to estimate as auxiliary variables
• Variance-optimal sampling (cost optimization approach)
Graph Priority Sampling Framework: GPS(m)
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge k:
• Generate a random number $u(k) \sim \mathrm{Uni}(0, 1]$
• Compute edge weight $w(k) = W(k, \hat{K})$
• Compute edge priority $r(k) = w(k)/u(k)$
• Add the edge to the sample: $\hat{K} = \hat{K} \cup \{k\}$
Once the sample holds more than m edges:
• Find edge with lowest priority: $k^* = \arg\min_{k' \in \hat{K}} r(k')$
• Update sample threshold: $z^* = \max\{z^*, r(k^*)\}$
• Remove lowest priority edge: $\hat{K} = \hat{K} \setminus \{k^*\}$
Use a priority queue with O(log m) updates
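The sampling loop described on these slides can be sketched in a few lines with a binary min-heap. This is an illustrative sketch rather than the authors' implementation: the function name `gps_sample`, the `weight_fn(edge, sample)` callback signature, the `seed` parameter, and the assumption that the stream contains distinct, comparable edge tuples are all choices made here.

```python
import heapq
import random


def gps_sample(stream, m, weight_fn, seed=0):
    """One pass of priority sampling over an edge stream, keeping at most m edges.

    Returns the sampled edges with their weights and the final threshold z*.
    """
    rng = random.Random(seed)
    heap = []       # min-heap of (priority, edge)
    weights = {}    # sampled edge -> weight assigned on arrival
    z_star = 0.0    # sample threshold; 0 until the first edge is discarded
    for edge in stream:
        u = 1.0 - rng.random()            # uniform on (0, 1]
        w = weight_fn(edge, weights)      # weight may depend on the current sample
        r = w / u                         # edge priority
        heapq.heappush(heap, (r, edge))   # O(log m)
        weights[edge] = w
        if len(heap) > m:
            r_min, dropped = heapq.heappop(heap)   # lowest-priority edge
            z_star = max(z_star, r_min)
            del weights[dropped]
    return weights, z_star


# Hypothetical usage with uniform weights on a path graph.
stream = [(i, i + 1) for i in range(100)]
sample, z = gps_sample(stream, 10, lambda edge, current: 1.0)
print(len(sample), z)
```

The heap gives O(1) access to the lowest priority and O(log m) insertion and removal, matching the cost stated on the slide.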
§ We use edge weights to express the role of the arriving edge in the sampled graph: $w(k) = W(k, \hat{K})$
• e.g., no. subgraphs completed by the arriving edge, and/or other auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority
Edge Estimation
For each edge i, we construct a sequence of edge estimators $\hat{S}_{i,t}$:
$\hat{S}_{i,t} = I(i \in \hat{K}_t) / \min\{1, w_i/z^*\}$
where $\hat{S}_{i,t}$ are unbiased estimators of the corresponding edge indicators $S_{i,t}$, and $\hat{K}_t$ is the sample at time t
We achieve unbiasedness by establishing that the sequence is a martingale (Theorem 1):
$E[\hat{S}_{i,t}] = S_{i,t}$
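As a small worked illustration of the edge estimator: conditional on the threshold $z^*$, an edge with weight $w_i$ is retained when $w_i/u_i > z^*$, which happens with probability $\min\{1, w_i/z^*\}$, so each sampled edge is weighted by the inverse of that probability. The helper names below are assumptions, and treating $z^* = 0$ (no edge discarded yet) as probability 1 is a convention adopted in this sketch.

```python
def inclusion_probability(w, z_star):
    """min{1, w / z*}, with z* = 0 meaning no edge has been discarded yet."""
    if z_star <= 0.0:
        return 1.0
    return min(1.0, w / z_star)


def edge_estimators(weights, z_star):
    """Inverse-probability estimator for every edge currently in the sample."""
    return {edge: 1.0 / inclusion_probability(w, z_star)
            for edge, w in weights.items()}


# Hypothetical sample: edge -> weight, with threshold z* = 2.
estimates = edge_estimators({(1, 2): 4.0, (2, 3): 1.0}, z_star=2.0)
print(estimates)   # {(1, 2): 1.0, (2, 3): 2.0}
```

Summing these values estimates the number of edges seen so far; here the heavy edge counts once and the light edge stands in for two.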
Subgraph Estimation
For each subgraph $J \subset [t]$, we define the sequence of subgraph estimators as
$\hat{S}_{J,t} = \prod_{i \in J} \hat{S}_{i,t}$
We prove the sequence is a martingale (Theorem 2):
$E[\hat{S}_{J,t}] = S_{J,t}$
Subgraph Counting
For any set $\mathcal{J}$ of subgraphs of G,
$\hat{N}_t(\mathcal{J}) = \sum_{J \in \mathcal{J} : J \subset K_t} \hat{S}_{J,t}$
is an unbiased estimator of $N_t(\mathcal{J}) = |\mathcal{J}_t|$ (Theorem 2)
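For triangles, the counting estimator amounts to enumerating the triangles whose three edges are all in the sample and summing the product of their edge estimators. The sketch below assumes the sample is given as a dictionary from edge to stored weight together with the threshold $z^*$; the function name and this data layout are illustrative assumptions, and vertex labels are assumed to be comparable.

```python
from itertools import combinations


def estimate_triangles(weights, z_star):
    """Post-stream triangle count estimate from sampled edge weights and z*."""
    def q(a, b):
        # Inclusion probability min{1, w / z*} of the sampled edge {a, b}.
        w = weights[(a, b)] if (a, b) in weights else weights[(b, a)]
        return 1.0 if z_star <= 0.0 else min(1.0, w / z_star)

    adj = {}
    for a, b in weights:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    total = 0.0
    for a in adj:
        for b, c in combinations(sorted(adj[a]), 2):
            # Requiring a < b < c counts each sampled triangle exactly once.
            if a < b and c in adj[b]:
                total += 1.0 / (q(a, b) * q(a, c) * q(b, c))
    return total


sample = {(1, 2): 1.0, (2, 3): 1.0, (1, 3): 4.0, (3, 4): 1.0}
print(estimate_triangles(sample, z_star=2.0))   # 4.0
```

In the usage example, the one sampled triangle has edge probabilities 0.5, 0.5 and 1, so it contributes 1 / 0.25 = 4 to the estimate.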
§ We provide a cost minimization approach
• inspired by IPPS sampling [Duffield et al. 2005]
§ By minimizing the conditional variance of the increment incurred by the arriving edge in $N_t(\mathcal{J})$
How should the ranks $r_{i,t}$ be distributed in order to minimize the variance of the unbiased estimator of $N_t(\mathcal{J})$?
§ Post-stream Estimation
• enables retrospective subgraph queries
• after any number t of edge arrivals have taken place, we can compute an unbiased estimator for any subgraph
§ In-stream Estimation
• we can take “snapshots” of estimates of specific sampled subgraphs at arbitrary times during the stream
• Still unbiased!
• Lightweight online/incremental update of unbiased estimates of subgraph counts
• Same sampling procedure
• Using a stopped martingale
Graph Priority Sampling Framework: GPS(m)
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge k:
• Compute edge priority $r(k) = w(k)/u(k)$
• Update the sample
• Update unbiased estimates of subgraph counts
In-stream Estimation
We define a snapshot as an edge subset J, with a family of stopping times T such that $T = \{T_j : j \in J\}$
$\hat{S}^{T}_{J,t} = \prod_{j \in J} \hat{S}^{T_j}_{j,t} = \prod_{j \in J} \hat{S}_{j,\min\{T_j, t\}}$
We prove the sequence is a stopped martingale (Theorem 4):
$E[\hat{S}^{T}_{J,t}] = S_{J,t}$
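One way to read the snapshot idea for triangle counting: when an edge arrives and closes a triangle with two sampled edges, the estimators of those two edges are frozen at that moment and their product is added to a running total. The sketch below makes several assumptions that are not taken from the slides: the snapshot uses the threshold in force before the arriving edge is inserted, the weight is 9 x (triangles completed) + 1, and the function name, `seed` parameter and distinct-edge stream are illustrative.

```python
import heapq
import random


def gps_in_stream_triangles(stream, m, seed=0):
    """Running triangle-count estimate maintained while sampling at most m edges."""
    rng = random.Random(seed)
    heap, weight, adj = [], {}, {}
    z_star, estimate = 0.0, 0.0

    def q(a, b):
        # Inclusion probability of sampled edge {a, b} under the current threshold.
        w = weight[(a, b)] if (a, b) in weight else weight[(b, a)]
        return 1.0 if z_star <= 0.0 else min(1.0, w / z_star)

    for a, b in stream:
        common = adj.get(a, set()) & adj.get(b, set())
        # Snapshot: each triangle closed by the arriving edge is estimated now.
        for c in common:
            estimate += 1.0 / (q(a, c) * q(b, c))
        w = 9.0 * len(common) + 1.0
        r = w / (1.0 - rng.random())              # u is uniform on (0, 1]
        heapq.heappush(heap, (r, (a, b)))
        weight[(a, b)] = w
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
        if len(heap) > m:
            r_min, (x, y) = heapq.heappop(heap)   # evict lowest-priority edge
            z_star = max(z_star, r_min)
            del weight[(x, y)]
            adj[x].discard(y)
            adj[y].discard(x)
    return estimate
```

When m is at least the stream length nothing is ever discarded, every probability is 1, and the running total equals the exact triangle count, which is a convenient sanity check.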
§ We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
§ Used a large set of graphs from a variety of domains (social, web, tech, etc.) - data is available at http://networkrepository.com/
- Up to 49B edges
§ Weight function
$W(k, \hat{K}) = 9 \cdot \hat{\triangle}(k) + 1$
where $\hat{\triangle}(k)$ is the number of triangles completed by edge k whose other edges are in $\hat{K}$
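The weight function above can be evaluated from an adjacency map of the current sample: the triangles completed by an arriving edge are exactly the common sampled neighbours of its endpoints. The function name and the adjacency-dictionary representation are assumptions made for this sketch.

```python
def triangle_weight(edge, adj):
    """Weight 9 * (triangles completed by `edge` within the sample) + 1.

    `adj` maps each vertex to the set of its neighbours in the sampled graph.
    """
    a, b = edge
    completed = len(adj.get(a, set()) & adj.get(b, set()))
    return 9 * completed + 1


# Hypothetical sampled graph: a triangle 1-2-3 plus the edge 1-4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(triangle_weight((2, 4), adj))   # 10: vertex 1 is the only common neighbour
print(triangle_weight((5, 6), adj))   # 1: no triangle is completed
```

The constant 1 keeps the weight positive for edges that complete no triangle, so every edge still has a chance of being sampled.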
- GPS accurately estimates various properties simultaneously
- Consistent performance across graphs from various domains
- A key advantage of GPS in-stream estimation is its smaller variance and tight confidence bounds
Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with < 0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals
[Figure: soc-twitter-2010, three panels (Global Clustering Coeff, Triangle Count, Wedge Count). x-axis: Sample Size |K|, from 10^4 to 10^5; y-axis: Estimated/Actual, shown with the actual value and the upper and lower confidence bounds.]
Sample size = 40K edges
Accurate estimates for large Twitter graph ~ 265M edges, and 17.2B triangles
95% confidence intervals
[Figure: soc-orkut, three panels (Global Clustering Coeff, Triangle Count, Wedge Count). x-axis: Sample Size |K|, from 10^4 to 10^6; y-axis: Estimated/Actual, shown with the actual value and the upper and lower confidence bounds.]
Sample size = 40K edges
Accurate estimates for large social network Orkut ~ 120M edges, and 630M triangles
95% confidence intervals
[Figure: soc-orkut, two panels: Triangles at time t (x_t) and Clustering Coeff. at time t (x_t), each plotted against Stream Size at time t (|K_t|). Series: Actual, Estimate, Upper Bound, Lower Bound.]
GPS in-stream estimates over time
Sample size = 80K edges
95% confidence intervals
[Figure: scatter plot with one point per graph, both axes ranging from 0.994 to 1.006. Graphs: ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google.]
GPS in-stream estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts simultaneously with a single sample
On Sampling from Massive Graph Streams
We observe accurate results with no significant difference in error between the ordering schemes
§ We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for Friendster social network with sample size = 1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
- this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and 6.2x higher using uniform weights, compared to triangle-based weights.
§ A sample is representative if graph properties of interest can be estimated with a known degree of accuracy
§ We proposed a generic framework, Graph Priority Sampling (GPS)
- GPS is an efficient single-pass streaming framework
- GPS selects a representative sample and computes unbiased estimates of counts of connected subsets of edges (e.g., triangles, wedges, ...)
- Theoretical properties of GPS are supported by empirical analysis
§ GPS admits generalizations by allowing the sampling process to depend on the stored state and/or auxiliary variables
§ GPS is a variance-minimizing sampling approach
§ GPS has a relative estimation error < 1%
Thank you!
Questions?
