scalable and efficient algorithms for
analysis of massive, streaming graphs
E. Jason Riedy and David A. Bader
MS76 Scalable Network Analysis: Tools, Algorithms, Applications
SIAM PP, 15 April 2016
HPC Lab, School of Computational Science and Engineering
Georgia Institute of Technology
motivation and applications
(insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graphs are a motif / theme in data analysis.
• Changing and dynamic graphs are important!
outline
1. Motivation and background
2. Incremental PageRank
3. Seed set expansion
4. Community maintenance
5. STINGER: Framework for streaming graph analysis
why graphs?
Another tool, like dense and sparse linear algebra.
• Combine things with pairwise relationships
• Smaller, more generic than raw data.
• Taught (roughly) to all CS students...
• Semantic attributions can capture essential relationships.
• Traversals can be faster than filtering DB joins.
• Provide clear phrasing for queries about relationships.
potential applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine if new connections can access a device or represent new threat in < 5ms...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Integrate all the customer’s data, identify in real-time
streaming graph data
Network data rates:
• Gigabit Ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs per flow)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
We need to analyze only the changes, not the entire graph.
Throughput and latency trade off and expose different levels of concurrency.
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
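The per-second figures above follow from simple arithmetic. A quick check is sketched below. The frame sizes (64 and 1518 bytes) and the 20 bytes of preamble and inter-frame gap are standard Ethernet framing values assumed here; they are not stated on the slide.

```python
# Sanity-check the data rates quoted on the slide.
LINK_BITS_PER_SEC = 1_000_000_000   # gigabit Ethernet line rate
OVERHEAD_BYTES = 20                 # assumed: preamble (8) + inter-frame gap (12)


def packets_per_second(frame_bytes: int) -> float:
    """Frames per second at line rate for a given frame size."""
    return LINK_BITS_PER_SEC / ((frame_bytes + OVERHEAD_BYTES) * 8)


min_rate = packets_per_second(1518)   # largest standard frame, about 81k/s
max_rate = packets_per_second(64)     # smallest frame, about 1.5M/s

flow_budget_us = 1e6 / 130_000        # microseconds available per flow
twitter_per_sec = 500e6 / 86_400      # 500M posts per day
facebook_per_sec = 3e6 / 60           # 3M posts per minute
```

The time budget per flow is the reciprocal of the flow rate, which is where the bound of under 7.7 µs comes from.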
streaming graph analysis
Terminology:
• Streaming changes into a massive, evolving graph
• Not the CS streaming-algorithm model (tiny memory)
• Need to handle deletions as well as insertions
Previous throughput results (not comprehensive review):
Data ingest: >2M up/sec [Ediger, McColl, Poovey, Campbell, & B 2014]
Clustering coefficients: >100K up/sec [R, Meyerhenke, Bader, Ediger, & Mattson 2012]
Connected comp.: >1M up/sec [McColl, Green, & B 2013]
Community clustering: >100K up/sec∗ [R & B 2013]
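The update model in the terminology above (batches of edge insertions and deletions applied to an evolving graph) can be sketched as follows. This is a toy illustration with hypothetical names (`apply_batch`, the `"+"` / `"-"` operation codes), not the STINGER API, and it assumes an undirected graph stored as a dictionary of neighbor sets.

```python
from collections import defaultdict


def apply_batch(adj, batch):
    """Apply (op, u, v) edge updates in order; return the touched vertices.

    adj maps each vertex to the set of its neighbors (undirected graph).
    op is "+" for insertion and "-" for deletion (hypothetical encoding).
    """
    touched = set()
    for op, u, v in batch:
        if op == "+":
            adj[u].add(v)
            adj[v].add(u)
        elif op == "-":
            adj[u].discard(v)   # deleting a missing edge is a no-op
            adj[v].discard(u)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        touched.update((u, v))
    return touched


graph = defaultdict(set)
touched = apply_batch(graph, [("+", 1, 2), ("+", 2, 3), ("-", 1, 2)])
```

Returning the touched vertices matters because the analyses that follow only need to revisit the part of the graph near the changes.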
incremental pagerank
pagerank
Everyone’s “favorite” metric: PageRank.
• Stationary distribution of the random surfer model.
• Eigenvalue problem can be re-phrased as a linear system
$$ (I - \alpha A^T D^{-1}) x = k v, $$
with
  α: teleportation constant, much < 1
  A: adjacency matrix
  D: diagonal matrix of out degrees, with x/0 = x (self-loop)
  v: personalization vector, here 1/|V|
  k: irrelevant scaling constant
• Amenable to analysis, etc.
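As a small illustration of the linear-system view, the sketch below solves (I − αAᵀD⁻¹)x = v by the fixed-point iteration x ← αAᵀD⁻¹x + v and then rescales x to sum to 1 (the scaling constant k is irrelevant). The function name, the out-edge-list input, and the default α = 0.85 are assumptions made for the example, not values from the slides.

```python
def pagerank_jacobi(out_edges, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x <- alpha * A^T D^{-1} x + v until the update is below tol.

    out_edges maps each vertex to the list of vertices it links to.
    Returns the solution of (I - alpha A^T D^{-1}) x = v, scaled to sum to 1.
    """
    vertices = list(out_edges)
    v = 1.0 / len(vertices)                 # uniform personalization, 1/|V|
    x = {u: v for u in vertices}
    for _ in range(max_iter):
        new_x = {u: v for u in vertices}
        for u in vertices:
            targets = out_edges[u] or [u]   # x / 0 = x: treat as a self-loop
            share = alpha * x[u] / len(targets)
            for w in targets:
                new_x[w] += share
        change = sum(abs(new_x[u] - x[u]) for u in vertices)
        x = new_x
        if change < tol:
            break
    total = sum(x.values())
    return {u: x[u] / total for u in vertices}


cycle_ranks = pagerank_jacobi({0: [1], 1: [2], 2: [0]})   # symmetric: all equal
hub_ranks = pagerank_jacobi({0: [1], 1: [0], 2: [0]})     # vertex 0 ranks highest
```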
incremental pagerank
• Streaming data setting, update PageRank without
touching the entire graph.
• Existing methods maintain databases of walks, etc.
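• Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph, want to solve for $x + \Delta x$.
• Simple algebra:
$$ (I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x, $$
and the right-hand side is sparse.
• Re-arrange for Jacobi,
$$ \Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x, $$
iterate, ...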
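A minimal sketch of this Jacobi update on a toy graph, written densely for readability (a real implementation would exploit the sparsity of the right-hand side). The helper names `push`, `solve`, and `incremental_update` are hypothetical, both graphs are assumed to share the same vertex set, and the scaling constant k is taken as 1.

```python
def push(out_edges, x):
    """Return y = A^T D^{-1} x for a graph given as out-edge lists."""
    y = {u: 0.0 for u in out_edges}
    for u, targets in out_edges.items():
        targets = targets or [u]            # x / 0 = x: treat as a self-loop
        for w in targets:
            y[w] += x[u] / len(targets)
    return y


def solve(out_edges, alpha=0.85, iters=400):
    """Reference solve from scratch: iterate x <- alpha A^T D^{-1} x + v."""
    v = 1.0 / len(out_edges)
    x = {u: v for u in out_edges}
    for _ in range(iters):
        y = push(out_edges, x)
        x = {u: alpha * y[u] + v for u in out_edges}
    return x


def incremental_update(old_edges, new_edges, x, alpha=0.85, iters=200):
    """Jacobi iteration for dx, then return x + dx for the new graph."""
    new_y, old_y = push(new_edges, x), push(old_edges, x)
    b = {u: alpha * (new_y[u] - old_y[u]) for u in x}   # nonzero only near changes
    dx = {u: 0.0 for u in x}
    for _ in range(iters):
        pushed = push(new_edges, dx)
        dx = {u: alpha * pushed[u] + b[u] for u in x}
    return {u: x[u] + dx[u] for u in x}


old = {0: [1], 1: [2], 2: [0]}
new = {0: [1, 2], 1: [2], 2: [0]}           # insert edge 0 -> 2
x_new = incremental_update(old, new, solve(old))
```

In this toy case the old solution is computed almost exactly, so a single update matches a solve from scratch. The right-hand side assumes x is exact for the old graph, so any error already in x is carried forward unchanged.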
incremental pagerank: accumulating error
• And it fails. The updated solution wanders away from
the true solution. Top rankings stay the same...
incremental pagerank: think instead
• The old solution x is an OK, but not exact, solution to the original problem, and now to a nearby problem.
• How close? Residual:
$$ r' = kv - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x. $$
• Solve $(I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = r'$.
• Cheat by not refining all of r′, only region growing around the changes:
$$ (I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = r'|_\Delta $$
• (Also cheat by updating r rather than recomputing at the changes.)
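The residual-based update can be sketched on a toy graph as below. The `region` argument is a hypothetical stand-in for the region grown around the changes; the slides do not say how that region is chosen, so the caller supplies it here. The residual is recomputed in full rather than updated, and k is taken as 1.

```python
def push(out_edges, x):
    """Return y = A^T D^{-1} x for a graph given as out-edge lists."""
    y = {u: 0.0 for u in out_edges}
    for u, targets in out_edges.items():
        targets = targets or [u]            # x / 0 = x: treat as a self-loop
        for w in targets:
            y[w] += x[u] / len(targets)
    return y


def residual_update(new_edges, x, region, alpha=0.85, iters=200):
    """Refine x on the new graph using the residual restricted to `region`."""
    v = 1.0 / len(new_edges)
    y = push(new_edges, x)
    r = {u: v - x[u] + alpha * y[u] for u in x}            # r' on the new graph
    rhs = {u: (r[u] if u in region else 0.0) for u in x}   # r' restricted
    dx = {u: 0.0 for u in x}
    for _ in range(iters):                                 # Jacobi on dx
        pushed = push(new_edges, dx)
        dx = {u: alpha * pushed[u] + rhs[u] for u in x}
    return {u: x[u] + dx[u] for u in x}


new = {0: [1, 2], 1: [2], 2: [0]}
rough = {u: 1.0 for u in new}               # deliberately poor starting guess
refined = residual_update(new, rough, region=set(new))
```

Because the right-hand side is the actual residual of x on the new graph, error already present in x is corrected wherever the region reaches. With the region covering every vertex, even the poor starting guess above converges to the true solution.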
incremental pagerank: works
Riedy, GABB at IPDPS 2016, to appear.
incremental pagerank: worst latency
[Figure: update time (s) versus batch size (10, 100, 1000) for the algorithms dpr, dprheld, and pr_restart, one panel per graph: power, PGPgiantcompo, caidaRouterLevel, belgium.osm, coPapersCiteseer.]
Riedy, GABB at IPDPS 2016, to appear.
seed set expansion
graphs: big, nasty hairballs
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
but no shortage of structure...
Protein interactions, Giot et al., “A Protein
Interaction Map of Drosophila melanogaster”,
Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• There are methods for global community detection.
• Also need local communities around seeds for
queries and targeted analysis.
seed set expansion
• Seed set expansion finds the “best” subgraph or
communities for a set of vertices of interest
• Many quality criteria: Modularity, conductance, etc.
• Can be applied to cryptocurrency to identify and
track groups of interacting entities
• Dynamic algorithm updates communities faster than
recomputation, allowing us to keep up with new data
produced
static seed set expansion
Greedy expansion starting from S = { seed vertices }:
1. Check the fitness of every vertex v neighboring S.
• fitness = f(S ∪ {v}) − f(S)
2. If any fitness is positive, include most fit v in S.
• Currently sequential, could include all sufficiently
good neighbors.
3. Record at which step v is included.
• This list is the base for updates.
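The greedy steps above can be sketched as follows. The quality function f is an assumption chosen for illustration (the fraction of edge endpoints in S whose other endpoint is also in S); the slides leave f open. The graph is assumed undirected, stored as a dictionary of neighbor sets with sortable vertex ids.

```python
def fitness(adj, S):
    """Fraction of edge endpoints in S whose other endpoint is also in S."""
    internal = sum(1 for u in S for w in adj[u] if w in S)  # each edge counted twice
    total = sum(len(adj[u]) for u in S)
    return internal / total if total else 0.0


def expand_seed_set(adj, seeds):
    """Greedily grow S from the seeds; record the step each vertex joined."""
    S = set(seeds)
    order = [(0, v) for v in sorted(S)]     # seeds join at step 0
    step = 0
    while True:
        frontier = {w for u in S for w in adj[u]} - S
        base = fitness(adj, S)
        gains = {v: fitness(adj, S | {v}) - base for v in frontier}
        if not gains:
            break
        best = max(sorted(gains), key=gains.get)   # ties go to the smallest id
        if gains[best] <= 0:
            break                           # no neighbor improves the fitness
        step += 1
        S.add(best)
        order.append((step, best))
    return S, order


# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, order = expand_seed_set(adj, [0])
```

Starting from vertex 0, the expansion absorbs the first triangle and stops at the bridge, because adding vertex 3 would lower the fitness. The recorded `order` is the list that later updates start from.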
Now the dynamic version by example...
dynamic seed set example
In preparation with Anita Zakrzewska and Eisha Nathan
dynamic seed set quality
Graphs from the Koblenz Network Collection.
dynamic seed set speed-up
Graphs from the Koblenz Network Collection.
global community updates
community detection
• Partition a graph’s vertices into disjoint communities.
• A community locally optimizes some metric, NP-hard.
• Trying to capture that vertices are more similar within one community than between communities.
Jason’s network via LinkedIn Labs
what about streaming?
• Simple approach based on agglomeration:
1. Extract all vertices touched by an update.
2. Re-start agglomeration.
• “Works” and is fast (MTAAP 2013), but never
mentioned quality.
• Extracted vertices form bridges and do not re-merge.
• Some methods based on label propagation, not
metric-driven.
• Backtracking (e.g. Görke, et al., JEA 2013) preserves
quality at cost of change size.
• Ongoing: Can we limit backtracking?
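Step 1 of the simple approach above (extract all vertices touched by an update) can be sketched as below. The function name and the integer community labels are hypothetical. Step 2, re-starting agglomeration on the resulting community graph, is not shown.

```python
def extract_touched(community, touched):
    """Return a copy of `community` with each touched vertex in a fresh singleton.

    community maps vertex -> integer community label.
    touched is the set of vertices that appear in the update batch.
    """
    updated = dict(community)
    next_label = max(community.values(), default=-1) + 1
    for v in sorted(touched):               # sorted only to keep labels deterministic
        updated[v] = next_label             # new, unused label: a singleton community
        next_label += 1
    return updated


labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
after = extract_touched(labels, {2, 3})     # e.g. edge 2-3 was inserted
```

After extraction, vertices 2 and 3 sit alone between the two remaining communities, which is how extracted vertices can end up acting as bridges if the re-started agglomeration does not merge them back.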
community quality: achievable
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
community quality: change size
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
closing
future directions
• Of course, continuing to develop streaming /
dynamic / incremental algorithms.
• For massive graphs, computing small changes is
always a win.
• Improving approximations or replacing expensive
metrics like betweenness centrality would be great.
• Including more external and semantic data.
• If vertices are documents or data records, many
more measures of similarity.
• Only now being exploited in concert with static graph
algorithms.
hpc lab people
Faculty:
• David A. Bader
• Oded Green
Data:
• Pushkar Godbolé
• Anita Zakrzewska
• Eisha Nathan
STINGER:
• Robert McColl,
• James Fairbanks,
• Adam McLaughlin,
• Daniel Henderson,
• David Ediger (now GTRI),
• Jason Poovey (GTRI),
• Karl Jiang, and
• feedback from users in industry, government, academia
Support: DoD, DoE, NSF, Intel, IBM, Oracle
stinger: where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing
with contributions.
Users / contributors / questioners:
Georgia Tech, PNNL, CMU, Berkeley, Intel,
Cray, NVIDIA, IBM, Federal Government,
Ionic Security, Citi, ...