Simplicial closure and
higher-order link
prediction
Austin R. Benson · Cornell
LA/OPT Seminar · April 26, 2018
bit.ly/sc-holp-code bit.ly/arb-laopt-2018
bit.ly/sc-holp-data arXiv 1802.06916
Joint work with
Rediet Abebe & Jon Kleinberg
(Cornell)
Michael Schaub & Ali Jadbabaie (MIT)
Networks are sets of nodes and edges (graphs)
that model real-world systems.
Collaboration
nodes are
people/groups
edges link entities
working together
Communications
nodes are
people/accounts
edges show info.
exchange
Physical proximity
nodes are people/animals
edges link those that
interact in close proximity
Drug compounds
nodes are substances
edge between substances
that appear in the same drug
Real-world systems are often composed of
“higher-order” interactions that we reduce to
pairwise ones.
Collaboration
nodes are
people/groups
teams are made up of
small groups
Communications
nodes are
people/accounts
emails often have several
recipients, not just one
Physical proximity
nodes are people/animals
people often gather in
small groups
Drug compounds
nodes are substances
drugs are made up of
several substances
There are many ways to mathematically represent
the higher-order structure present in relational
data.
• Hypergraphs [Berge 89]
• Set systems [Frankl 95]
• Affiliation networks [Feld 81, Newman-Watts-Strogatz 02]
• Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17]
• Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…]
• Meta-paths [Sun-Han 12]
• Motif-based representations [Benson-Gleich-Leskovec 15, 17]
• …
However, there are problems with higher-order data
representations…
1. Researchers and practitioners often don’t use them.
Graphs are easier and we already have lots of graph analysis tools.
2. We don’t have good frameworks for evaluating them
(especially for what they can do beyond graphs).
A goal of this research is to provide a framework through
which higher-order models can be evaluated.
Link prediction is a classical machine learning
problem in network science, which is used to
evaluate models.
We observe data which is a list of edges in a graph up to some point t.
We want to predict which new edges will form in the future.
Shows up in a variety of applications
• Predicting new social relationships and friend recommendation.
[Backstrom-Leskovec 11; Wang+ 15]
• Inferring new links between genes and diseases.
[Wang-Gulbahce-Yu 11; Moreau-Tranchevent 12]
• Suggesting novel connections in the scientific community.
[Liben-Nowell-Kleinberg 07; Tang-Wu-Sun-Su 12]
Also useful as a framework to evaluate new methods!
[Liben-Nowell-Kleinberg 07; Lü-Zhou 11]
We propose “higher-order link prediction” as a
similar framework for evaluation of higher-order
models.
Data.
Observe simplices up to some
time t. Using this data, want to
predict what groups of > 2
nodes will appear in a simplex
in the future.
[Figure: example simplices on nodes 1-9, observed up to time t.]
We predict structure that classical link
prediction would not even consider!
Possible applications
• Novel combinations of drugs for treatments.
• Group chat recommendation in social networks.
• Team formation.
Thinking of higher-order data as a weighted
projected graph with “filled in” structures is a
convenient viewpoint.
[Figure: example on nodes 1-9. Panels: “Data.” and “Pictures to have in mind.”]
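A minimal sketch of the projected-graph viewpoint. It assumes the weight of edge (u, v) is the number of simplices containing both u and v; the slides do not spell out the weighting, so treat that as an assumption. The name `projected_graph` and the toy simplices are illustrative only.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Return W, where W[(u, v)] with u < v counts the simplices containing u and v."""
    weights = Counter()
    for simplex in simplices:
        # Every pair of nodes inside a simplex contributes one unit of weight.
        for u, v in combinations(sorted(set(simplex)), 2):
            weights[(u, v)] += 1
    return weights


# Toy data: four simplices on nodes 1..5 (made up for illustration).
simplices = [{1, 2, 3}, {2, 3}, {3, 4}, {1, 2, 3, 5}]
W = projected_graph(simplices)
print(W[(2, 3)], W[(1, 2)], W[(3, 4)])
```

The projection alone forgets which groups were “filled in”; keeping the original simplices alongside `W` is what lets us tell open from closed triangles later.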
Generalized means of edge weights are often
good predictors of new 3-node simplices
appearing.
Good performance from this local information is a deviation from classical link prediction,
where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg
07].
For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2.
[Figure: an open triangle on nodes i, j, k with edge weights Wij, Wjk, Wik; will it close?]
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
Our datasets are timestamped simplices, where
each simplex is a subset of nodes.
1. Co-authorship in different domains.
2. Emails with multiple recipients.
3. Tags on Q&A forums.
4. Threads on Q&A forums.
5. Contact/proximity measurements.
6. Musical artist collaboration.
7. Substance makeup and classification
codes applied to drugs the FDA
examines.
8. U.S. Congress committee
memberships and bill sponsorship.
9. Combinations of drugs seen in
patients in ER visits.
https://math.stackexchange.com/q/80181
Warm-up. What’s more common?
“Open triangle”: each pair has been in a simplex together, but all 3 nodes have never been in the same simplex.
or
“Closed triangle”: there is some simplex that contains all 3 nodes.
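The two definitions can be sketched directly in code. This is an illustrative implementation, not the authors' counting code: it builds the projected graph's edges, marks every 3-subset of a simplex as closed, and reports the remaining triangles as open. The names and toy data are assumptions.

```python
from itertools import combinations


def classify_triangles(simplices):
    """Return (open_triangles, closed_triangles) as sets of frozensets of 3 nodes."""
    edges, closed, adj = set(), set(), {}
    for s in simplices:
        for u, v in combinations(s, 2):
            edges.add(frozenset((u, v)))
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # Any 3 nodes that share a simplex form a closed triangle.
        closed.update(frozenset(t) for t in combinations(s, 3))
    open_ = set()
    for e in edges:
        u, v = tuple(e)
        # A common neighbor w of u and v completes a triangle in the projected graph.
        for w in adj[u] & adj[v]:
            tri = frozenset((u, v, w))
            if tri not in closed:
                open_.add(tri)
    return open_, closed


simplices = [{1, 2, 3}, {1, 4}, {2, 4}, {3, 5}, {1, 5}]
open_tris, closed_tris = classify_triangles(simplices)
print(sorted(sorted(t) for t in open_tris))
```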
There is lots of variation in the fraction of triangles
that are open, but datasets from the same domain
are similar.
Most open triangles do not come from
asynchronous temporal behavior.
In 61.1% to 97.4% of open triangles, all three pairs of edges
have an overlapping period of activity.
⟶ there is an overlapping period of activity between all 3 edges
(Helly’s theorem).
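A small sketch of the Helly argument in one dimension: if every pair of intervals overlaps, then all of them share a common point. Here each edge's “period of activity” is modeled as a closed interval (start, end); how those intervals are derived from the data is not given on the slide, so the interval values below are made up.

```python
def pairwise_overlap(intervals):
    """True if every pair of closed intervals (lo, hi) intersects."""
    return all(
        max(a[0], b[0]) <= min(a[1], b[1])
        for idx, a in enumerate(intervals)
        for b in intervals[idx + 1:]
    )


def common_overlap(intervals):
    """Return the interval shared by all inputs, or None if there is none."""
    start = max(lo for lo, _ in intervals)
    end = min(hi for _, hi in intervals)
    return (start, end) if start <= end else None


# Three edges of an open triangle whose activity periods overlap pairwise.
synchronous = [(1, 5), (3, 8), (4, 10)]
# Here the first edge stops being active before the other two start.
asynchronous = [(1, 2), (3, 8), (4, 10)]
print(pairwise_overlap(synchronous), common_overlap(synchronous))
print(pairwise_overlap(asynchronous), common_overlap(asynchronous))
```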
A simple model can account for open triangle
variation.
Consider a dataset with n nodes.
Form a 3-node simplex between nodes i, j, k with prob. $p = 1/n^b$, i.i.d.
Thus, we always get $\Theta(pn^3) = \Theta(n^{3-b})$ closed triangles in expectation.
Proposition (rough).
If b < 1, then we get $\Theta(n^3)$ open triangles in expectation for large n.
If b > 1, then we get $\Theta(n^{3(2-b)})$ open triangles in expectation for large n.
⟶ The number of open triangles grows faster than the number of closed triangles for b < 3/2.
b = 0.8, 0.82, 0.84, ..., 1.8
Larger b is darker marker.
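The model is easy to simulate on small n. The sketch below is a brute-force illustration, not the code behind the plot: it samples each 3-node simplex independently with probability n^(-b), then counts open and closed triangles by enumerating all triples, which takes O(n^3) time. The function name and the seed parameter are assumptions.

```python
import random
from itertools import combinations


def simulate(n, b, seed=0):
    """Sample the model once; return (number of open triangles, number of closed)."""
    rng = random.Random(seed)
    p = n ** (-b)
    # Each triple becomes a 3-node simplex independently with probability p.
    closed = {t for t in combinations(range(n), 3) if rng.random() < p}
    edges = set()
    for t in closed:
        edges.update(combinations(t, 2))
    # Open triangle: all three edges exist, but the triple itself was never sampled.
    open_count = sum(
        1
        for t in combinations(range(n), 3)
        if t not in closed and all(e in edges for e in combinations(t, 2))
    )
    return open_count, len(closed)


for b in (0.8, 1.2, 1.8):
    print(b, simulate(30, b))
```

With b = 0 every triple is sampled, so there are no open triangles at all; as b grows past 3/2 the proposition says closed triangles should eventually dominate.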
Summary of structure of datasets.
1. Datasets with higher-order structure are common.
2. It’s useful to think of them as “projected graphs” with additional filled
in structure (like a simplicial complex).
3. Perhaps surprisingly, several datasets have many open triangles.
Not due to asynchronous temporal behavior.
A simple model can give a range of open triangles.
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
Simplicial closure is the first time a group of
nodes appears together in a simplex.
[Figure: 3-node configurations with edges labeled by weight (1 or 2+), annotated with counts. Co-authorship data of authors publishing in history.]
Substances in marketed drugs recorded in the National Drug Code directory.
Nodes: HIV protease inhibitors; UGT1A1 inhibitors; Breast cancer resistance protein inhibitors.
• Reyataz, RedPharm, 2003
• Reyataz, Squibb & Sons, 2003
• Kaletra, Physicians Total Care, 2006
• Promacta, GSK (25mg), 2008
• Promacta, GSK (50mg), 2008
• Kaletra, DOH Central Pharmacy, 2009
• Evotaz, Squibb & Sons, 2015
Tags on askubuntu.com forum questions.
Nodes: icons; colors; 16.04.
• How can I change the icon colors, appearance, etc. at the top panel? 2011
• How do I change the icon and text color? 2012
• Ubuntu 15.10 / 16.04 theme doesn’t change, 2016
• Ubuntu 16.04 Eclipse launcher icon problems, 2016
• Set desktop icons background color Kubuntu 16.04, 2016
Simplicial closure probability on 3 nodes largely
depends on the weights in the projected network.
Take first 80% of the data (in time), record the configuration of every triple of
nodes, and compute the fraction that simplicially close in the final 20% of the data.
Increased edge density
increases closure
probability.
Increased tie strength
increases closure
probability.
Tension between
edge density and tie
strength.
Left and middle observations are consistent with theory and empirical studies of social
networks. [Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06]
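The 80/20 procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the split is by number of simplices after sorting by timestamp, a triple's “configuration” is reduced to how many projected-graph edges it has in the first 80% (edge weights are ignored), and all triples are enumerated by brute force. The function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def closure_fractions(timestamped_simplices, train_frac=0.8):
    """Map (number of edges among a triple in the early data) -> fraction that close later."""
    ordered = sorted(timestamped_simplices, key=lambda ts: ts[0])
    cut = int(train_frac * len(ordered))
    train = [frozenset(s) for _, s in ordered[:cut]]
    test = [frozenset(s) for _, s in ordered[cut:]]

    nodes, edges, closed_train = set(), set(), set()
    for s in train:
        nodes |= s
        edges.update(frozenset(p) for p in combinations(s, 2))
        closed_train.update(frozenset(t) for t in combinations(s, 3))
    closed_test = set()
    for s in test:
        closed_test.update(frozenset(t) for t in combinations(s, 3))

    seen, closes = Counter(), Counter()
    for triple in combinations(sorted(nodes), 3):
        t = frozenset(triple)
        if t in closed_train:
            continue  # already closed, so it cannot close for the first time later
        k = sum(frozenset(p) in edges for p in combinations(triple, 2))
        seen[k] += 1
        closes[k] += int(t in closed_test)
    return {k: closes[k] / seen[k] for k in seen}


data = [(1, {1, 2}), (2, {2, 3}), (3, {1, 3}), (4, {3, 4}), (5, {1, 2, 3})]
fractions = closure_fractions(data)
print(fractions)
```

A fuller version would also bucket each edge by weight (1 versus 2+), which is what separates the tie-strength effect from the edge-density effect.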
Simplicial closure probability on 4 nodes
behaves similarly to that on 3 nodes, just “up
one dimension”.
Take first 80% of the data (in time), record the configuration of every 4
nodes, and compute the fraction that simplicially close in the final 20% of
the data.
Increased edge density
increases closure
probability.
Increased simplicial tie
strength increases
closure probability.
Tension between
simplicial density
and simplicial tie
strength.
Computing simplicial closure probabilities
requires some clever counting algorithms.
How many sets of 4 nodes induce the structure?
[Figure: a 4-node configuration with edges labeled by weight (1 or 2+).]
Assume we can enumerate 3- and 4-cliques and quickly
determine their “simplicial structure”.
Induced count = c - 2 * # - 2 * # - # - 3 * # - 3 * # - 4 * #
(each # stands for the count of a configuration shown as a picture on the slide)
Summary of simplicial closure.
1. Useful to think of “trajectories” of sets of nodes in the projected graph.
2. More edges among a group of nodes ⟶ more likely to close in the future.
True for both 3-node and 4-node groups!
3. Stronger ties among a group of nodes ⟶ also more likely to close.
True for both 3-node and 4-node groups!
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
Our structural analysis tells us what we should be
looking at for prediction.
1. Edge density is a positive indicator.
⟶ focus our attention on predicting “open triangles”
2. Tie strength is a positive indicator.
⟶ various ways of incorporating this information
For every open triangle, we compute a score
on the first 80% of the data based on structural
properties.
Four broad classes of score functions for
an open triangle. Score is s(i, j, k).
s(i, j, k)…
1. is a function of Wij, Wjk, Wik
2. is a function of A[:, [i, j, k]] and the
simplices containing these nodes
3. is built from “whole-network”
similarity scores on edges
4. is “learned” from data
After computing scores, predict that open triangles with highest
scores will simplicially close in the final 20% of the data.
1. s(i, j, k) is a function of Wij, Wjk, and Wik
• Arithmetic mean
• Geometric mean
• Harmonic mean
• Generalized mean
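A minimal sketch of this first class of scores. The generalized (power) mean with exponent p covers the other three as special cases: p = 1 is the arithmetic mean, p = -1 the harmonic mean, and the limit p → 0 the geometric mean. It assumes all three weights are positive, which holds for an open triangle since each edge appears in at least one simplex. The function name and example weights are illustrative.

```python
import math


def generalized_mean(weights, p):
    """Power mean of positive weights; p = 0 is treated as the geometric-mean limit."""
    n = len(weights)
    if p == 0:
        return math.exp(sum(math.log(w) for w in weights) / n)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


# Score for an open triangle with hypothetical edge weights Wij, Wjk, Wik.
w = (1, 2, 4)
arithmetic = generalized_mean(w, 1)
geometric = generalized_mean(w, 0)
harmonic = generalized_mean(w, -1)
print(harmonic, geometric, arithmetic)
```

Smaller p penalizes a triangle more heavily for having one weak edge, so the choice of p trades off edge density against tie strength.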
2. s(i, j, k) is a function of A[:, [i, j,
k]] and the related simplicial information
• Common 4th neighbors
• Generalized Jaccard coefficient
• Preferential attachment (graph)
• Preferential attachment (simplices)
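A sketch of these four local scores. The slide only names them, so the definitions here are assumptions: common 4th neighbors counts nodes adjacent to all of i, j, k; the generalized Jaccard coefficient divides that by the size of the union of the three neighborhoods; and the two preferential attachment scores multiply, respectively, the projected-graph degrees and the number of simplices containing each node.

```python
from collections import Counter
from itertools import combinations


def build_neighborhoods(simplices):
    """Return (adjacency sets of the projected graph, simplices-per-node counts)."""
    adj, simplex_counts = {}, Counter()
    for s in simplices:
        for u in s:
            simplex_counts[u] += 1
            adj.setdefault(u, set())
        for u, v in combinations(s, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj, simplex_counts


def local_scores(adj, simplex_counts, i, j, k):
    common = adj[i] & adj[j] & adj[k]
    union = adj[i] | adj[j] | adj[k]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "pref_attach_graph": len(adj[i]) * len(adj[j]) * len(adj[k]),
        "pref_attach_simplices": simplex_counts[i] * simplex_counts[j] * simplex_counts[k],
    }


simplices = [{1, 2}, {1, 2}, {2, 3}, {1, 3}, {1, 4}, {2, 4}, {3, 4}, {3, 5}]
adj, counts = build_neighborhoods(simplices)
scores = local_scores(adj, counts, 1, 2, 3)
print(scores)
```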
3. s(i, j, k) is built from “whole-network”
similarity scores on edges: s(i, j, k) = Sij + Sji + Sjk +
Skj + Sik + Ski
• PageRank (unweighted or weighted)
• Katz (unweighted or weighted)
• Fancier topological methods based
on lifted random walks
(work in progress…)
Hopefully Michael chimed in on matrix inverses…
Want to reduce computational cost
(only need scores on open triangles).
For each node i that participates in an open triangle
1. Solve [linear system; equation not recovered] using GMRES with low tolerance
2. Store ith column
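A sketch of the column-at-a-time idea for the Katz case. The slide's own linear system is not recoverable here, so this assumes the standard Katz similarity S = (I - βA)^{-1} - I, whose ith column solves (I - βA) x = e_i. To stay within the standard library, a plain fixed-point iteration x ← e_i + βAx stands in for GMRES; it converges only when β times the spectral radius of A is below 1. All names are illustrative.

```python
from itertools import permutations


def katz_column(adj, i, beta, tol=1e-12, max_iter=10_000):
    """Column i of (I - beta*A)^{-1} - I for an undirected graph given as adjacency sets."""
    x = {v: 0.0 for v in adj}
    for _ in range(max_iter):
        # One step of x <- e_i + beta * A x (a stand-in for a Krylov solver such as GMRES).
        new = {
            v: (1.0 if v == i else 0.0) + beta * sum(x[u] for u in adj[v])
            for v in adj
        }
        delta = max(abs(new[v] - x[v]) for v in adj)
        x = new
        if delta < tol:
            break
    x[i] -= 1.0  # subtract the identity
    return x


def triple_score(cols, i, j, k):
    """s(i, j, k) = Sij + Sji + Sjk + Skj + Sik + Ski, with S_uv read from column v."""
    return sum(cols[v][u] for u, v in permutations((i, j, k), 2))


# Projected graph of one open triangle on nodes 1, 2, 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
# Only the columns for nodes that sit in some open triangle are needed.
cols = {v: katz_column(adj, v, beta=0.25) for v in (1, 2, 3)}
score = triple_score(cols, 1, 2, 3)
print(score)
```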
4. s(i, j, k) is learned from data
1. Split data into training and validation sets.
2. Compute features of (i, j, k) from previous ideas using training data.
3. Throw features + validation labels into machine learning blender
→ learn model.
4. Re-compute features on combined training + validation
→ apply model on the data.
A few lessons learned from applying all of these
ideas.
1. We can predict pretty well on all datasets using some method.
→ 4x to 107x better than random w/r/t mean average precision,
depending on the dataset/method
2. On some classes of datasets, we can consistently predict well.
→ thread co-participation and co-tagging on Stack Exchange
3. Simply averaging Wij, Wjk, and Wik consistently performs well.
→ something between harmonic and geometric mean is
consistently best
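To make “better than random” concrete, here is a sketch of average precision for a ranked list of open triangles, compared against the fraction of triangles that actually close. Using that fraction as the random baseline is an assumption here, as are the function name and the toy scores and labels.

```python
def average_precision(scores, labels):
    """Average of precision@rank over the ranks at which a true closure appears."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, closed) in enumerate(ranked, start=1):
        if closed:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical scores for four open triangles and whether each closed later.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
baseline = sum(labels) / len(labels)  # assumed random baseline
improvement = ap / baseline
print(ap, baseline, improvement)
```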
Shameless plug for SIAM ALA 2018!
David Gleich and I are organizing a minitutorial.
Tensor Eigenvectors and Stochastic Processes
May 6, 10:45am – 12:45pm
Learn about how stochastics help us understand
problems in numerical multilinear algebra!
It’s a minitutorial—little background is assumed.
Lots of open problems for ICME students to solve.
Simplicial closure and higher-order link prediction.
Paper. arXiv 1802.06916 Code. bit.ly/sc-holp-code
Data. bit.ly/sc-holp-data Slides. bit.ly/arb-laopt-2018
THANKS!
http://cs.cornell.edu/~arb
@austinbenson
arb@cs.cornell.edu
1. We want higher-order models to
be pervasive.
2. Higher-order link prediction is a
way to compare models.
3. Have a better model or
algorithm? Great! Show us on
the data.
4. Lots of interesting structural
information in these datasets.

More Related Content

PDF
Computational Frameworks for Higher-order Network Data Analysis
PDF
Higher-order link prediction
PDF
Three hypergraph eigenvector centralities
PDF
Higher-order link prediction and other hypergraph modeling
PPTX
Simplicial closure and higher-order link prediction (SIAMNS18)
PDF
Set prediction three ways
PDF
Simplicial closure & higher-order link prediction
PPTX
Simplicial closure and higher-order link prediction
Computational Frameworks for Higher-order Network Data Analysis
Higher-order link prediction
Three hypergraph eigenvector centralities
Higher-order link prediction and other hypergraph modeling
Simplicial closure and higher-order link prediction (SIAMNS18)
Set prediction three ways
Simplicial closure & higher-order link prediction
Simplicial closure and higher-order link prediction

What's hot (19)

PDF
Higher-order Link Prediction GraphEx
PDF
Link prediction in networks with core-fringe structure
PDF
Semi-supervised learning of edge flows
PDF
Higher-order Link Prediction Syracuse
PDF
Sampling methods for counting temporal motifs
PDF
Unsupervised Learning of a Social Network from a Multiple-Source News Corpus
PDF
Sequences of Sets KDD '18
PDF
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
PDF
Choosing to grow a graph
PDF
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
PPS
Classement Leiden Ranking
PPTX
Supporting scientific discovery through linkages of literature and data
PDF
An information-theoretic, all-scales approach to comparing networks
DOC
Equation 2.doc
PDF
BookyScholia: A Methodology for the Investigation of Expert Systems
PDF
Adaptive named entity recognition for social network analysis and domain onto...
PDF
DESIGN METHODOLOGY FOR RELATIONAL DATABASES: ISSUES RELATED TO TERNARY RELATI...
PDF
Predicting_new_friendships_in_social_networks
PDF
OntoFrac-S
Higher-order Link Prediction GraphEx
Link prediction in networks with core-fringe structure
Semi-supervised learning of edge flows
Higher-order Link Prediction Syracuse
Sampling methods for counting temporal motifs
Unsupervised Learning of a Social Network from a Multiple-Source News Corpus
Sequences of Sets KDD '18
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
Choosing to grow a graph
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
Classement Leiden Ranking
Supporting scientific discovery through linkages of literature and data
An information-theoretic, all-scales approach to comparing networks
Equation 2.doc
BookyScholia: A Methodology for the Investigation of Expert Systems
Adaptive named entity recognition for social network analysis and domain onto...
DESIGN METHODOLOGY FOR RELATIONAL DATABASES: ISSUES RELATED TO TERNARY RELATI...
Predicting_new_friendships_in_social_networks
OntoFrac-S
Ad

Similar to Simplicial closure and higher-order link prediction LA/OPT (20)

PDF
Simplicial closure and simplicial diffusions
PDF
Simplicial closure and higher-order link prediction --- SIAMNS18
PDF
Protein-protein interactions-graph-theoretic-modeling
PDF
Simplicial closure & higher-order link prediction
PDF
Biological networks 1st Edition Frantois Kopos
PDF
Biological networks 1st Edition Frantois Kopos all chapter instant download
PDF
Biological networks 1st Edition Frantois Kopos
PDF
Summary Of Thesis
PPT
Socialnetworkanalysis (Tin180 Com)
PDF
Engineering Data Science Objectives for Social Network Analysis
PPTX
Master defence 2020 - Serhii Brodiuk - Concept Embedding and Network Analysis...
PPTX
Community detection
PPTX
02 Network Data Collection
PPTX
02 Network Data Collection (2016)
PDF
Carmine gelormini network analysis
PDF
Chemistry Reserach as a Social Machine
PDF
Maps of sparse memory networks reveal overlapping communities in network flows
PDF
Community structure in social and biological structures
PDF
Phylogenetics
PDF
Distribution of maximal clique size of the
Simplicial closure and simplicial diffusions
Simplicial closure and higher-order link prediction --- SIAMNS18
Protein-protein interactions-graph-theoretic-modeling
Simplicial closure & higher-order link prediction
Biological networks 1st Edition Frantois Kopos
Biological networks 1st Edition Frantois Kopos all chapter instant download
Biological networks 1st Edition Frantois Kopos
Summary Of Thesis
Socialnetworkanalysis (Tin180 Com)
Engineering Data Science Objectives for Social Network Analysis
Master defence 2020 - Serhii Brodiuk - Concept Embedding and Network Analysis...
Community detection
02 Network Data Collection
02 Network Data Collection (2016)
Carmine gelormini network analysis
Chemistry Reserach as a Social Machine
Maps of sparse memory networks reveal overlapping communities in network flows
Community structure in social and biological structures
Phylogenetics
Distribution of maximal clique size of the
Ad

More from Austin Benson (12)

PDF
Hypergraph Cuts with General Splitting Functions (JMM)
PDF
Spectral embeddings and evolving networks
PDF
Hypergraph Cuts with General Splitting Functions
PDF
Hypergraph Cuts with General Splitting Functions
PDF
Random spatial network models for core-periphery structure
PDF
Random spatial network models for core-periphery structure.
PPTX
Higher-order clustering in networks
PPTX
New perspectives on measuring network clustering
PPTX
Higher-order spectral graph clustering with motifs
PPTX
Tensor Eigenvectors and Stochastic Processes
PPTX
Spacey random walks CMStatistics 2017
PPTX
Spacey random walks CAM Colloquium
Hypergraph Cuts with General Splitting Functions (JMM)
Spectral embeddings and evolving networks
Hypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting Functions
Random spatial network models for core-periphery structure
Random spatial network models for core-periphery structure.
Higher-order clustering in networks
New perspectives on measuring network clustering
Higher-order spectral graph clustering with motifs
Tensor Eigenvectors and Stochastic Processes
Spacey random walks CMStatistics 2017
Spacey random walks CAM Colloquium

Recently uploaded (20)

PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPT
ISS -ESG Data flows What is ESG and HowHow
PDF
Lecture1 pattern recognition............
PDF
Mega Projects Data Mega Projects Data
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
modul_python (1).pptx for professional and student
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPT
DATA COLLECTION METHODS-ppt for nursing research
PDF
Business Analytics and business intelligence.pdf
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PDF
[EN] Industrial Machine Downtime Prediction
PDF
Microsoft Core Cloud Services powerpoint
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
SAP 2 completion done . PRESENTATION.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Leprosy and NLEP programme community medicine
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PDF
annual-report-2024-2025 original latest.
Galatica Smart Energy Infrastructure Startup Pitch Deck
ISS -ESG Data flows What is ESG and HowHow
Lecture1 pattern recognition............
Mega Projects Data Mega Projects Data
IBA_Chapter_11_Slides_Final_Accessible.pptx
modul_python (1).pptx for professional and student
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
DATA COLLECTION METHODS-ppt for nursing research
Business Analytics and business intelligence.pdf
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
[EN] Industrial Machine Downtime Prediction
Microsoft Core Cloud Services powerpoint
Acceptance and paychological effects of mandatory extra coach I classes.pptx
SAP 2 completion done . PRESENTATION.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Qualitative Qantitative and Mixed Methods.pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Leprosy and NLEP programme community medicine
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
annual-report-2024-2025 original latest.

Simplicial closure and higher-order link prediction LA/OPT

  • 1. Simplicial closure and higher-order link prediction Austin R. Benson · Cornell LA/OPT Seminar · April 26, 2018 LA/OPT Seminar 1 bit.ly/sc-holp-code bit.ly/arb-laopt-2018 bit.ly/sc-holp-data arXiv 1802.06916 Joint work with Rediet Abebe & Jon Kleinberg (Cornell) Michael Schaub & Ali Jadbabaie (MIT)
  • 2. Networks are sets of nodes and edges (graphs) that model real-world systems. LA/OPT Seminar 2 Collaboration nodes are people/groups edges link entities working together Communications nodes are people/accounts edges show info. exchange Physical proximity nodes are people/animals edges link those that interact in close proximity Drug compounds nodes are substances edge between substances that appear in the same drug
  • 3. Real-world systems are often composed from “higher-order” interactions that we reduce to pairwise ones. LA/OPT Seminar 3 Collaboration nodes are people/groups teams are made up of small groups Communications nodes are people/accounts emails often have several recipients, not just one Physical proximity nodes are people/animals people often gather in small groups Drug compounds nodes are substances drugs are made up of several substances
  • 4. There are many ways to mathematically represent the higher-order structure present in relational data. LA/OPT Seminar 4 • Hypergraphs [Berge 89] • Set systems [Frankl 95] • Affiliation networks [Feld 81, Newman-Watts-Strogatz 02] • Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17] • Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…] • Meta-paths [Sun-Han 12] • Motif-based representations [Benson-Gleich-Leskovec 15, 17] • …
  • 5. LA/OPT Seminar 5 However, there are problems with higher-order data representation… 1. Researchers and practitioners often don’t use it. Graphs are easier and we already have lots of graph analysis tools. 2. We don’t have good frameworks for evaluating them (especially for what they can do beyond graphs). A goal with this research is to provide a framework through which higher-order models can be evaluated.
  • 6. Link prediction is a classical machine learning problem in network science, which is used to evaluate models. LA/OPT Seminar 6 We observe data which is a list of edges in a graph up to some point t. We want to predict which new edges will form in the future. Shows up in a variety of applications • Predicting new social relationships and friend recommendation. [Backstrom-Leskovec 11; Wang+ 15] • Inferring new links between genes and diseases. [Wang-Gulbahce-Yu 11; Moreau-Tranchevent 12] • Suggesting novel connections in the scientific community. [Liben-Nowell-Kleinberg 07; Tang-Wu-Sun-Su 12] Also useful as a framework to evaluate new methods! [Liben-Nowell-Kleinberg 07; Lü-Zhau 11]
  • 7. We propose “higher-order link prediction” as a similar framework for evaluation of higher-order models. LA/OPT Seminar 7 Data. Observe simplices up to some time t. Using this data, want to predict what groups of > 2 nodes will appear in a simplex in the future. t 1 2 3 4 5 6 7 8 9 We predict structure that classical link prediction would not even consider! Possible applications • Novel combinations of drugs for treatments. • Group chat recommendation in social networks. • Team formation.
  • 8. Thinking of higher-order data as a weighted projected graph with “filled in” structures is a convenient viewpoint. LA/OPT Seminar 8 1 2 3 4 5 6 7 8 9 Data. Pictures to have in mind.
  • 9. Generalized means of edges weights are often good predictors of new 3-node simplices appearing. LA/OPT Seminar 9 Good performance from this local information is a deviation from classical link prediction, where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg 07]. For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2. i j k Wij Wjk Wjk i j k ?
  • 10. LA/OPT Seminar 10 Three parts of this talk. 1. Basic study of datasets. How diverse are the types of datasets that we encounter? 2. Simplicial closure. How do we characterize the ways in which new simplices appear? 3. Higher-order link prediction. How do we use insights to predict which new simplices appear?
  • 11. Our datasets are timestamped simplices, where each simplex is a subset of nodes. LA/OPT Seminar 11 1. Co-authorship in different domains. 2. Emails with multiple recipients. 3. Tags on Q&A forums. 4. Threads on Q&A forums. 5. Contact/proximity measurements. 6. Musical artist collaboration. 7. Substance makeup and classification codes applied to drugs the FDA examines. 8. U.S. Congress committee memberships and bill sponsorship. 9. Combinations of drugs seen in patients in ER visits. https://math.stackexchange.com/q/80181
  • 12. Thinking of higher-order data as a weighted projected graph with “filled in” structures is a convenient viewpoint. LA/OPT Seminar 12 1 2 3 4 5 6 7 8 9 Data. Pictures to have in mind.
  • 13. LA/OPT Seminar 13 i j k i j k Warm-up. What’s more common? or “Open triangle” each pair has been in a simplex together but all 3 nodes have never been in the same simplex “Closed triangle” there is some simplex that contains all 3 nodes
  • 14. There is lots of variation in the fraction of triangles that are open, but datasets from the same domain are similar. LA/OPT Seminar 14
  • 15. Most open triangles do not come from asynchronous temporal behavior. LA/OPT Seminar 15 i j k In 61.1% to 97.4% of open triangles, all three pairs of edges have an overlapping period of activity. ⟶ there is an overlapping period of activity between all 3 edges (Helly’s theorem).
  • 16. A simple model can account for open triangle variation. LA/OPT Seminar 16 Consider a dataset with n nodes. Form a 3-node simplex between nodes i, j, k with prob. p = 1 / nb i.i.d. Thus, we always get ϴ(pn3) = ϴ(n3 - b) closed triangles in expectation. Proposition (rough). If b < 1, then we get ϴ(n3) open triangles in expectation for large n. If b > 1, then we get ϴ(n3(2-b)) open triangles in expectation for large n. ⟶ The number of open triangles grows faster for b < 3/2.
  • 17. A simple model can account for open triangle variation. LA/OPT Seminar 17 Consider a dataset with n nodes. Form a 3-node simplex between nodes i, j, k with prob. p = 1 / nb i.i.d. b = 0.8, 0.82, 0.84, ..., 1.8 Larger b is darker marker.
  • 18. Summary of structure of datasets. LA/OPT Seminar 18 1. Datasets with higher-order structure are common. 2. It’s useful to think of them as “projected graphs” with additional filled in structure (like a simplicial complex). 3. Perhaps surprisingly, several datasets have many open triangles. Not due to asynchronous temporal behavior. A simple model can give a range of open triangles.
  • 19. LA/OPT Seminar 19 Three parts of this talk. 1. Basic study of datasets. How diverse are the types of datasets that we encounter? 2. Simplicial closure. How do we characterize the ways in which new simplices appear? 3. Higher-order link prediction. How do we use insights to predict which new simplices appear?
  • 21. Simplicial closure is the appearance of a group of nodes appearing together in a simplex for the first time. LA/OPT Seminar 21 H IV prot ease inhibit ors UGT 1A 1 inhibit ors Breast cancer resist ance prot ein inhibit ors 1 2+ 2+ 1 2+ 1 1 2+ 2+ 1 2+ 2+ 2+ Reyat az RedPharm 2003 Reyat az Squibb & Sons 2003 K alet ra Physicians To- t al Care 2006 Promact a GSK (25mg) 2008 Promact a GSK (50mg) 2008 K alet ra D OH Cent ral Pharmacy 2009 Evot az Squibb & Sons 2015 Substances in marketed drugs recorded in the National Drug Code directory. icons colors 16.04 1 2+ 2+ 1 2+ 2+ H ow can I change t he icon colors, ap- pearance, et c. at t he t op panel? 2011 H ow do I change t he icon and t ext color? 2012 U bunt u 15.10 / 16.04 t heme doesn’t change 2016 U bunt u 16.04 Eclipse launcher icon problems 2016 Set deskt op icons background color K ubunt u 16.04 2016 Tags on askubuntu.com forum questions.
  • 22. Simplicial closure probability on 3 nodes largely depends on the weights in the projected network. LA/OPT Seminar 22 Take first 80% of the data (in time), record the configuration of every triple of nodes, and compute the fraction that simplicially close in the final 20% of the data. Increased edge density increases closure probability. Increased tie strength increases closure probability. Tension between edge density and tie strength. Left and middle observations are consistent with theory and empirical studies of social networks. [Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06]
  • 23. Simplicial closure probability on 4 nodes has similar behavior to those with 3 nodes, just “up one dimension”. LA/OPT Seminar 23 Take first 80% of the data (in time), record the configuration of every 4 nodes, and compute the fraction that simplicially close in the final 20% of the data. Increased edge density increases closure probability. Increased simplicial tie strength increases closure probability. Tension between simplicial density and simplicial tie strength.
  • 24. Induced count = c - 2 * # - 2 * # - # - 3 * # - 3 * # - 4 * # Computing simplicial closure probabilities requires some clever counting algorithms. LA/OPT Seminar 24 How many sets of 4 nodes induce the structure? 1 2+ Assume we can enumerate 3- and 4-cliques and quickly determine their “simplicial structure”. 1 2+1 1 2+ 1 2+ 2+ 1 1 1 2+ 1 1 2+ 2+ 1 2+ 2+ 2+
  • 25. Summary of simplicial closure. 1. It is useful to think of “trajectories” of sets of nodes in the projected graph. 2. More edges among a group of nodes ⟶ more likely to close in the future. True for both 3-node and 4-node groups! 3. Stronger ties among a group of nodes ⟶ also more likely to close. True for both 3-node and 4-node groups!
  • 26. Three parts of this talk. 1. Basic study of datasets. How diverse are the types of datasets that we encounter? 2. Simplicial closure. How do we characterize the ways in which new simplices appear? 3. Higher-order link prediction. How do we use insights to predict which new simplices appear?
  • 27. We propose “higher-order link prediction” as a similar framework for evaluation of higher-order models. Data. Observe simplices up to some time t. Using this data, we want to predict which groups of > 2 nodes will appear in a simplex in the future. [Figure: timeline of simplices on nodes 1 through 9 up to time t.] We predict structure that classical link prediction would not even consider!
  • 28. Our structural analysis tells us what we should be looking at for prediction. 1. Edge density is a positive indicator. ⟶ focus our attention on predicting “open triangles” 2. Tie strength is a positive indicator. ⟶ various ways of incorporating this information [Figure: open triangle on nodes i, j, k with edge weights Wij, Wjk, Wik.]
  • 29. For every open triangle, we compute a score s(i, j, k) on the first 80% of the data based on structural properties. There are four broad classes of score functions for an open triangle. s(i, j, k)… 1. is a function of Wij, Wjk, Wik 2. is a function of A[:, [i, j, k]] and the simplices containing these nodes 3. is built from “whole-network” similarity scores on edges 4. is “learned” from data. After computing scores, predict that the open triangles with the highest scores will simplicially close in the final 20% of the data.
  • 30. 1. s(i, j, k) is a function of Wij, Wjk, and Wik: 1. Arithmetic mean 2. Geometric mean 3. Harmonic mean 4. Generalized mean
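All four scores on this slide are instances of the generalized (power) mean with exponent p: p = 1 is the arithmetic mean, p → 0 the geometric mean, and p = −1 the harmonic mean. A minimal sketch, assuming strictly positive edge weights (the function name is chosen for this example):

```python
import math

def generalized_mean(weights, p):
    """Power mean M_p of strictly positive weights.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is treated as the limiting case, the geometric mean.
    """
    n = len(weights)
    if p == 0:
        return math.exp(sum(math.log(w) for w in weights) / n)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

# Score of an open triangle with edge weights W_ij, W_jk, W_ik.
w_ij, w_jk, w_ik = 1, 2, 4
harmonic = generalized_mean([w_ij, w_jk, w_ik], -1)
geometric = generalized_mean([w_ij, w_jk, w_ik], 0)
arithmetic = generalized_mean([w_ij, w_jk, w_ik], 1)
```

Because the power mean is nondecreasing in p, sweeping p between −1 and 0 interpolates between the harmonic and geometric means.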
  • 31. 2. s(i, j, k) is a function of A[:, [i, j, k]] and the related simplicial information: 1. Common 4th neighbors 2. Generalized Jaccard coefficient 3. Preferential attachment (graph) 4. Preferential attachment (simplices)
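Three of these local scores can be computed from neighbor sets alone. The slide does not spell out the formulas, so the definitions below are assumptions that extend the usual pairwise versions to triples: common neighbors as the size of the three-way intersection, Jaccard as intersection over union, and graph preferential attachment as the product of degrees. The function name and the input format (a dict of neighbor sets) are also choices made for this sketch.

```python
def local_triangle_scores(neighbors, i, j, k):
    """Local scores for the triple (i, j, k) from a dict {node: set_of_neighbors}.

    Assumed definitions (three-way analogues of the pairwise scores):
      common  = |N(i) & N(j) & N(k)|
      jaccard = |N(i) & N(j) & N(k)| / |N(i) | N(j) | N(k)|
      pref    = deg(i) * deg(j) * deg(k)
    """
    ni, nj, nk = neighbors[i], neighbors[j], neighbors[k]
    common = ni & nj & nk
    union = ni | nj | nk
    return {
        "common": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "pref": len(ni) * len(nj) * len(nk),
    }

# Small example: nodes 1, 2, 3 form a triangle and all neighbor node 4.
example = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
scores = local_triangle_scores(example, 1, 2, 3)
```

The simplicial variant of preferential attachment would replace graph degrees with the number of simplices containing each node; it is omitted here because it needs the simplex list rather than the projected graph.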
  • 32. 3. s(i, j, k) is built from “whole-network” similarity scores on edges: s(i, j, k) = Sij + Sji + Sjk + Skj + Sik + Ski. 1. PageRank (unweighted or weighted) 2. Katz (unweighted or weighted) 3. Fancier topological methods based on lifted random walks (work in progress…)
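As an illustration of the sum above, here is a sketch using Katz similarity, S = Σ_{l≥1} β^l A^l = (I − βA)^{-1} − I. Only the columns of S for nodes i, j, and k are needed, so the sketch computes one column at a time. It uses a plain fixed-point iteration, which is a stand-in chosen to keep the example dependency-free and is not the solver used in the talk; it converges only when β is below 1 over the spectral radius of A. The names `katz_column` and `katz_triangle_score` are hypothetical.

```python
def katz_column(adj, i, beta, iters=200):
    """Approximate column i of S = sum_{l>=1} beta^l A^l.

    adj: symmetric weighted adjacency as {node: {neighbor: weight}}.
    Iterates x <- beta * A * (e_i + x), starting from x = 0.
    """
    x = {}
    for _ in range(iters):
        y = dict(x)
        y[i] = y.get(i, 0.0) + 1.0            # y = e_i + x
        new = {}
        for u, val in y.items():
            for v, w in adj[u].items():       # A[v][u] = w by symmetry
                new[v] = new.get(v, 0.0) + beta * w * val
        x = new
    return x

def katz_triangle_score(adj, i, j, k, beta):
    """s(i, j, k) = S_ij + S_ji + S_jk + S_kj + S_ik + S_ki."""
    cols = {n: katz_column(adj, n, beta) for n in (i, j, k)}
    pairs = [(i, j), (j, i), (j, k), (k, j), (i, k), (k, i)]
    # S_ab is entry a of column b.
    return sum(cols[b].get(a, 0.0) for a, b in pairs)

# Unweighted triangle on nodes a, b, c with beta = 0.1 (spectral radius is 2).
tri = {n: {m: 1.0 for m in "abc" if m != n} for n in "abc"}
score = katz_triangle_score(tri, "a", "b", "c", beta=0.1)
```

For weighted Katz, the same code applies with projected-graph weights in `adj`, provided β is scaled so the iteration still converges.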
  • 33. Hopefully Michael chimed in on matrix inverses… We want to reduce computational cost (we only need scores on open triangles). For each node i that participates in an open triangle: 1. Solve [linear system shown on slide] using GMRES with low tolerance. 2. Store the ith column.
  • 34. 4. s(i, j, k) is learned from data. 1. Split data into training and validation sets. 2. Compute features of (i, j, k) from the previous ideas using the training data. 3. Throw features + validation labels into a machine learning blender → learn a model. 4. Re-compute features on combined training + validation data → apply the model to the data.
  • 36. A few lessons learned from applying all of these ideas. 1. We can predict pretty well on all datasets using some method. → 4x to 107x better than random w/r/t mean average precision, depending on the dataset/method 2. On some classes of datasets, we can consistently predict well. → thread co-participation and co-tagging on Stack Exchange 3. Simply averaging Wij, Wjk, and Wik consistently performs well. → something between the harmonic and geometric mean is consistently best
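To make "Nx better than random" concrete, here is a sketch of average precision for one ranked list of open triangles, compared against the fraction of positives, which is roughly what a random ordering achieves in expectation. This is a generic illustration of the metric, not the authors' evaluation code; the function name and the toy scores and labels are invented for the example.

```python
def average_precision(scores, labels):
    """Average precision of a ranking by descending score.

    scores: predicted score per open triangle.
    labels: 1 if the triangle simplicially closed in the test period, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank   # precision at this positive's rank
    return total / hits if hits else 0.0

toy_scores = [0.9, 0.8, 0.1, 0.05]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)
baseline = sum(toy_labels) / len(toy_labels)   # approximate score of a random ranking
improvement = ap / baseline
```

When only a tiny fraction of open triangles close, the baseline is small, which is how large ratios over random can arise.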
  • 37. Generalized means of edge weights are often good predictors of new 3-node simplices appearing. Good performance from this local information is a deviation from classical link prediction, where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg 07]. For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2. [Figure: open triangle on nodes i, j, k with edge weights Wij, Wjk, Wik, and the question of whether it closes.]
  • 38. Shameless plug for SIAM ALA 2018! David Gleich and I are organizing a minitutorial: Tensor Eigenvectors and Stochastic Processes, May 6, 10:45am – 12:45pm. Learn about how stochastics help us understand problems in numerical multilinear algebra! It’s a minitutorial, so little background is assumed. Lots of open problems for ICME students to solve.
  • 39. Simplicial closure and higher-order link prediction. Paper. arXiv 1802.06916 Code. bit.ly/sc-holp-code Data. bit.ly/sc-holp-data Slides. bit.ly/arb-laopt-2018 THANKS! http://cs.cornell.edu/~arb @austinbenson arb@cs.cornell.edu 1. We want higher-order models to be pervasive. 2. Higher-order link prediction is a way to compare models. 3. Have a better model or algorithm? Great! Show us on the data. 4. Lots of interesting structural information in these datasets.