k-Nearest Neighbors in Uncertain Graphs (Michalis Potamias, Francesco Bonchi, Aristides Gionis, George Kollios)

Thesis
• Many complex networks are modeled as
probabilistic (i.e., uncertain) graphs.

• The probabilistic treatment of such graphs leads
to better understanding of real data.

Nearest Neighbors in Uncertain Graphs @ VLDB 2010

Probabilistic Protein-Protein
Interaction Networks
Possible interactions between
proteins are established
through biological experiments
that entail uncertainty.
The edge probability
represents that uncertainty.
[Figure: example probabilistic graph on nodes A, B, C, D with edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7]

Source: Asthana et al., Genome Research 2004


• Neighbors of a given node in a standard graph?
– Nodes close in terms of shortest path distance!

• How do we define neighbors in probabilistic graphs?
• How do we define the distance?
– Treat them as weighted graphs (N06)
– Nodes with high reliability (GR04)
– Most probable path (BI03)
– …shortest paths? (VLDB10)

[Figure: example probabilistic graph on nodes A, B, C, D with edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7]

• Why is it important to find good neighbors of
proteins in PPI networks?
– Detection of candidate co-complex relationships.
– Actual co-complex relationships can be
established through experiments in the lab.


Outline
• Thesis
• Probabilistic PPI Networks
• Distance Definition
• Sampling Algorithms
• kNN Pruning
• Experiments



Distance Definition
[Figure: the example probabilistic graph on nodes A, B, C, D (edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7), shown next to a grid of its 32 possible worlds]

• Probability of one possible world, the one that keeps only edges (A,B) and (B,D):
$$\Pr(\text{world}) = p(A,B)\, p(B,D)\, (1 - p(B,C))\, (1 - p(C,D))\, (1 - p(A,D))$$

• PDF of the shortest path length d(B,D) over all possible worlds:
– d = 1: .3
– d = 2: .26
– d = inf: .44
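A minimal Python sketch of the world probability on this slide. The assignment of the five probabilities to node pairs is an assumption read off the figure layout and the formula: (A,B) = 0.2, (A,D) = 0.6, (B,C) = 0.4, (B,D) = 0.3, (C,D) = 0.7. The names `EDGES` and `world_probability` are illustrative only.

```python
# Edge probabilities of the four-node example.
# ASSUMPTION: which probability belongs to which node pair is inferred
# from the figure layout, not stated explicitly on the slide.
EDGES = {
    ("A", "B"): 0.2,
    ("A", "D"): 0.6,
    ("B", "C"): 0.4,
    ("B", "D"): 0.3,
    ("C", "D"): 0.7,
}


def world_probability(edges, present):
    """Probability of the world that keeps exactly the edges in `present`.

    Edges are independent: a kept edge contributes p, a dropped edge 1 - p.
    """
    prob = 1.0
    for edge, p in edges.items():
        prob *= p if edge in present else 1.0 - p
    return prob


# The world from the slide: only (A,B) and (B,D) survive.
world = {("A", "B"), ("B", "D")}
pr = world_probability(EDGES, world)
print(pr)
```

Under the assumed assignment this evaluates the product 0.2 · 0.3 · (1 − 0.4) · (1 − 0.7) · (1 − 0.6).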

Distance Definition
• Use well known statistics of the Shortest Path PDF:
– Median
– Majority (mode)
– ExpectedReliable
• infinity problem
• Hard! they require explicit enumeration of possible worlds: resort to sampling!

[Figure: PDF of shortest path length d(B,D), with .3 at 1, .26 at 2, .44 at inf; d med = 2, d maj = inf, d exp = 1.46]
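The three statistics can be checked on the four-node example by brute-force enumeration of all 2^5 worlds, which is exactly the approach that does not scale. This is a sketch under two assumptions: the edge assignment (A,B) = 0.2, (A,D) = 0.6, (B,C) = 0.4, (B,D) = 0.3, (C,D) = 0.7 is inferred from the figure, and ExpectedReliable is taken to be the expected distance over the worlds in which the two nodes are connected. Function names are illustrative.

```python
import math
from collections import deque
from itertools import product

# ASSUMPTION: edge-to-probability assignment inferred from the figure.
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def hop_distance(present, source, target):
    """BFS hop distance using only the edges in `present` (inf if disconnected)."""
    adj = {}
    for u, v in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf


def distance_pdf(edges, source, target):
    """Exact PDF of the shortest path length, enumerating every possible world."""
    pdf = {}
    edge_list = list(edges)
    for mask in product([False, True], repeat=len(edge_list)):
        prob = 1.0
        for e, keep in zip(edge_list, mask):
            prob *= edges[e] if keep else 1.0 - edges[e]
        present = [e for e, keep in zip(edge_list, mask) if keep]
        d = hop_distance(present, source, target)
        pdf[d] = pdf.get(d, 0.0) + prob
    return pdf


def median(pdf):
    """Smallest distance whose cumulative probability reaches 1/2."""
    cum = 0.0
    for d in sorted(pdf):
        cum += pdf[d]
        if cum >= 0.5:
            return d


def majority(pdf):
    """Most probable distance (the mode)."""
    return max(pdf, key=pdf.get)


def expected_reliable(pdf):
    """Expected distance conditioned on the nodes being connected."""
    finite = {d: p for d, p in pdf.items() if d != math.inf}
    return sum(d * p for d, p in finite.items()) / sum(finite.values())


pdf = distance_pdf(EDGES, "B", "D")
print(pdf, median(pdf), majority(pdf), round(expected_reliable(pdf), 2))
```

With the assumed assignment, the enumeration gives probabilities of about .3, .26 and .44 for distances 1, 2 and inf, and the statistics 2, inf and 1.46 quoted on the slide.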


Sampling Algorithms
1. sample (a small number of) worlds
2. compute sample median (approximation)
3. output result
– Median (Chernoff bound)
– ExpectedReliable (Hoeffding inequality)
– Majority (No bound)
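The three steps can be sketched in Python as follows. This is an illustration, not the authors' code: the four-node graph and its edge assignment are assumed from the earlier figure, BFS stands in for the shortest-path computation because edges are unweighted, and the number of worlds is picked arbitrarily rather than derived from the Chernoff bound.

```python
import math
import random
from collections import deque

# ASSUMPTION: edge-to-probability assignment inferred from the earlier figure.
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def sample_world(edges, rng):
    """Step 1: keep each edge independently with its probability."""
    return [e for e, p in edges.items() if rng.random() < p]


def hop_distance(present, source, target):
    """BFS hop distance in one sampled world (inf if disconnected)."""
    adj = {}
    for u, v in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf


def sampled_median(edges, source, target, num_worlds, seed=0):
    """Steps 2 and 3: median of the distances observed over the sampled worlds."""
    rng = random.Random(seed)
    dists = sorted(
        hop_distance(sample_world(edges, rng), source, target)
        for _ in range(num_worlds)
    )
    return dists[num_worlds // 2]


estimate = sampled_median(EDGES, "B", "D", num_worlds=4001)
print(estimate)
```

Infinite distances sort last, so the sample median stays well defined even when many sampled worlds disconnect the two nodes.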


Sampling Algorithms

• BIOMINE: database of biological entities and uncertain interactions from UHelsinki; 1M nodes, 10M edges
• FLICKR: users from flickr.com; edges have been created assuming homophily based on jaccard of flickr groups; 77K nodes, 20M edges



kNN Pruning
• Query: Given a probabilistic graph and a
source node, find the set of k nodes closest to
the source.

• Naïve algorithm:
1. sample worlds
2. run a Dijkstra traversal in each world and compute a PDF of the
shortest-path distance per node
3. calculate the median distance to all nodes using the
PDFs
4. compute the k-NN set
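A rough Python sketch of the naïve algorithm, under stated assumptions: the four-node graph and its edge assignment are inferred from the earlier figure, BFS replaces Dijkstra because every edge has unit length, and ties between equal medians are broken by node name. The names `naive_knn` and `bfs_all` are hypothetical.

```python
import math
import random
from collections import deque

# ASSUMPTION: edge-to-probability assignment inferred from the earlier figure.
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def bfs_all(present, source):
    """Hop distance from source to every node reachable in one world."""
    adj = {}
    for u, v in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def naive_knn(edges, source, k, num_worlds, seed=0):
    rng = random.Random(seed)
    nodes = {n for e in edges for n in e} - {source}
    samples = {n: [] for n in nodes}
    for _ in range(num_worlds):
        # 1. sample a full world
        world = [e for e, p in edges.items() if rng.random() < p]
        # 2. traverse it completely and record one distance per node
        dist = bfs_all(world, source)
        for n in nodes:
            samples[n].append(dist.get(n, math.inf))
    # 3. median distance per node from its sampled distances
    med = {n: sorted(ds)[num_worlds // 2] for n, ds in samples.items()}
    # 4. the k nodes with the smallest medians
    return sorted(nodes, key=lambda n: (med[n], n))[:k]


print(naive_knn(EDGES, "A", k=1, num_worlds=1001))
```

Every world is fully instantiated and fully traversed here, whatever the value of k, which is the cost that pruning tries to avoid.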

kNN Pruning naive
1nn - median
node: A
sample: 5 worlds
[Figure: animation of the naive algorithm on a seven-node example graph (nodes A to G, edge probabilities 0.5, 0.6, 0.8, 0.3, 0.9, 0.3, 0.7, 0.4): five worlds are sampled one by one and a distance histogram is filled in for each of B, C, D, E, F, G]


kNN Pruning
1nn - median
node: A
sample: 5 worlds
• algorithm
– sample worlds on the fly
– increase the horizon of each dijkstra one hop at a
time
– maintain truncated pdf histograms
[Figure: the seven-node example graph (nodes A to G, edge probabilities 0.5, 0.6, 0.8, 0.3, 0.9, 0.3, 0.7, 0.4)]


kNN Pruning
1nn - median
node: A
sample: 5 worlds
[Figure: animation of the pruned algorithm on the same example graph: the sampled worlds are expanded one hop from A, discovering only B and C, and histograms are kept for those two nodes]
• B has distance 1
• C has distance greater than 1
• D, E, F, G, … were not discovered (d>1)
• 1NN set is complete with B: no need to continue
• just 2 nodes visited (and 2 histograms maintained)
• worlds were only partially instantiated
• same answer as the naive
• with a small cost: dijkstra state needs to be
maintained in memory for all worlds
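The pruning idea can be sketched in Python as below. This is one possible reading of the slides, not the authors' implementation. Assumptions: edges have unit length, so a one-hop frontier expansion plays the role of the bounded Dijkstra; a per-node count of worlds reached so far stands in for the truncated histogram; and the hypothetical helper `edge_present` derives each edge's state from a hash of (seed, world, edge), so that a world is well defined even though only the touched edges are ever instantiated.

```python
import random


def edge_present(seed, world, u, v, p):
    """Lazily decide whether edge {u, v} exists in the given world.

    The outcome depends only on (seed, world, edge), never on visit order.
    """
    a, b = sorted((u, v))
    return random.Random(f"{seed}|{world}|{a}|{b}").random() < p


def pruned_knn(edges, source, k, num_worlds, seed=0):
    adj = {}
    for (u, v), p in edges.items():
        adj.setdefault(u, []).append((v, p))
        adj.setdefault(v, []).append((u, p))
    # Traversal state is kept in memory for every world.
    visited = [{source} for _ in range(num_worlds)]
    frontier = [[source] for _ in range(num_worlds)]
    reached = {}  # node -> number of worlds with distance <= current horizon
    result = []
    horizon = 0
    while len(result) < k and any(frontier):
        horizon += 1
        for w in range(num_worlds):  # grow every world by exactly one hop
            nxt = []
            for u in frontier[w]:
                for v, p in adj.get(u, []):
                    if v not in visited[w] and edge_present(seed, w, u, v, p):
                        visited[w].add(v)
                        nxt.append(v)
                        reached[v] = reached.get(v, 0) + 1
            frontier[w] = nxt
        # Reached in more than half of the worlds: the median equals this
        # horizon, since the node did not qualify at any smaller horizon.
        done = set(result)
        result.extend(sorted(n for n, c in reached.items()
                             if n not in done and 2 * c > num_worlds))
    return result[:k], reached


# Hypothetical chain with certain edges: the search never gets past "b".
CHAIN = {("s", "a"): 1.0, ("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
neighbors, seen = pruned_knn(CHAIN, "s", k=2, num_worlds=5)
print(neighbors, sorted(seen))
```

The loop stops as soon as k nodes have a settled median, so nodes beyond the current horizon are never discovered and their edges are never sampled. The price is the per-world `visited` and `frontier` state held in memory.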

kNN Pruning
for 200 worlds and 5NN the speedups were:
247x (BIOMINE), 111x (FLICKR), 269x (DBLP)

• BIOMINE: database of biological entities and uncertain interactions from UHelsinki; 1M nodes, 10M edges
• FLICKR: users from flickr.com; edges have been created assuming homophily based on jaccard of flickr groups; 77K nodes, 20M edges
• DBLP: authors from dblp; probabilities have been assigned based on number of coauthored papers; 226K nodes, 1.4M edges


Less uncertainty, more pruning
• boost probabilities of edges by giving each edge d chances: $p \mapsto 1 - (1 - p)^d$
• d=1: original graph
• increasing d, p goes to 1
[Figure: the example graph before and after boosting; edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7 become 1-0.8^d, 1-0.4^d, 1-0.6^d, 1-0.7^d, 1-0.3^d]
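A small Python sketch of the boosting step, assuming "d chances" means d independent trials per edge, which matches the labels 1-0.8^d, 1-0.4^d and so on in the figure. The function name `boost` is illustrative.

```python
def boost(p, d):
    """Probability that an edge shows up in at least one of d independent chances."""
    return 1.0 - (1.0 - p) ** d


# The five probabilities of the example graph, boosted for a few values of d.
for d in (1, 2, 5, 20):
    print(d, [round(boost(p, d), 3) for p in (0.2, 0.6, 0.4, 0.3, 0.7)])
```

For d = 1 the probabilities are unchanged, and they approach 1 as d grows, which is the "less uncertainty" regime named in the slide title.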



Experiments
• Dataset
– Probabilistic PPI network
[Krogan et al, Nature 06]
– Protein co-complex
relationships (ground truth)
[Mewes et al, Nuc Acids Res 04]

• Experiment
– Choose a ground truth edge
(A,B)
– Choose a node C s.t. there is
no ground truth edge (A,C)
– Classification task: Distinguish
between the two types of
edges: (A,B) and (A,C)


Conclusion
• Probabilistic graph analysis benefits from
possible-world semantics.

– Extended standard graph concepts to
probabilistic graphs and designed
approximation algorithms to compute them
– Introduced novel pruning algorithms for kNN
in probabilistic graphs
– Confirmed the efficacy of our framework on
real data.


Future Work
• Enrich model
– Node probabilities
– Arbitrary PDFs
• Explore random walks further


Thank you!

?

