DAOC: Stable Clustering of Large Networks
Artem Lutov, Mourad Khayati and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
https://github.com/eXascaleInfolab/daoc
https://bit.ly/daoc-slides
Stability includes both robustness and determinism in our paper:
● Robustness means that the formed clusters should evolve gracefully,
without any surges, under minor perturbations (i.e., small changes
in the links or nodes) of the input network (i.e., graph).
● Determinism means the results are both non-stochastic and
independent of the input order (i.e., resistant to reshuffling).
Stable (i.e., robust and deterministic) clustering of large networks:
DAOC - Deterministic and Agglomerative Overlapping Clustering
● Mutual Maximal Gain, to ensure robustness while being capable of
identifying micro-scale clusters
● Overlap Decomposition, to identify fine-grained clusters in a
deterministic way while capturing multiple optima, even for
algorithms whose optimization function does not support overlaps
Human perception-adapted Taxonomy construction
for large Evolving Networks by Incremental Clustering
Taxonomy requirements:
● Stable
● Fully-automatic
● Browsable
● Large
● Multi-viewpoint
● Narrow (7 ± 2 rule)
Corresponding clustering-algorithm properties:
● Robust + Deterministic
● Parameter-free
● Hierarchical
● Near-linear runtime
● Overlapping
● Fine-grained
The same taxonomy requirements and clustering properties are then contrasted with Louvain (figure).
Modularity:
Modularity gain:
Mutual Maximal (⬦) Gain:
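The formulas on this slide appear as images in the deck. For reference, the standard (Newman) modularity and the Louvain modularity gain that the slide builds on are, in LaTeX notation:

    Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)

    \Delta Q = \left[\frac{\Sigma_{in} + 2 k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]

where A is the (weighted) adjacency matrix, m the total link weight, k_i the weighted degree of node i, Σ_in and Σ_tot the internal and total incident link weights of the candidate cluster, k_{i,in} the weight of the links from i to that cluster, and δ the Kronecker delta. The Mutual Maximal (⬦) Gain used by DAOC additionally requires the gain to be maximal for both merge candidates; its exact definition is given in the paper.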
Decomposition of a node of degree d=3 into K=3 fragments (figure).
OD constraints: (formulas shown on the slide, not captured in this transcript)
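Purely as an illustration of what such a decomposition looks like (this equal split is an assumption for the sketch, not the paper's actual constraint set): a node v shared by K clusters can be represented by K fragments v_1..v_K, each carrying the node's links with the weights divided by K, so that the total link weight is preserved:

    w(v_k, u) = \frac{w(v, u)}{K}, \qquad \sum_{k=1}^{K} w(v_k, u) = w(v, u)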
The following algorithms are evaluated on synthetic networks (multiple instances of each category) and
real-world networks using the open-source benchmarking framework Clubmark:
* the feature is partially available; parameter tuning might be required for specific cases
◦ the feature is not supported by the original implementation of the algorithm
F1h (average value and deviation) for subsequent perturbations (link
removals) of a synthetic network (figure). Stable algorithms (among the
non-stochastic and ensemble ones) are outlined with a bold line and are
expected to show a gracefully decreasing F1h without surges.
DAOC is a novel clustering algorithm designed for stable (both robust
and deterministic) clustering of large networks, aiming to construct
human perception-adapted taxonomies without any manual tuning.
● DAOC is 25% more accurate on average than state-of-the-art
non-stochastic clustering algorithms, while being on par with the most
accurate existing (including stochastic) clustering algorithms
● DAOC is one of the least memory-consuming and fastest state-of-
the-art clustering algorithms, being applicable to large networks
Artem Lutov <artem.lutov@unifr.ch>
https://github.com/eXascaleInfolab/daoc
Figure: Overlapping Clusters vs. Clusters on Various Resolutions (labels: Cars, Racing cars, Blue cars, Racing & blue cars, Jeeps, Bikes).
Matching the clusterings (unordered sets of elements), even when the
elements have a single membership, may yield multiple best matches
(figure: produced clusters vs. ground truth; does the Yellow category best
match the Dark or the Cyan cluster?):
=> Strict labeling of clusters is not always possible and is undesirable.
Many dedicated accuracy metrics have been designed, but few of them are
applicable to elements with multiple memberships.
Requirements for the accuracy metrics:
● Applicable to elements having multiple memberships
● Applicable to large datasets: ideally O(N) runtime, up to O(N²)
Families of accuracy metrics satisfying our requirements:
● Pair Counting Based Metrics: Omega Index [Collins, 1988]
● Cluster Matching Based Metrics: Average F1 score [Yang, 2013]
● Information Theory Based Metrics: Generalized NMI [Esquivel, 2012]
Problem: interpretability of the accuracy values and selection of the metric.
The Omega Index (Ω) counts the pairs of elements occurring in exactly
the same number of clusters as categories, adjusted for the expected
number of such pairs:

    \Omega = \frac{\Omega_u - \Omega_e}{1 - \Omega_e}, \quad
    \Omega_u = \frac{1}{P}\sum_j |t_j(C) \cap t_j(C')|, \quad
    \Omega_e = \frac{1}{P^2}\sum_j |t_j(C)|\,|t_j(C')|

where P is the number of element pairs, t_j(·) is the set of pairs
co-occurring in exactly j clusters, C' is the ground truth (categories),
and C is the produced clustering.
The Soft Omega Index takes into account pairs present in a different
number of clusters by normalizing the smaller number of occurrences of
each pair of elements in all clusters of one clustering by the larger
number of occurrences in the other clustering.
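A minimal Python sketch of both variants, assuming each clustering is given as a dict mapping an element to the set of ids of the clusters containing it; the chance-adjustment term of the soft variant is kept the same as in the hard one here, and pairs shared by no cluster in either clustering are counted as agreeing. This is only an illustration, not the benchmarked implementation:

    from itertools import combinations

    def omega_index(c1, c2, soft=False):
        """Omega Index of two (possibly overlapping) clusterings.

        c1, c2: dict element -> set of ids of the clusters containing it.
        soft=True normalizes the smaller co-occurrence count of each pair
        by the larger one instead of requiring exactly equal counts.
        """
        pairs = list(combinations(sorted(set(c1) | set(c2)), 2))
        npairs = len(pairs)
        # Number of clusters sharing each pair, in each clustering
        t1 = {p: len(c1.get(p[0], set()) & c1.get(p[1], set())) for p in pairs}
        t2 = {p: len(c2.get(p[0], set()) & c2.get(p[1], set())) for p in pairs}
        if soft:
            obs = sum(1.0 if max(t1[p], t2[p]) == 0
                      else min(t1[p], t2[p]) / max(t1[p], t2[p])
                      for p in pairs) / npairs
        else:
            obs = sum(t1[p] == t2[p] for p in pairs) / npairs
        # Expected (chance) agreement on the number of shared clusters
        maxj = max(max(t1.values()), max(t2.values()))
        exp = sum((sum(v == j for v in t1.values()) / npairs) *
                  (sum(v == j for v in t2.values()) / npairs)
                  for j in range(maxj + 1))
        return (obs - exp) / (1 - exp) if exp < 1 else 1.0

    # e.g. omega_index({'a': {0}, 'b': {0, 1}, 'c': {1}}, {'a': {0}, 'b': {0}, 'c': {1}})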
F1a is defined as the average of the weighted F1 scores of a) the best-
matching ground-truth clusters to the formed clusters and b) the best-
matching formed clusters to the ground-truth clusters:

    F1a = \frac{F1_{C',C} + F1_{C,C'}}{2}

where F1 is the F1-measure [Rijsbergen, 1974].
F1h uses the harmonic instead of the arithmetic mean to address F1a ≳ 0.5
for the clusters produced from all combinations of the nodes
(F1_{C',C} = 1, since for each category there exists an exactly matching
cluster, while F1_{C,C'} → 0, since the majority of the clusters have low
similarity to the categories):

    F1h = \frac{2\, F1_{C',C}\, F1_{C,C'}}{F1_{C',C} + F1_{C,C'}}

where the per-cluster F1 scores are weighted by the contribution m of the nodes.
F1p is the harmonic mean of the average, over each clustering, of the
best local probabilities (f1 replaced by the partial probability, pprob) for each cluster.
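A compact Python sketch of the MF1 family, assuming each clustering is a list of member sets; an unweighted average is used for brevity (the slide's variants weight clusters by their contribution), so this illustrates the structure of F1a/F1h rather than reproducing the exact metric:

    def f1(c, g):
        """F1 score between a produced cluster c and a category g (both sets of nodes)."""
        tp = len(c & g)
        if tp == 0:
            return 0.0
        prec, rec = tp / len(c), tp / len(g)
        return 2 * prec * rec / (prec + rec)

    def best_match_avg(cs1, cs2):
        """Average over cs1 of the best F1 against any cluster of cs2."""
        return sum(max(f1(c, g) for g in cs2) for c in cs1) / len(cs1)

    def mf1(produced, categories, mean='harmonic'):
        a = best_match_avg(categories, produced)   # F1_{C',C}
        b = best_match_avg(produced, categories)   # F1_{C,C'}
        if mean == 'arithmetic':                   # F1a
            return (a + b) / 2
        return 2 * a * b / (a + b) if a + b else 0.0   # F1h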
Purpose: reduce the evaluation cost from O(N(|C'| + |C|)) toward O(N).
Per-cluster state (the figure shows a ground-truth cluster g2 of C' against produced clusters c1, c3 of C sharing a node a):
Cluster
  mbs      # member nodes, const
  cont     # members' contribution, const
  counter  # contributions counter
Counter
  orig     # originating (ground-truth) cluster
  ctr      # raw counter, <= mbs
Matching pass over each ground-truth cluster g2 and each produced cluster containing one of its member nodes a:
for a in g2.mbs:
    for c in cls(C.a):                        # clusters of C containing node a
        cc = c.counter
        if cc.orig != g2:                     # lazily reset the counter for a new g2
            cc.ctr = 0
            cc.orig = g2
        cc.ctr += 1 / |C.a| if ovp else 1     # overlap-aware contribution of a
        fmatch(cc.ctr, c.cont, g2.cont)       # update the best match for g2
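A self-contained Python sketch of the counting scheme above, with clusterings given as lists of member sets (names and structure are illustrative, not the xmeasures code). Each produced cluster keeps a single counter tagged with the ground-truth cluster that last touched it, so the best matches are found without materializing a |C| x |C'| contingency matrix:

    def best_matches(produced, groundtruth):
        """Return, for each ground-truth cluster, (best F1, index of the best produced cluster)."""
        node_cls = {}                              # node -> indices of produced clusters containing it
        for ic, c in enumerate(produced):
            for node in c:
                node_cls.setdefault(node, []).append(ic)
        counters = [[None, 0] for _ in produced]   # [originating gt cluster, raw count <= |cluster|]
        results = []
        for ig, g in enumerate(groundtruth):
            best_f1, best_ic = 0.0, None
            for node in g:
                for ic in node_cls.get(node, ()):
                    ctr = counters[ic]
                    if ctr[0] != ig:               # lazily reset the counter for a new gt cluster
                        ctr[0], ctr[1] = ig, 0
                    ctr[1] += 1                    # one more shared node between produced[ic] and g
                    tp = ctr[1]
                    prec, rec = tp / len(produced[ic]), tp / len(g)
                    score = 2 * prec * rec / (prec + rec)
                    if score > best_f1:
                        best_f1, best_ic = score, ic
            results.append((best_f1, best_ic))
        return results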
SNAP DBLP (Nodes: 317,080; Edges: 1,049,866; Clusters: 13,477):
ground truth vs. the clustering produced by Louvain.
Evaluation on an Intel Xeon E5-2620 (32 logical CPUs) @ 2.10 GHz;
applications compiled with GCC 5.4 using the -O3 flag.
NMI is the Mutual Information I(C':C) normalized by the maximum or the mean
of the unconditional entropies H of the clusterings C' and C:

    I(C':C) = \sum_{c' \in C'} \sum_{c \in C} p(c', c) \log \frac{p(c', c)}{p(c')\, p(c)}

    NMI = \frac{I(C':C)}{\max(H(C'), H(C))} \quad \text{or} \quad NMI = \frac{I(C':C)}{(H(C') + H(C))/2}

GNMI [Esquivel, 2012] uses a stochastic process to compute the MI.
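For hard partitions, NMI can be computed directly; a short Python sketch (labels per element, natural log), shown only to make the definition concrete; the GNMI generalization to overlapping clusters is the stochastic procedure mentioned above:

    from collections import Counter
    from math import log

    def nmi(labels1, labels2, norm='max'):
        """NMI of two hard partitions given as equal-length lists of cluster labels."""
        n = len(labels1)
        c1, c2, c12 = Counter(labels1), Counter(labels2), Counter(zip(labels1, labels2))
        mi = sum(nab / n * log((nab / n) / ((c1[a] / n) * (c2[b] / n)))
                 for (a, b), nab in c12.items())
        h1 = -sum(v / n * log(v / n) for v in c1.values())
        h2 = -sum(v / n * log(v / n) for v in c2.values())
        denom = max(h1, h2) if norm == 'max' else (h1 + h2) / 2
        return mi / denom if denom else 1.0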
(Soft) Ω:
  Cons: O(N²) runtime; performs poorly for multi-resolution clusterings.
  Pros: values are not affected by the number of clusters.
MF1:
  Cons: evaluates the best-matching clusters only (unfair advantage for the larger clusters).
  Pros: O(N) runtime; F1p satisfies more formal constraints than the others.
GNMI:
  Cons: biased toward the number of clusters; non-deterministic results; convergence is not guaranteed in the stochastic implementation.
  Pros: highly parallelized; evaluates full matches; well-grounded theoretically.