(1 min summary + main result)
• Clustering is essential to data analytics
• Practitioners (Data Scientists, Domain Experts) pick a clustering technique
to explore their specific domain dataset.
• Researchers design clustering techniques and rank them on benchmark
datasets representative of an application domain to help practitioners choose
the most suitable technique.
• We question the validity of benchmark datasets used for clustering
validation.
• We propose an axiomatic approach and its practical implementation to
evaluate and rank benchmark datasets for clustering evaluation.
• We show that many benchmark datasets are of low quality, which has drastic
consequences when they are used to rank clustering techniques.
• We discuss future usage of our approach to explore how concepts cluster in
the representation spaces of GenAI foundation models
Next page to get the main result
H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park and J. Seo, "Measuring the Validity of Clustering Validation Datasets,"
in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3548011
https://ieeexplore.ieee.org/document/10909451
Ranking of clustering techniques
strongly depends on cluster label
matching of benchmark datasets
Researchers should use top-tier benchmark
datasets to compare clustering techniques.
Next page to get a 15 min summary
• Ranking clustering techniques using only the benchmark datasets that are
top-tier in cluster label matching (as evaluated with the proposed IVMA)
leads to drastically different rankings than using all the datasets or only
the bottom-tier ones.
(a) shows the ranking stability when a researcher picks 10 benchmark
datasets at random from the top-tier, entire, or bottom-tier sets.
(b) shows the ranking obtained using 100% of each of these subsets.
Clustering is an essential tool
for practitioners
• Health: Discover new drugs against cancer
• E-Health: Discover activity patterns to fight obesity/diabetes
• Cybersecurity: Discover a new type of attack
• Education: Segment pupils by skillsets
• Materials science: Discover new materials to capture CO2
• Marketing: Segment customer profiles or new markets
• GenAI: Explore internal knowledge for explainability/alignment
…
(15-minute summary)
Clustering basics
• Data instances are tuples of features
• Patients characterized by their biometrics, genetics…
• Customers characterized by their demographics, transactions…
• Materials characterized by their chemical components, topology…
• Cyber attacks characterized by their IPs, process & network events…
• A similarity/distance/divergence measure tells how similar two
(sets of) instances are based on their feature values
• Euclidean, Cosine, Edit, Kullback-Leibler, Wasserstein…
• A clustering technique assigns labels to instances so that
instances with the same label are more similar to each other
than to instances with different labels
• Cluster labels have no semantics (they are arbitrary identifiers)
• The number of clusters is unknown
• K-Means, Gaussian Mixture, Spectral, Agglomerative…
[Figure: a toy 2D dataset partitioned into Cluster 1 and Cluster 2]
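A minimal sketch of these ingredients (illustrative only; scikit-learn and SciPy, with a made-up toy dataset):

```python
# Minimal sketch of the basics above (illustrative; not from the paper).
import numpy as np
from scipy.spatial.distance import euclidean, cosine
from sklearn.cluster import KMeans

# Instances as tuples of features (e.g., two biometric measurements).
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.5], [7.8, 8.1]])

# Similarity/distance measures compare instances by their feature values.
print(euclidean(X[0], X[1]), cosine(X[0], X[2]))

# A clustering technique assigns labels; the labels are arbitrary
# identifiers (no semantics) and the "right" number of clusters is
# unknown a priori -- here we simply ask for 2.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g., [0 0 1 1]
```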
Good and bad clusterings
Labels are assigned to instances by the clustering technique
based on the relative positions of the instances in the feature space.
[Figure annotations: two clusters are adjacent; two clusters are merged/overlapping; one cluster is split]
(A) Labels match well with clusters
(B) Labels do not match with clusters
Internal validation measures (IVM) like Silhouette, Davies-Bouldin…
are used to quantify this cluster-label matching (CLM)
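A toy illustration of CLM (ours, with scikit-learn): the same data scored with well-matching and with shuffled labels.

```python
# Toy CLM illustration: an IVM scores how well given labels match the data
# geometry, without needing any clustering algorithm or ground truth.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, good_labels = make_blobs(n_samples=300, centers=3, random_state=0)
bad_labels = np.random.default_rng(0).permutation(good_labels)  # bad CLM

print(silhouette_score(X, good_labels))  # high: labels match clusters (A)
print(silhouette_score(X, bad_labels))   # near 0 or negative: poor CLM (B)
```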
Practitioners explore specific data
• They seek the best cluster labels for their specific dataset
• Which clustering technique do they choose?
• (In theory) Effectiveness validated on benchmark datasets in published
research
• (In practice) Availability of a Python / R… package
Researchers design new techniques
• They seek the clustering technique that gives the best
cluster labels across multiple datasets representative of
domain data
• They also look for clustering techniques that
• can handle specific data types (text, images, speech, genes…)
• are faster
• are scalable to a large number of features and instances
• are easier to use/have fewer hand-tuned parameters
• …
• How do they validate clustering techniques?
• Using benchmark datasets with ground-truth labels
representative of domain data
We question the validity of clustering
validation benchmark datasets
How does clustering validation work?
• Practitioners
• Consider the domain dataset D to explore and analyze
• Pick a clustering technique C
• Pick a set of parameter values k, typically the number of clusters
• Run clustering C_k to get cluster labels L = C_k(D)
• Compute an internal validation measure (IVM) for each
clustering result: IVM(D, L)
• Silhouette, Xie-Beni, Davies-Bouldin, Calinski-Harabasz…
• Select the clustering L*_C = argmax_k IVM(D, C_k(D)) that maximizes
the IVM over all C_k
• Proceed with the analysis and report their results to their clients
• Researchers
• Pick a benchmark dataset D_i with ground-truth labels L_GT^i
• Pick a clustering technique C_j
• Get the optimal clustering L*_Cj of D_i as a practitioner would
• Compute an external validation measure (EVM) to compare the
optimal labels L*_Cj with the ground truth L_GT^i: EVM(L*_Cj, L_GT^i)
• Adjusted Rand Index, Adjusted Mutual Information…
• Rank clustering techniques C_1, C_2… by their EVM
aggregated across all benchmark datasets D_1, D_2…
• Publish their results to reach practitioners and other researchers
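The two workflows above, as a minimal runnable sketch (illustrative only: scikit-learn, a toy dataset, and two techniques stand in for the paper's benchmarks and technique pool):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

D, L_gt = load_iris(return_X_y=True)  # benchmark dataset D_i with labels L_GT^i

# Practitioner loop: keep the clustering that maximizes an IVM (Silhouette).
def best_clustering(D, technique, k_range):
    best_score, best_labels = -np.inf, None
    for k in k_range:
        labels = technique(n_clusters=k).fit_predict(D)
        score = silhouette_score(D, labels)  # IVM(D, L)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # L*_C

# Researcher loop: compare the IVM-optimal labels to the ground truth with
# an EVM (Adjusted Rand Index); aggregating over datasets gives the ranking.
for name, technique in [("k-means", KMeans),
                        ("agglomerative", AgglomerativeClustering)]:
    L_star = best_clustering(D, technique, k_range=range(2, 9))
    print(name, "EVM (ARI):", adjusted_rand_score(L_star, L_gt))
```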
Why must clustering benchmark datasets be validated?
• If ground-truth labels L_GT (A) match well with data clusters (good CLM)
• Then EVM (D,G) correlates with clustering L* quality (C,F)
• Good clustering (C) gets high EVM (D)
• Bad clustering (F) gets low EVM (G)
EVM is a reliable measure of clustering quality. All good!
• If ground-truth labels L_GT (B) do not match with data clusters (bad CLM)
• Then EVM (E,H) does not correlate with L* quality (C,F)
• Good clustering (C) gets low EVM (E)
• Bad clustering (F) gets low EVM (H)
EVM is unreliable and can lead to a pessimistic evaluation
of clustering techniques!
How to evaluate the quality of
benchmark datasets?
• How can we measure the cluster-label matching (CLM) of the ground-truth
labels? Can we use internal validation measures (IVMs)?
• No, standard IVMs are designed to compare different clusterings of
the same specific dataset (same instances, same features (A,B)).
• We need new IVMs to compare CLM across different datasets (I,J,K).
• We state that such adjusted IVMs (IVMA) must be invariant to whatever
can change across multiple benchmark datasets (I,J,K) yet is irrelevant
to quantifying the CLM. These are the across-datasets axioms:
• Axiom A1: Invariance to the number of data instances (data size)
• Axiom A2: Invariance to the number of features (data dimension)
• Axiom A3: Invariance to the number of labels (in L_GT)
• Axiom A4: Invariance to the range of IVMA values
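As a reading aid, one way these invariances could be written formally (our notation, only a sketch; the paper's exact definitions are in its Sections 3 and 4):

```latex
% Sketch only (our notation). Let T range over hypothetical CLM-preserving
% transformations that change only the data size (A1), the dimension (A2),
% or the number of ground-truth labels (A3):
\forall T \in \{T_{\mathrm{size}}, T_{\mathrm{dim}}, T_{\mathrm{label}}\} :
\quad \mathrm{IVM}_A\bigl(T(D), T(L_{GT})\bigr) = \mathrm{IVM}_A\bigl(D, L_{GT}\bigr)
% Axiom A4 (fixed range): \mathrm{IVM}_A(D, L_{GT}) \in [0, 1]
% for every dataset (D, L_{GT}).
```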
How to implement these axioms?
• We propose formal definitions for these axioms
• We propose protocols to adjust existing IVMs so that they satisfy the
across-datasets axioms (a generic sketch follows below)
• More details in sections 3 and 4
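To make the idea concrete, here is a generic chance-adjustment sketch in the spirit of the Adjusted Rand Index (our illustration; the paper's actual adjustment protocols differ in their details):

```python
# Illustration only: calibrate an IVM by its chance baseline under random
# label permutations, a first step toward scores comparable across datasets.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def chance_adjusted_ivm(D, labels, ivm=calinski_harabasz_score,
                        n_perm=50, seed=0):
    rng = np.random.default_rng(seed)
    observed = ivm(D, labels)
    # Expected IVM when the labels carry no cluster structure at all:
    chance = np.mean([ivm(D, rng.permutation(labels)) for _ in range(n_perm)])
    # Shift by the chance baseline; a range normalization (Axiom A4)
    # would still be needed on top of this.
    return observed - chance
```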
• Now, let’s see the experimental results…
IVMA are more accurate at evaluating
cluster label matching (1)
• For each of 96 benchmark datasets, we compare IVMs, adjusted IVMs
(IVMA), and supervised classifiers based on the data labels L_GT
against a ground-truth clustering L* obtained from an ensemble of
9 clustering techniques.
• Classifiers (as expected) cannot tell apart good from bad clusterings.
• Adjusted IVMs (IVMA) of the data labels L_GT match best with the
ground-truth clustering L*.
IVMA are more accurate at evaluating
cluster label matching (2)
• Adjusted IVMA (b) correlate better with the ground-truth ensemble
clustering than base IVMs (a)
[Scatterplots: (a) base IVM vs. ensemble clustering; (b) adjusted IVMA vs. ensemble clustering]
Ranking of clustering techniques
strongly depends on cluster label
matching of benchmark datasets
Researchers should use IVMA-based top-tier benchmark datasets
to compare clustering techniques.
• Ranking clustering techniques using only the benchmark datasets that are
top-tier in cluster label matching (as evaluated with the proposed IVMA)
leads to drastically different rankings than using all the datasets or only
the bottom-tier ones.
(a) shows the ranking stability when a researcher picks 10 benchmark
datasets at random from the top-tier, entire, or bottom-tier sets.
(b) shows the ranking obtained using 100% of each of these subsets.
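A small sketch of how such ranking disagreement can be quantified (our choice of statistic, not necessarily the paper's; the EVM values below are made-up placeholders):

```python
# Compare technique rankings obtained from different benchmark tiers with
# Kendall's tau; low/negative tau means the tier choice changes the ranking.
import numpy as np
from scipy.stats import kendalltau

def mean_evm_ranking(evm):  # evm: {technique: [EVM score per dataset]}
    return sorted(evm, key=lambda t: -np.mean(evm[t]))

top_tier    = {"kmeans": [0.80, 0.70], "spectral": [0.90, 0.85], "agglo": [0.60, 0.65]}
bottom_tier = {"kmeans": [0.10, 0.20], "spectral": [0.05, 0.10], "agglo": [0.30, 0.20]}

r_top, r_bot = mean_evm_ranking(top_tier), mean_evm_ranking(bottom_tier)
techniques = sorted(top_tier)
tau, _ = kendalltau([r_top.index(t) for t in techniques],
                    [r_bot.index(t) for t in techniques])
print(r_top, r_bot, tau)  # disagreeing rankings give low/negative tau
```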
IVMA are fast to compute
• IVMA are as fast as IVMs
• IVMA are 4 orders of magnitude faster than the clustering ensemble
What’s Next
• In GenAI foundation models, semantically related facts or tokens should
form clusters in different embedding spaces. Verifying and tracking these
clusters could help explain and edit models' responses.
• However, these embedding spaces may change dimension across layers, and
the sample size and the labels categorizing sets of facts or tokens may
vary across models or analytic phases.
• Because adjusted IVMA are fast and accurate, and robust to changes in
dimension, sample size, and labels, they lay the groundwork for new
approaches that evaluate how labels of interest form clusters across
layers and model instances during pre- and post-training of GenAI models
(a hedged sketch follows).
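One hypothetical shape such an analysis could take (ours, not from the paper; reuses the chance_adjusted_ivm sketch above, and assumes embeddings and concept labels were extracted with separate tooling):

```python
# Hedged sketch of the proposed usage: score how well concept labels
# cluster in each layer's embedding space of a foundation model.
# `layer_embeddings` maps layer names to (n_items, dim) arrays and
# `concept_labels` is a length-n_items array; both are assumed given.
# `clm_score` can be any adjusted IVM, e.g., chance_adjusted_ivm above.
import numpy as np

def clm_across_layers(layer_embeddings, concept_labels, clm_score):
    """Return {layer_name: CLM score of the concept labels in that layer}."""
    return {name: clm_score(E, np.asarray(concept_labels))
            for name, E in layer_embeddings.items()}

# Layers where the score drops would be layers where the concept structure
# dissolves; tracking checkpoints during training follows the same pattern.
```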
Thank you for reading up here!
H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park and J. Seo,
"Measuring the Validity of Clustering Validation Datasets,"
in IEEE Transactions on Pattern Analysis and Machine
Intelligence, doi: 10.1109/TPAMI.2025.3548011
https://ieeexplore.ieee.org/document/10909451
@ARTICLE{10909451,
author={Jeon, Hyeon and Aupetit, Michaël and Shin, DongHwa
and Cho, Aeri and Park, Seokhyeon and Seo, Jinwook},
journal={IEEE Transactions on Pattern Analysis and Machine
Intelligence},
title={Measuring the Validity of Clustering Validation Datasets},
year={2025},
pages={1-14},
doi={10.1109/TPAMI.2025.3548011}}
Ranked datasets
https://github.com/hj-n/labeled-datasets
Adjusted IVMs
https://github.com/hj-n/clm
Other amazing work of Hyeon
https://www.hyeonjeon.com/publications
Read, Use, Share, Cite