(1 min summary + main result)
• Clustering is essential to data analytics
• Practitioners (Data Scientists, Domain Experts) pick a clustering technique
to explore their specific domain dataset.
• Researchers design clustering techniques and rank them on benchmark
datasets representative of an application domain to help practitioners choose
the most suitable technique.
• We question the validity of benchmark datasets used for clustering
validation.
• We propose an axiomatic approach and its practical implementation to
evaluate and rank benchmark datasets for clustering evaluation.
• We show that many benchmark datasets are of low quality, which has drastic
consequences when they are used to rank clustering techniques.
• We discuss future usage of our approach to explore how concepts cluster in
the representation spaces of GenAI foundation models
Next page to get the main result
H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park and J. Seo, "Measuring the Validity of Clustering Validation Datasets,"
in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3548011
https://ieeexplore.ieee.org/document/10909451
Ranking of clustering techniques
strongly depends on cluster label
matching of benchmark datasets
Researchers should use top-tier benchmark
datasets to compare clustering techniques.
Next page to get a 15 min summary
• Ranking clustering techniques using only the benchmark datasets that are
top-tier in cluster label matching (as evaluated with the proposed IVMA)
leads to drastically different rankings than using all the datasets or only
the bottom-tier ones.
(a) shows the ranking stability when a researcher picks 10 benchmark
datasets at random from the top-tier, entire, or bottom-tier sets.
(b) shows the ranking obtained using 100% of each of these subsets.
Clustering is an essential tool
for practitioners
• Health: Discover new drugs against cancer
• E-Health: Discover activity patterns to fight obesity/diabetes
• Cybersecurity: Discover a new type of attack
• Education: Segment pupils by skillsets
• Materials science: Discover new materials to capture CO2
• Marketing: Segment customer profiles or new markets
• GenAI: Explore internal knowledge for explainability/alignment
…
(15-minute summary)
Clustering basics
• Data instances are tuples of features
• Patients characterized by their biometrics, genetics…
• Customers characterized by their demographics, transactions…
• Materials characterized by their chemical components, topology…
• Cyber attacks characterized by their IPs, process & network events…
• A similarity/distance/divergence measure tells how similar two
(sets of) instances are based on their feature values
• Euclidean, Cosine, Edit, Kullback-Leibler, Wasserstein…
• A clustering technique assigns labels to instances so that
instances with the same label are more similar to each other
than to instances with different labels
• Cluster labels have no semantics (they are arbitrary identifiers)
• The number of clusters is unknown
• K-Means, Gaussian Mixture, Spectral, Agglomerative…
[Figure: a toy 2D dataset partitioned into Cluster 1 and Cluster 2]
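A minimal sketch of these ingredients (illustrative only; scikit-learn and SciPy, with a made-up toy dataset):

```python
# Minimal sketch of the basics above (illustrative; not from the paper).
import numpy as np
from scipy.spatial.distance import euclidean, cosine
from sklearn.cluster import KMeans

# Instances as tuples of features (e.g., two biometric measurements).
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.5], [7.8, 8.1]])

# Similarity/distance measures compare instances by their feature values.
print(euclidean(X[0], X[1]), cosine(X[0], X[2]))

# A clustering technique assigns labels; the labels are arbitrary
# identifiers (no semantics) and the "right" number of clusters is
# unknown a priori -- here we simply ask for 2.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g., [0 0 1 1]
```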
Good and bad clusterings
Labels are assigned to instances by the clustering technique
based on the relative positions of the instances in the feature space.
[Figure annotations: two clusters are adjacent; two clusters are merged/overlapping; one cluster is split]
(A) Labels match well with clusters
(B) Labels do not match with clusters
Internal validation measures (IVM) like Silhouette, Davies-Bouldin…
are used to quantify this cluster-label matching (CLM)
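A toy illustration of CLM (ours, with scikit-learn): the same data scored with well-matching and with shuffled labels.

```python
# Toy CLM illustration: an IVM scores how well given labels match the data
# geometry, without needing any clustering algorithm or ground truth.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, good_labels = make_blobs(n_samples=300, centers=3, random_state=0)
bad_labels = np.random.default_rng(0).permutation(good_labels)  # bad CLM

print(silhouette_score(X, good_labels))  # high: labels match clusters (A)
print(silhouette_score(X, bad_labels))   # near 0 or negative: poor CLM (B)
```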
Practitioners explore specific data
• They seek the best cluster labels for their specific dataset
• Which clustering technique do they choose?
• (In theory) Effectiveness validated on benchmark datasets in published
research
• (In practice) Availability of a Python / R… package
Researchers design new techniques
• They seek the clustering technique that gives the best
cluster labels across multiple datasets representative of
domain data
• They also look for clustering techniques that
• can handle specific data types (text, images, speech, genes…)
• are faster
• are scalable to a large number of features and instances
• are easier to use/have fewer hand-tuned parameters
• …
• How do they validate clustering techniques?
• Using benchmark datasets with ground-truth labels
representative of domain data
We question the validity of clustering
validation benchmark datasets
How does clustering validation work?
• Practitioners
• Consider the domain dataset D to explore and analyze
• Pick a clustering technique C
• Pick a set of parameter values k, typically the number of clusters
• Run clustering C_k to get cluster labels L = C_k(D)
• Compute an internal validation measure (IVM) for each
clustering result: IVM(D, L)
• Silhouette, Xie-Beni, Davies-Bouldin, Calinski-Harabasz…
• Select the clustering L*_C = argmax_k IVM(D, C_k(D)) that maximizes
the IVM over all C_k
• Proceed with the analysis and report their results to their clients
• Researchers
• Pick a benchmark dataset D_i with ground-truth labels L_GT^i
• Pick a clustering technique C_j
• Get the optimal clustering L*_Cj of D_i as a practitioner would
• Compute an external validation measure (EVM) to compare the
optimal labels L*_Cj with the ground truth L_GT^i: EVM(L*_Cj, L_GT^i)
• Adjusted Rand Index, Adjusted Mutual Information…
• Rank clustering techniques C_1, C_2… by their EVM
aggregated across all benchmark datasets D_1, D_2…
• Publish their results to reach practitioners and other researchers
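The two workflows above, as a minimal runnable sketch (illustrative only: scikit-learn, a toy dataset, and two techniques stand in for the paper's benchmarks and technique pool):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

D, L_gt = load_iris(return_X_y=True)  # benchmark dataset D_i with labels L_GT^i

# Practitioner loop: keep the clustering that maximizes an IVM (Silhouette).
def best_clustering(D, technique, k_range):
    best_score, best_labels = -np.inf, None
    for k in k_range:
        labels = technique(n_clusters=k).fit_predict(D)
        score = silhouette_score(D, labels)  # IVM(D, L)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # L*_C

# Researcher loop: compare the IVM-optimal labels to the ground truth with
# an EVM (Adjusted Rand Index); aggregating over datasets gives the ranking.
for name, technique in [("k-means", KMeans),
                        ("agglomerative", AgglomerativeClustering)]:
    L_star = best_clustering(D, technique, k_range=range(2, 9))
    print(name, "EVM (ARI):", adjusted_rand_score(L_star, L_gt))
```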
Why must clustering benchmark datasets be validated?
• If ground-truth labels L_GT (A) match well with data clusters (good CLM)
• Then EVM (D,G) correlates with clustering L* quality (C,F)
• Good clustering (C) gets high EVM (D)
• Bad clustering (F) gets low EVM (G)
EVM is a reliable measure of clustering quality. All good!
• If ground-truth labels L_GT (B) do not match with data clusters (bad CLM)
• Then EVM (E,H) does not correlate with L* quality (C,F)
• Good clustering (C) gets low EVM (E)
• Bad clustering (F) gets low EVM (H)
EVM is unreliable and can lead to a pessimistic evaluation
of clustering techniques!
How to evaluate the quality of
benchmark datasets?
• How can we measure the cluster-label matching (CLM) of the ground-truth
labels? Can we use internal validation measures (IVMs)?
• No, standard IVMs are designed to compare different clusterings of
the same specific dataset (same instances, same features (A,B)).
• We need new IVMs to compare CLM across different datasets (I,J,K).
• We state that such adjusted IVMs (IVMA) must be invariant to whatever
can change across multiple benchmark datasets (I,J,K) yet is irrelevant
to quantifying the CLM. These are the across-datasets axioms:
• Axiom A1: Invariance to the number of data instances (data size)
• Axiom A2: Invariance to the number of features (data dimension)
• Axiom A3: Invariance to the number of labels (in L_GT)
• Axiom A4: Invariance to the range of IVMA values
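As a reading aid, one way these invariances could be written formally (our notation, only a sketch; the paper's exact definitions are in its Sections 3 and 4):

```latex
% Sketch only (our notation). Let T range over hypothetical CLM-preserving
% transformations that change only the data size (A1), the dimension (A2),
% or the number of ground-truth labels (A3):
\forall T \in \{T_{\mathrm{size}}, T_{\mathrm{dim}}, T_{\mathrm{label}}\} :
\quad \mathrm{IVM}_A\bigl(T(D), T(L_{GT})\bigr) = \mathrm{IVM}_A\bigl(D, L_{GT}\bigr)
% Axiom A4 (fixed range): \mathrm{IVM}_A(D, L_{GT}) \in [0, 1]
% for every dataset (D, L_{GT}).
```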
How to implement these axioms?
• We propose formal definitions for these axioms
• We propose protocols to adjust existing IVMs so that they satisfy the
across-datasets axioms (a generic sketch follows below)
• More details in sections 3 and 4
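To make the idea concrete, here is a generic chance-adjustment sketch in the spirit of the Adjusted Rand Index (our illustration; the paper's actual adjustment protocols differ in their details):

```python
# Illustration only: calibrate an IVM by its chance baseline under random
# label permutations, a first step toward scores comparable across datasets.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def chance_adjusted_ivm(D, labels, ivm=calinski_harabasz_score,
                        n_perm=50, seed=0):
    rng = np.random.default_rng(seed)
    observed = ivm(D, labels)
    # Expected IVM when the labels carry no cluster structure at all:
    chance = np.mean([ivm(D, rng.permutation(labels)) for _ in range(n_perm)])
    # Shift by the chance baseline; a range normalization (Axiom A4)
    # would still be needed on top of this.
    return observed - chance
```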
• Now, let’s see the experimental results…
IVMA are more accurate at evaluating
cluster label matching (1)
• For each of 96 benchmark datasets, we compare IVMs, adjusted IVMs
(IVMA), and supervised classifiers based on the data labels L_GT
against a ground-truth clustering L* obtained from an ensemble of
9 clustering techniques.
• Classifiers (as expected) cannot tell apart good from bad clusterings.
• Adjusted IVMs (IVMA) of the data labels L_GT match best with the
ground-truth clustering L*.
IVMA are more accurate at evaluating
cluster label matching (2)
• Adjusted IVMA (b) correlate better with the ground-truth ensemble
clustering than base IVMs (a)
[Scatterplots: (a) base IVM vs. ensemble clustering; (b) adjusted IVMA vs. ensemble clustering]
Ranking of clustering techniques
strongly depends on cluster label
matching of benchmark datasets
Researchers should use IVMA-based top-tier benchmark datasets
to compare clustering techniques.
• Ranking clustering techniques using only the benchmark datasets that are
top-tier in cluster label matching (as evaluated with the proposed IVMA)
leads to drastically different rankings than using all the datasets or only
the bottom-tier ones.
(a) shows the ranking stability when a researcher picks 10 benchmark
datasets at random from the top-tier, entire, or bottom-tier sets.
(b) shows the ranking obtained using 100% of each of these subsets.
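A small sketch of how such ranking disagreement can be quantified (our choice of statistic, not necessarily the paper's; the EVM values below are made-up placeholders):

```python
# Compare technique rankings obtained from different benchmark tiers with
# Kendall's tau; low/negative tau means the tier choice changes the ranking.
import numpy as np
from scipy.stats import kendalltau

def mean_evm_ranking(evm):  # evm: {technique: [EVM score per dataset]}
    return sorted(evm, key=lambda t: -np.mean(evm[t]))

top_tier    = {"kmeans": [0.80, 0.70], "spectral": [0.90, 0.85], "agglo": [0.60, 0.65]}
bottom_tier = {"kmeans": [0.10, 0.20], "spectral": [0.05, 0.10], "agglo": [0.30, 0.20]}

r_top, r_bot = mean_evm_ranking(top_tier), mean_evm_ranking(bottom_tier)
techniques = sorted(top_tier)
tau, _ = kendalltau([r_top.index(t) for t in techniques],
                    [r_bot.index(t) for t in techniques])
print(r_top, r_bot, tau)  # disagreeing rankings give low/negative tau
```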
IVMA are fast to compute
• IVMA are as fast as IVMs
• IVMA are 4 orders of magnitude faster than the clustering ensemble
What’s Next
• In GenAI foundation models, semantically related facts or tokens should
form clusters in different embedding spaces. Verifying and tracking these
clusters could help explain and edit models' responses.
• However, these embedding spaces may change dimension across layers, and
the sample size and the labels categorizing sets of facts or tokens may
vary across models or analytic phases.
• Because adjusted IVMA are fast and accurate, and robust to changes in
dimension, sample size, and labels, they lay the groundwork for new
approaches that evaluate how labels of interest form clusters across
layers and model instances during pre- and post-training of GenAI models
(a hedged sketch follows).
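One hypothetical shape such an analysis could take (ours, not from the paper; reuses the chance_adjusted_ivm sketch above, and assumes embeddings and concept labels were extracted with separate tooling):

```python
# Hedged sketch of the proposed usage: score how well concept labels
# cluster in each layer's embedding space of a foundation model.
# `layer_embeddings` maps layer names to (n_items, dim) arrays and
# `concept_labels` is a length-n_items array; both are assumed given.
# `clm_score` can be any adjusted IVM, e.g., chance_adjusted_ivm above.
import numpy as np

def clm_across_layers(layer_embeddings, concept_labels, clm_score):
    """Return {layer_name: CLM score of the concept labels in that layer}."""
    return {name: clm_score(E, np.asarray(concept_labels))
            for name, E in layer_embeddings.items()}

# Layers where the score drops would be layers where the concept structure
# dissolves; tracking checkpoints during training follows the same pattern.
```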
Thank you for reading up here!
H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park and J. Seo,
"Measuring the Validity of Clustering Validation Datasets,"
in IEEE Transactions on Pattern Analysis and Machine
Intelligence, doi: 10.1109/TPAMI.2025.3548011
https://ieeexplore.ieee.org/document/10909451
@ARTICLE{10909451,
author={Jeon, Hyeon and Aupetit, Michaël and Shin, DongHwa
and Cho, Aeri and Park, Seokhyeon and Seo, Jinwook},
journal={IEEE Transactions on Pattern Analysis and Machine
Intelligence},
title={Measuring the Validity of Clustering Validation Datasets},
year={2025},
pages={1-14},
doi={10.1109/TPAMI.2025.3548011}}
Ranked datasets
https://github.com/hj-n/labeled-datasets
Adjusted IVMs
https://github.com/hj-n/clm
Other amazing work of Hyeon
https://www.hyeonjeon.com/publications
Read, Use, Share, Cite