1-minute and 15-minute summaries of our IEEE TPAMI paper:
H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park and J. Seo, "Measuring the Validity of Clustering Validation Datasets," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3548011
Clustering is essential to data analytics.
Practitioners (data scientists, domain experts) pick a clustering technique to explore their domain-specific dataset.
Researchers design clustering techniques and rank them on benchmark datasets representative of an application domain to help practitioners choose the most suitable technique.
We question the validity of benchmark datasets used for clustering validation.
We propose an axiomatic approach and its practical implementation to assess and rank benchmark datasets used for clustering validation.
We show that many benchmark datasets are of low quality, which has drastic consequences when they are used to rank clustering techniques.
We discuss future usage of our approach to explore how concepts cluster in the representation spaces of GenAI foundation models.
Ranked datasets
https://github.com/hj-n/labeled-datasets
Adjusted internal validation measures (IVMs)
https://github.com/hj-n/clm
Other amazing work by Hyeon Jeon
https://www.hyeonjeon.com/publications