This document discusses open challenges and similarity measures for web document clustering. It begins with an introduction to text mining and document clustering, then reviews related work on similarity approaches and measures. The key challenges covered include measuring semantic similarity between words and evaluating cluster validity. The similarity measures described include string-based measures such as Jaro-Winkler distance and corpus-based measures such as latent semantic analysis. The document concludes that accurate clustering requires a precise definition of similarity between document pairs, and it reviews the similarity measures that can be used for this purpose.
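As an illustration of the string-based measures mentioned above, the following is a minimal Python sketch of Jaro-Winkler similarity. It is not taken from the document itself: the function names `jaro` and `jaro_winkler` are hypothetical, and the prefix scale of 0.1 and maximum prefix length of 4 are the commonly used defaults, assumed here. The corresponding distance is usually taken as one minus the similarity.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and lie
    # within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # positions where they differ; half that count is the number of
    # transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1,  # assumed default
                 max_prefix: int = 4) -> float:  # assumed default
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

A measure of this kind compares short strings such as individual terms or titles. Comparing whole documents would typically rely on a corpus-based or vector-based measure instead.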