The document discusses parallelizing spam clustering with Apache Hadoop. It presents a k-means clustering implementation that runs on a dataset of 1 million spam emails distributed across a Hadoop cluster. The implementation abstracts the k-means algorithm and defines mappers and reducers so that the algorithm runs in parallel. Benchmark results show that the Hadoop implementation is faster than a sequential approach and scales well as nodes are added. An analysis of overhead identifies sorting as the largest contributor. The document concludes that there is room for further optimization of the system.
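
To make the mapper/reducer split concrete, here is a minimal sketch of one k-means iteration written in map/reduce style. It is not the document's implementation: it runs in a single Python process rather than on Hadoop, the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, and it assumes each email has already been turned into a numeric feature vector. The grouping step stands in for the shuffle-and-sort phase that a Hadoop job performs between map and reduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point: Vector, centroids: List[Vector]) -> Tuple[int, Tuple[Vector, int]]:
    """Map step (hypothetical): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values: Iterable[Tuple[Vector, int]]) -> Vector:
    """Reduce step (hypothetical): average all points assigned to one centroid."""
    total = None
    count = 0
    for vec, n in values:
        total = vec if total is None else tuple(t + v for t, v in zip(total, vec))
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points: Iterable[Vector], centroids: List[Vector]) -> List[Vector]:
    """Run one map -> group-by-key -> reduce pass and return updated centroids."""
    groups: Dict[int, List[Tuple[Vector, int]]] = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)  # stands in for Hadoop's shuffle and sort

    updated = list(centroids)  # a centroid with no assigned points is kept as-is
    for key, values in groups.items():
        updated[key] = kmeans_reduce(values)
    return updated


if __name__ == "__main__":
    # Toy 2-D vectors, invented for illustration only.
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    centroids = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(points, centroids))
```

Emitting `(point, 1)` pairs rather than bare points keeps the reduce step associative, so partial sums could be pre-aggregated on each node before the shuffle. That is a common design choice for this kind of job, not something the summary confirms the authors did.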