This document presents a lock-free, tree-based reduction algorithm for large-scale clustering on GPGPUs. It first describes how lock contention can reduce parallel efficiency, then illustrates a lock-free technique that uses tree-based reduction to cluster large datasets on GPGPUs. Experimental results compare the performance of three approaches: atomic instructions, the CUDA Thrust library, and the proposed method.
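
As a rough illustration of the tree-based reduction idea named above, the following Python sketch simulates it on the CPU for a k-means-style centroid update. It is not the document's GPU kernel: the function names (`tree_reduce`, `combine`, `centroid_update`), the `num_workers` parameter, and the use of 1-D points are all assumptions made for this example. The point it tries to show is that each worker first accumulates into its own private table, and the tables are then merged pairwise in rounds, so that every slot has a single writer per round and no lock or atomic update is needed.

```python
def combine(a, b):
    """Merge two per-cluster tables of (sum, count) pairs element-wise."""
    return [(sa + sb, ca + cb) for (sa, ca), (sb, cb) in zip(a, b)]


def tree_reduce(partials, op):
    """Pairwise (tree) reduction over a list of partial results.

    In each round, slot i absorbs slot i + stride. Every slot is written by
    at most one logical worker per round, which is why a GPU version of this
    pattern needs only a barrier between rounds, not locks or atomics.
    """
    if not partials:
        raise ValueError("tree_reduce needs at least one partial result")
    buf = list(partials)
    n = len(buf)
    stride = 1
    while stride < n:
        # On a GPU these iterations would run concurrently, one per thread.
        for i in range(0, n - stride, 2 * stride):
            buf[i] = op(buf[i], buf[i + stride])
        stride *= 2
    return buf[0]


def centroid_update(points, labels, k, num_workers=4):
    """Compute per-cluster means of 1-D points (hypothetical helper).

    Each simulated worker accumulates privately over a strided slice of the
    input, then the private tables are merged with tree_reduce.
    """
    partials = []
    for w in range(num_workers):
        table = [[0.0, 0] for _ in range(k)]
        for x, c in zip(points[w::num_workers], labels[w::num_workers]):
            table[c][0] += x
            table[c][1] += 1
        partials.append([tuple(entry) for entry in table])
    totals = tree_reduce(partials, combine)
    # Empty clusters have no mean; report None for them.
    return [s / n if n else None for s, n in totals]
```

A contended alternative would have every worker update one shared table under a lock or with atomic adds. The sketch trades that contention for roughly log2(num_workers) merge rounds over private tables.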