This document describes research into automatically detecting web trackers. It explores whether network structure and behavior can be used to classify the nodes of a bipartite graph of website referrers and hosts as trackers or non-trackers. Several graph-based and machine-learning algorithms are tested, including relevance search, clustering, and community detection on a projected graph. The best-performing approach preprocesses logs into graphs and then runs community detection, and it classifies nodes with very high accuracy. Lessons learned include starting simply, choosing features incrementally, visualizing the data, and using a flexible system such as Flink.
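To make the "projected graph" step concrete, here is a minimal sketch of a one-mode projection of a referrer-host bipartite graph. It is an illustration under stated assumptions, not the pipeline used in the research: the projection is assumed to be onto the host side, the edge weight is assumed to be the number of shared referrers, and the function name `project_hosts` and the sample domains are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_hosts(edges):
    """Project a referrer-host bipartite graph onto its host side.

    edges: iterable of (referrer, host) pairs, e.g. extracted from logs.
    Returns a dict mapping a sorted (host_a, host_b) pair to the number
    of referrers that contacted both hosts (an assumed weighting).
    """
    # Group the hosts seen under each referrer.
    hosts_by_referrer = defaultdict(set)
    for referrer, host in edges:
        hosts_by_referrer[referrer].add(host)

    # Two hosts are linked once for every referrer they share.
    weights = defaultdict(int)
    for hosts in hosts_by_referrer.values():
        for a, b in combinations(sorted(hosts), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy log: two sites embed both third-party hosts, one embeds only one.
sample_edges = [
    ("news.example", "ads.example"),
    ("news.example", "cdn.example"),
    ("shop.example", "ads.example"),
    ("shop.example", "cdn.example"),
    ("blog.example", "ads.example"),
]
projected = project_hosts(sample_edges)
```

In a setup like this, the weighted host-to-host graph would be the input to a clustering or community detection step, on the intuition that hosts sharing many referrers tend to behave alike.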