This document discusses indexing techniques for scalable record linkage and deduplication. It introduces the problems of performing record linkage on datasets that are too large to fit in memory and of handling corrupted data. Blocking is presented as a common approach: similar records are grouped into blocks so that only records within the same block are compared, which reduces the number of record pairs that must be compared. The document also discusses research on machine learning techniques that automatically learn optimal blocking keys and blocking functions, and it introduces evaluation frameworks for record linkage. The sorted neighborhood method is described in detail, including how it creates keys, sorts the data, and merges records to link them.
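As a rough illustration of the two indexing ideas summarized above, the following minimal sketch generates candidate record pairs with standard blocking and with a sorted neighborhood window. The record fields (`surname`, `year`), the key definition in `blocking_key`, and the function names are hypothetical choices made for this example; they are not taken from the document.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three letters of the surname plus birth year."""
    return (record["surname"][:3].lower(), record["year"])


def standard_blocking_pairs(records, key_func):
    """Group records by key and pair up only records that share a block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key_func(record)].append(index)
    pairs = set()
    for members in blocks.values():
        # members are in increasing index order, so each pair is (low, high)
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighborhood_pairs(records, key_func, window=3):
    """Sort records by key, then pair each record with the next
    (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"surname": "Smith", "year": 1980},
    {"surname": "Smyth", "year": 1980},
    {"surname": "Jones", "year": 1975},
    {"surname": "smith", "year": 1980},
    {"surname": "Jonas", "year": 1975},
]

all_pairs = set(combinations(range(len(records)), 2))
block_pairs = standard_blocking_pairs(records, blocking_key)
window_pairs = sorted_neighborhood_pairs(records, blocking_key, window=3)

print(len(all_pairs), len(block_pairs), len(window_pairs))
```

In this toy data, standard blocking never compares "Smith" with "Smyth" because their keys differ, while the sliding window does compare them because the two keys are adjacent after sorting. A larger window admits more candidate pairs at the cost of more comparisons.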