Indexing Techniques for Scalable Record Linkage and Deduplication

Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 3
April 14, 2015.

Introduction
Matching records that refer to the same entity.
The approach is known as:
Data or Record Linkage.
Data or Field Matching.
The Merge/Purge Problem.
Data sets too large to fit in main memory.
Corrupted incoming data requiring complex matching tests.
Accuracy is more important than missing data.

Matching Records
{Data|Record} Linkage | {Data|Field} Matching

Motivation
Linked Data
Improving data quality and integrity.
Allowing re-use of existing data sources.
Reducing costs and efforts in data acquisition.
Multiple Domains
Fraud and crime detection.
Pervasive health systems.
Enterprise business systems.

Indexing in Record Linkage

Record Linkage
Record Linkage Approaches
Blocking
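Group records with similar values into the same block.
Blocking key: decides which block a record falls into.
Trade-off in block size: false negatives (small blocks) vs. comparison cost (large blocks).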
Blocking Keys
Number of true matches among the candidate record pairs ⇑ (maximize).
Total number of candidate pairs ⇓ (minimize).
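
The two goals above can be illustrated with a minimal Python sketch. The blocking key (first three letters of the surname plus the postcode), the sample records, and the set of true matches are all hypothetical, chosen only for illustration; they are not taken from the evaluation. The two ratios computed at the end are commonly called pairs completeness and reduction ratio.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first 3 letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records):
    # Group record ids by blocking key, then pair up ids within each block.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data: records 1-3 are assumed to be the same person.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smithe", "postcode": "2600"},
    3: {"surname": "Smyth", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2000"},
}
true_matches = {(1, 2), (1, 3), (2, 3)}

pairs = candidate_pairs(records)
total_pairs = len(records) * (len(records) - 1) // 2

# Fraction of true matches that survive blocking (should be high).
kept = len(pairs & true_matches) / len(true_matches)
# Fraction of all possible pairs that blocking avoids comparing (should be high).
pruned = 1 - len(pairs) / total_pairs
```

With this key, "Smyth" falls into a different block from "Smith" and "Smithe", so two of the three assumed true matches are lost: a small block keeps the cost low but introduces false negatives.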

Research Avenues
Scaling to large data sets.
While keeping a high linkage quality.
Development of techniques that can learn optimal blocking key definitions.
Manual ⇒ Supervised machine learning based approaches.
Machine learning approaches leveraging:
Predicate-based formulations of learnable blocking functions.
The sequential covering algorithm, which discovers disjunctive sets of rules.

Evaluation
Evaluation Framework
Febrl (Freely Extensible Biomedical Record Linkage).
Developed in Python: https://sourceforge.net/projects/febrl/
Data standardisation (segmentation and cleaning).
Probabilistic record linkage ("fuzzy" matching).
Data Sets
SecondString Toolkit.
Developed in Java: http://secondstring.sourceforge.net/
Approximate string-matching techniques.
Census, bibliographic, restaurant, and CD records.

Indexing Techniques

Sorted-Neighborhood
Sorted-Neighborhood method
Partition the data.
Sort each partition on the most important blocking key value (BKV) before matching.
Corrupted keys?
Approach:
Create Keys
Sort Data
Merge
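
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation evaluated in the slides: the sorting key (three letters of the surname plus the initial of the given name), the sample records, and the window size are hypothetical. The sketch only generates candidate pairs; the actual match decision on each pair is left out.

```python
def sorting_key(record):
    # Step 1 (create keys). Hypothetical key: 3 letters of surname + given-name initial.
    return (record["surname"][:3] + record["given"][:1]).lower()


def sorted_neighbourhood_pairs(records, window=3):
    # Step 2 (sort data): order record ids by key, using the id to break ties.
    ordered = sorted(records, key=lambda rid: (sorting_key(records[rid]), rid))
    # Step 3 (merge): slide a fixed-size window over the sorted list and
    # pair each record only with the window - 1 records that follow it.
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add((min(rid, other), max(rid, other)))
    return pairs


# Made-up sample data.
records = {
    1: {"surname": "Smith", "given": "John"},
    2: {"surname": "Smyth", "given": "John"},
    3: {"surname": "Smith", "given": "Jon"},
    4: {"surname": "Jones", "given": "Mary"},
    5: {"surname": "Smithe", "given": "J"},
}

pairs = sorted_neighbourhood_pairs(records, window=3)
```

Because neighbours are defined by sort order rather than by an exact key, a record with a slightly different key ("Smyth") can still share a window with nearby records. A key corrupted in its first characters sorts far away, though, which is the concern raised by "Corrupted keys?".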

Case

Equational Theory

Accuracy of Sorted-Neighborhood method

Clustering Methods vs. SNM

Memory-based database (13751 records)

Multiple Processors (1 million records; width = 10)

Time Performance

References
Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537-1555.
Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. ACM SIGMOD Record, 24(2), 127-138.
Thank you!
