Indexing Techniques for
Scalable Record Linkage and Deduplication
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 3
April 14, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Record Linkage 1 / 18
Introduction
Introduction
Matching.
Approach known as:
Data or Record Linkage.
Data or Field Matching.
The Merge/Purge Problem.
Data sets too large to fit in main memory.
Corrupted incoming data, requiring complex equivalence tests.
Accuracy is more important than missing data.
Matching Records
{Data|Record} Linkage | {Data|Field} Matching
Motivation
Linked Data
Improving data quality and integrity.
Allowing re-use of existing data sources.
Reducing costs and efforts in data acquisition.
Multiple Domains
Fraud and crime detection.
Pervasive health systems.
Enterprise business systems.
Indexing in Record Linkage
Record Linkage
Record Linkage Approaches
Blocking
Group records with similar values into the same block.
A blocking key determines the block of each record.
Trade-off in block size: false negatives vs. comparison cost.
Blocking Keys
Number of true matches among the candidate record pairs ⇑.
Total number of candidate pairs ⇓.
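The trade-off above can be illustrated with a minimal Python sketch of standard blocking. The blocking key (first two letters of the surname plus the postcode), the toy records, and the set of true matches are all assumptions made up for illustration. The two measures computed at the end are commonly called pairs completeness and reduction ratio in the record linkage literature.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: surname prefix + postcode (an assumption).
    return record["surname"][:2].lower() + record["postcode"]


def candidate_pairs(records):
    # Put every record into the block named by its blocking key.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    # Only records inside the same block become candidate pairs.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy data, invented for this example.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Smith", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2913"},
    5: {"surname": "Jonse", "postcode": "2913"},
    6: {"surname": "Miller", "postcode": "2600"},
}
true_matches = {(1, 2), (1, 3), (2, 3), (4, 5)}

pairs = candidate_pairs(records)
total_pairs = len(records) * (len(records) - 1) // 2

# Share of true matches that survive blocking (should go up).
pairs_completeness = len(pairs & true_matches) / len(true_matches)
# Share of all possible comparisons that blocking avoids (should go up
# as the number of candidate pairs goes down).
reduction_ratio = 1 - len(pairs) / total_pairs

print(sorted(pairs), pairs_completeness, round(reduction_ratio, 3))
```

A coarser key (for example, the postcode alone) would create larger blocks and more candidate pairs, while a finer key risks separating true matches into different blocks.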
Research Avenues
Scaling to large data sets.
While keeping a high linkage quality.
Development of techniques that can learn optimal blocking key
definitions.
Manual ⇒ Supervised machine learning based approaches.
Machine learning approaches leveraging:
Predicate-based formulations of learnable blocking functions.
The sequential covering algorithm, which discovers disjunctive sets of
rules.
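As a rough illustration of learning a disjunction of blocking predicates, the sketch below uses a simplified greedy covering loop in the spirit of sequential covering. It is not the algorithm from any specific paper. The predicate names, the `max_noise` threshold, and the labelled toy pairs are all hypothetical.

```python
# Hypothetical blocking predicates: each maps a record to a comparison key.
PREDICATES = {
    "same_surname_prefix": lambda r: r["surname"][:3].lower(),
    "same_postcode": lambda r: r["postcode"],
    "same_given_initial": lambda r: r["given"][:1].lower(),
}


def covered(pred, pairs, records):
    # Pairs whose two records agree on the predicate's key.
    return {(a, b) for a, b in pairs if pred(records[a]) == pred(records[b])}


def learn_disjunction(records, matches, non_matches, max_noise=1):
    # Greedily add the predicate that covers the most uncovered matches,
    # skipping predicates that cover too many known non-matches.
    remaining = set(matches)
    candidates = dict(PREDICATES)
    chosen = []
    while remaining and candidates:
        best_name, best_cover = None, set()
        for name, pred in candidates.items():
            if len(covered(pred, non_matches, records)) > max_noise:
                continue
            cover = covered(pred, remaining, records)
            if len(cover) > len(best_cover):
                best_name, best_cover = name, cover
        if best_name is None:
            break  # no acceptable predicate covers anything further
        chosen.append(best_name)
        remaining -= best_cover
        del candidates[best_name]
    return chosen, remaining


# Toy labelled data, invented for this example.
records = {
    1: {"given": "John", "surname": "Smith", "postcode": "2600"},
    2: {"given": "Jon", "surname": "Smith", "postcode": "2601"},
    3: {"given": "Elizabeth", "surname": "Jones", "postcode": "2913"},
    4: {"given": "Liz", "surname": "Jonse", "postcode": "2913"},
    5: {"given": "Peter", "surname": "Miller", "postcode": "2600"},
    6: {"given": "Peter", "surname": "Muller", "postcode": "2600"},
}
matches = {(1, 2), (3, 4), (5, 6)}
non_matches = {(1, 5), (1, 6), (2, 3), (4, 5)}

chosen, remaining = learn_disjunction(records, matches, non_matches)
print(chosen, remaining)
```

In this toy run no single predicate covers all three labelled matches, so the learned rule is a disjunction of two predicates. The postcode predicate is rejected because it also groups too many known non-matches.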
Evaluation
Evaluation
Evaluation Framework
Febrl (Freely Extensible Biomedical Record Linkage).
Developed in Python -
https://sourceforge.net/projects/febrl/
Data standardisation (segmentation and cleaning).
Probabilistic record linkage ("fuzzy" matching).
Data Sets
SecondString Toolkit.
Developed in Java - http://secondstring.sourceforge.net/
Approximate string-matching techniques.
Census, bibliographic, restaurant, and CD records.
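To give a feel for approximate string matching, here is a small self-contained Python sketch of the Levenshtein edit distance and a normalised similarity derived from it. It does not use the SecondString or Febrl APIs. The function names and the normalisation by the longer string's length are assumptions chosen for illustration.

```python
def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s, t):
    # Map the distance to [0, 1]; 1.0 means the strings are identical.
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))


print(edit_distance("smith", "smyth"), similarity("smith", "smyth"))
```

A linkage system would typically compare such a similarity against a threshold, or feed it into a classifier, rather than require exact equality of field values.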
Indexing Techniques
Sorted-Neighborhood
Sorted-Neighborhood method
Partition the data.
Sort the partitions by the most important
blocking key value (BKV) before matching.
What if the keys are corrupted?
Approach:
Create Keys
Sort Data
Merge
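The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sorting key (surname prefix plus given-name initial), the toy records, and the function names are hypothetical. The merge step here only generates candidate pairs inside a sliding window and leaves the actual comparison to a later stage.

```python
def make_key(record):
    # Step 1, create keys. Hypothetical key: surname prefix + given initial.
    return (record["surname"][:3] + record["given"][:1]).lower()


def sorted_neighborhood_pairs(records, window=3):
    # Step 2, sort data: order record ids by key (id breaks ties).
    ordered = sorted(records, key=lambda rid: (make_key(records[rid]), rid))
    # Step 3, merge: slide a window of fixed width over the sorted list and
    # pair each record with the next (window - 1) records only.
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.add((min(rid, other), max(rid, other)))
    return pairs


# Toy data, invented for this example.
records = {
    1: {"given": "John", "surname": "Smith"},
    2: {"given": "Jon", "surname": "Smith"},
    3: {"given": "Mary", "surname": "Jones"},
    4: {"given": "Mary", "surname": "Jonse"},
    5: {"given": "Peter", "surname": "Miller"},
    6: {"given": "Jon", "surname": "Smyth"},
}

narrow = sorted_neighborhood_pairs(records, window=2)
wide = sorted_neighborhood_pairs(records, window=3)
print(sorted(narrow))
print(sorted(wide))
```

With `window=2` the pair (1, 6), "Smith" vs. "Smyth", is missed because the two records are not adjacent after sorting; `window=3` catches it at the cost of more comparisons. A key corrupted in its first characters would sort the record far from its duplicates, which is why running several passes with different keys is a commonly described mitigation.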
Case
Equational Theory
Accuracy of Sorted-Neighborhood method
Clustering Methods vs. SNM
Memory-based database (13751 records)
Multiple Processors (1 million records; width = 10)
Time Performance
References
Christen, P. (2012). A survey of indexing techniques for scalable
record linkage and deduplication. IEEE Transactions on Knowledge
and Data Engineering, 24(9), 1537-1555.
Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge
problem for large databases. ACM SIGMOD Record, 24(2),
127-138.
Thank you!
