Deduplication
Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

Getting started
  • Baby steps

The problem
  • The suppliers table
  • Real-world data is very, very messy

The problem – take 2
  • [Diagram: Suppliers, Customers and Companies tables held in the CRM, Billing and ERP systems]
  • Each of these has internal duplicates, plus duplicates across the tables. No easy fix.

But ... what about identifiers?
  • No, there are no system IDs across these tables
  • Yes, there are outside identifiers
      • organization number for companies
      • personal number for people
  • But, these are problematic
      • many records don't have them
      • they are inconsistently formatted
      • sometimes they are misspelled
      • some parts of huge organizations have the same org number, but need to be treated as separate

First attempt at solution
  • I wrote a simple Python script in ~2 hours
  • It does the following:
      • load all records
      • normalize the data
          • strip extra whitespace, lowercase, remove letters from org codes...
      • use Bayesian inferencing for matching

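The normalization step listed on this slide could look roughly like the sketch below. The original script is not shown in the deck, so the function names, the sample record, and the choice to drop every non-digit character from org codes (not only letters) are assumptions made for illustration.

```python
import re


def normalize_name(value):
    """Collapse runs of whitespace, trim both ends, and lowercase."""
    return re.sub(r"\s+", " ", value).strip().lower()


def normalize_org_code(value):
    """Keep only the digits of an organization code.

    Assumption: spaces and punctuation are dropped along with letters.
    """
    return re.sub(r"\D", "", value)


# Hypothetical record, invented for this example.
record = {"name": "  Acme   Widgets AS ", "org_code": "NO 987 654 321 MVA"}
cleaned = {
    "name": normalize_name(record["name"]),
    "org_code": normalize_org_code(record["org_code"]),
}
print(cleaned)
```

Normalizing before comparison means that differences in spacing, case and prefixes no longer count as evidence against a match.
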
Configuration

Matching
  • This works out to a probability of 0.93

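A minimal sketch of how per-field probabilities can be folded into one match probability with Bayes' rule under the naïve independence assumption. The configuration behind the 0.93 figure is not preserved in this transcript, so the probabilities used here are invented and are not expected to reproduce that number; the function names are also assumptions.

```python
def combine(p1, p2):
    """Combine two independent match probabilities with Bayes' rule.

    Both inputs must lie strictly between 0 and 1.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(evidence, prior=0.5):
    """Fold per-field probabilities into one overall probability."""
    p = prior
    for e in evidence:
        p = combine(p, e)
    return p


# Hypothetical per-field probabilities: name agrees strongly, address
# agrees somewhat, phone number disagrees slightly.
print(round(match_probability([0.8, 0.7, 0.4]), 3))
```

A value above 0.5 pushes the result towards "same entity" and a value below 0.5 pushes it away, while 0.5 leaves the running probability unchanged.
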
Problems
  • The functions comparing values are still pretty primitive
  • Performance is abysmal
      • 90 minutes to process 14,500 records
      • performance is O(n²)
      • total number of records is ~2.5 million
      • time to process all records: 1 year 10 months
  • Now what?

An idea
  • Well, we don't necessarily need to compare each record with all others if we have indexes
      • we can look up the records which have matching values
  • Use DBM for the indexes, for example
  • Unfortunately, these only allow exact matching
  • But, we can break up complex values into tokens, and index those
  • Hang on, isn't this rather like a search engine?
  • Bing!
  • Let's try Lucene!

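The token-index idea can be sketched in a few lines with an in-memory dictionary standing in for DBM or Lucene. The record layout, field name and helper names are assumptions for illustration, not the original script.

```python
from collections import defaultdict


def tokenize(value):
    """Split a value into lowercase, whitespace-separated tokens."""
    return value.lower().split()


def build_index(records, field):
    """Map each token to the ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in tokenize(rec.get(field, "")):
            index[token].add(rec_id)
    return index


def candidates(index, records, rec_id, field):
    """Return ids of other records sharing at least one token."""
    found = set()
    for token in tokenize(records[rec_id].get(field, "")):
        found |= index.get(token, set())
    found.discard(rec_id)
    return found


# Hypothetical records, invented for this example.
records = {
    1: {"name": "Acme Widgets AS"},
    2: {"name": "ACME Widgets"},
    3: {"name": "Bergen Fish Export"},
}
index = build_index(records, "name")
print(candidates(index, records, 1, "name"))
```

Only the candidates returned by the index need the expensive pairwise comparison, which is what removes the all-against-all loop.
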
Lucene-based prototype
  • I whip out Jython and try it
  • New script first builds Lucene index
  • Then searches all records against the index
  • Time to process 14,500 records: 1 minute
  • Now we're talking...

Reality sets in
  • A splash of cold water to the face

Prior art
  • It turns out people have been doing this before
  • They call it
      • entity resolution
      • identity resolution
      • merge/purge
      • deduplication
      • record linkage
      • ...
  • This makes Googling for information an absolute nightmare

Existing tools
  • Several commercial tools
      • they look big and expensive: we skip those
  • Stian found some open source tools
      • Oyster: slow, bad architecture, primitive matching
      • SERF: slow, bad architecture
  • I've later found more, but was not impressed
  • So, it seems we still have to do it ourselves

Finds in the research literature
  • General
      • problem is well-understood
      • "naïve Bayes" is naïve
      • lots of interesting work on value comparisons
      • performance problem 'solved' with "blocking"
          • build a key from parts of the data
          • sort records by key
          • compare each record with m nearest neighbours
          • performance goes from O(n²) to O(n m)
      • parallel processing widely used
  • Swoosh paper
      • compare and merge should have ICAR¹ properties
      • optimal algorithms for general merge found
      • run-time for 14,000 records ~1.5 hours...
  ¹ Idempotence, commutativity, associativity, representativity

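The blocking steps above (build a key, sort, compare with the m nearest neighbours) can be sketched as follows. The key definition, field names and sample records are assumptions chosen for illustration; real keys depend on the data at hand.

```python
def blocking_key(record):
    """Hypothetical key: first three letters of the name plus postal code."""
    return record["name"].lower().replace(" ", "")[:3] + record["zip"]


def candidate_pairs(records, key_func, m):
    """Yield each record paired with the next m records in key order.

    This produces at most n * m pairs instead of n * (n - 1) / 2.
    """
    ordered = sorted(records, key=key_func)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + m]:
            yield rec, other


# Hypothetical records, invented for this example.
records = [
    {"name": "Zeta Corp", "zip": "7010"},
    {"name": "Acme Widgets", "zip": "0150"},
    {"name": "Bergen Fish", "zip": "5003"},
    {"name": "ACME Widget", "zip": "0150"},
]
pairs = list(candidate_pairs(records, blocking_key, m=1))
for left, right in pairs:
    print(left["name"], "<->", right["name"])
```

The weakness of this approach is that a typo in the part of the value used for the key can sort two duplicates far apart, so they are never compared.
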
Good research papers
  • Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
      http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
  • Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
  • Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

DUplicate KillEr
  • Duke

Java deduplication engine
  • Work in progress
      • so far spent only ~20 hours on it
      • only command-line batch client built so far
  • Based on Lucene 3.1
  • Open source (on Google Code)
      • http://code.google.com/p/duke/
  • Blazingly fast
      • 960,000 records in 11 minutes on this laptop

Architecture
  • [Diagram: data in, equivalences out]
  • Components shown: SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database

Architecture #2
  • [Diagram: data in, link file out]
  • Components shown: Command-line client, CSV frontend, Datastore API, Duke engine, Lucene
  • More frontends: JDBC, SPARQL, RDF file, ...

Architecture #3
  • [Diagram: data in, equivalences out]
  • Components shown: REST interface, X frontend, Datastore API, Duke engine, Lucene, H2 database

Weaknesses
  • Tied to naïve Bayes model
      • research shows more sophisticated models perform better
      • non-trivial to reconcile these with index lookup
  • Value comparison sophistication limited
      • Lucene does support Levenshtein queries
      • (these are slow, though; will be fast in 4.x)

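For readers unfamiliar with the Levenshtein comparison mentioned above, here is a minimal pure-Python sketch of the edit distance, plus one common way of scaling it to a 0..1 similarity. This is the textbook dynamic-programming algorithm, not Lucene's implementation, and the similarity scaling is an assumption rather than something taken from Duke.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0..1 by the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("acme widgets", "acme widget"))
```

A score like this tolerates the misspellings that exact token matching misses, at the cost of more work per comparison.
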