Deduplication

Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

The problem
- The suppliers table
- Real-world data is very, very messy

The problem – take 2
[Diagram labels: tables Suppliers, Customers, Customers, Customers, Companies; systems CRM, Billing, ERP]
Each of these has internal duplicates, plus duplicates across the tables. No easy fix.

But ... what about identifiers?
- No, there are no system IDs across these tables
- Yes, there are outside identifiers
  - organization number for companies
  - personal number for people
- But, these are problematic
  - many records don't have them
  - they are inconsistently formatted
  - sometimes they are misspelled
  - some parts of huge organizations have the same org number, but need to be treated as separate

First attempt at solution
- I wrote a simple Python script in ~2 hours
- It does the following:
  - load all records
  - normalize the data
    - strip extra whitespace, lowercase, remove letters from org codes...
  - use Bayesian inference for matching
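The normalization step could look roughly like the sketch below. The function names are hypothetical, and stripping every non-digit character from an org code (not only letters) is an assumption about what "remove letters" was meant to achieve.

```python
import re

def normalize_text(value):
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return " ".join(value.split()).lower()

def normalize_org_code(value):
    """Keep only the digits of an organization number.

    Assumption: besides letters, spaces and punctuation are dropped too,
    so that differently formatted codes compare equal.
    """
    return re.sub(r"\D", "", value)
```

With these helpers, `normalize_text("  Acme   Corp ")` and `normalize_text("ACME Corp")` produce the same string, so the later comparison only sees real differences.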

Matching
- This adds up to a probability of 0.93
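One common way to fold several per-field probabilities into a single match probability is the naive Bayes combination sketched here. Whether the original script used exactly this formula is an assumption, and the input values in the usage note are made up; they are not the figures behind the 0.93 above.

```python
from functools import reduce

def combine(p1, p2):
    """Combine two independent pieces of evidence, naive-Bayes style.

    Undefined (division by zero) when one input is 0.0 and the other is 1.0.
    """
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))

def match_probability(field_probabilities):
    """Fold all per-field probabilities together, starting from a neutral 0.5."""
    return reduce(combine, field_probabilities, 0.5)
```

A value above 0.5 pushes the result towards "same entity", a value below 0.5 pulls it away, and 0.5 leaves it unchanged. For example, `match_probability([0.8, 0.8])` is higher than 0.8, while `match_probability([0.8, 0.4])` is lower.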

Problems
- The functions comparing values are still pretty primitive
- Performance is abysmal
  - 90 minutes to process 14,500 records
  - performance is O(n²)
  - total number of records is ~2.5 million
  - time to process all records: 1 year 10 months
- Now what?

An idea
- Well, we don't necessarily need to compare each record with all others if we have indexes
  - we can look up the records which have matching values
- Use DBM for the indexes, for example
- Unfortunately, these only allow exact matching
- But, we can break up complex values into tokens, and index those
- Hang on, isn't this rather like a search engine?
- Bing!
- Let's try Lucene!
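The token-index idea can be sketched with a plain in-memory dictionary standing in for DBM or Lucene. All names here are hypothetical, and the whitespace tokenizer is a deliberate simplification of what a search engine would do.

```python
from collections import defaultdict

def tokenize(value):
    """Split a field value into lowercase, whitespace-separated tokens."""
    return value.lower().split()

def build_index(records):
    """Map each token to the ids of the records that contain it.

    `records` is assumed to be a dict of record id -> dict of field values.
    """
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for value in rec.values():
            for token in tokenize(value):
                index[token].add(rec_id)
    return index

def candidates(index, record, own_id=None):
    """Return ids of records sharing at least one token with `record`."""
    found = set()
    for value in record.values():
        for token in tokenize(value):
            found |= index.get(token, set())
    found.discard(own_id)  # a record is not its own duplicate
    return found
```

Only the candidates returned for a record need the expensive field-by-field comparison, which is what avoids comparing every record with every other one.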

Lucene-based prototype
- I whip out Jython and try it
- New script first builds Lucene index
- Then searches all records against the index
- Time to process 14,500 records: 1 minute
- Now we're talking...

Reality sets in
- A splash of cold water to the face

Prior art
- It turns out people have been doing this before
- They call it
  - entity resolution
  - identity resolution
  - merge/purge
  - deduplication
  - record linkage
  - ...
- This makes Googling for information an absolute nightmare

Existing tools
- Several commercial tools
  - they look big and expensive: we skip those
- Stian found some open source tools
  - Oyster: slow, bad architecture, primitive matching
  - SERF: slow, bad architecture
- I’ve later found more, but was not impressed
- So, it seems we still have to do it ourselves

Finds in the research literature
- General
  - problem is well-understood
  - "naïve Bayes" is naïve
  - lots of interesting work on value comparisons
  - performance problem 'solved' with "blocking"
    - build a key from parts of the data
    - sort records by key
    - compare each record with m nearest neighbours
    - performance goes from O(n²) to O(n m)
  - parallel processing widely used
- Swoosh paper
  - compare and merge should have ICAR¹ properties
  - optimal algorithms for general merge found
  - run-time for 14,000 records ~1.5 hours...

¹ Idempotence, commutativity, associativity, representativity
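The blocking steps listed above (build a key, sort, compare with the m nearest neighbours) can be sketched as follows. The function name and the key used in the usage note are hypothetical; real systems build the key from several fields.

```python
def sorted_neighbourhood_pairs(records, key, m):
    """Yield the candidate pairs for a sorted-neighbourhood pass.

    Records are sorted by `key`, and each record is paired with the `m`
    records that follow it in sorted order. Pairing forwards only still
    covers neighbours on both sides, without yielding any pair twice.
    """
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + m]:
            yield rec, other
```

For instance, `list(sorted_neighbourhood_pairs(names, key=lambda r: r[:3], m=1))` pairs each name only with its immediate successor after sorting on the first three characters, so the number of comparisons grows as n·m instead of n². The trade-off is that duplicates whose keys sort far apart are never compared.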

Good research papers
- Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
- Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
- Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

Java deduplication engine
- Work in progress
  - so far spent only ~20 hours on it
  - only command-line batch client built so far
- Based on Lucene 3.1
- Open source (on Google Code)
  - http://code.google.com/p/duke/
- Blazingly fast
  - 960,000 records in 11 minutes on this laptop

Architecture
[Diagram labels: data in, equivalences out, SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database]

Architecture #2
[Diagram labels: data in, link file out, Command-line client, CSV frontend, Datastore API, Duke engine, Lucene]
More frontends:
- JDBC
- ...
