Efficient Parallel Set-Similarity Joins Using Hadoop (Hadoop Summit 2010)

Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li
Joint work with Michael Carey and Rares Vernica

Motivation: Data Cleaning
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Tom Hanks       Toy Story 3     2010  Animation
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime
Find movies starring Tom Hanks

Movies starring S..warz…ne…ger?
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Tom Hanks       Toy Story 3     2010  Animation
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Similarity Search
Find movies with a star “similar to” Schwarrzenger.
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Record linkage
Table R: Star    Table S: Star
Keanu Reeves     Keanu Reeves
Samuel Jackson   Samuel L. Jackson
Schwarzenegger   Schwarzenegger
…                …

Two-step solution
Step 1: Similarity join between Table R (Star …) and Table S (Star …)
Step 2: Verification

Focus of this talk
Similarity join for large data sets
Techniques applicable to other domains, e.g.:
  Finding similar documents
  Finding customers with similar patterns

Talk Outline
Formulation: set-similarity join
Hadoop-based solutions
Experiments
More results: see SIGMOD 2010 paper

Set-Similarity Join
Find pairs of records whose similarity on their join attributes is greater than a threshold t
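One way to write this definition down, as a sketch in notation that is not on the slide: sim is any set-similarity function and a is an assumed name for the join attribute.

```latex
R \bowtie_{\mathrm{sim} > t} S \;=\; \{\, (r, s) \in R \times S \;:\; \mathrm{sim}(r.a,\, s.a) > t \,\}
```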

Why this formulation?
Word tokens:
  “Samuel L. Jackson” → {Samuel, L., Jackson}
  “Samuel Jackson” → {Samuel, Jackson}
Gram tokens:
  S c h w a r z e n e g g e r
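The two tokenizations can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the gram length q = 2 is an assumption, since the slide does not state which q it uses.

```python
def word_tokens(text):
    """Split a string on whitespace; each word becomes one token."""
    return set(text.split())


def gram_tokens(text, q=2):
    """Return the set of overlapping substrings of length q (q-grams)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


print(sorted(word_tokens("Samuel L. Jackson")))
print(sorted(gram_tokens("Schwarzenegger")))
```

Once every string is a set of tokens, comparing two strings reduces to comparing two sets, which is what the join formulation relies on.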

Set-similarity functions
Jaccard
Dice
Cosine
Hamming
…
All solvable in this framework
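For reference, the usual textbook definitions of these functions on token sets look roughly as follows. This is a sketch that assumes non-empty sets and takes Hamming to mean the size of the symmetric difference (a distance, so smaller means more similar).

```python
import math


def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|"""
    return len(x & y) / len(x | y)


def dice(x, y):
    """2|x ∩ y| / (|x| + |y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def cosine(x, y):
    """|x ∩ y| / sqrt(|x| * |y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def hamming(x, y):
    """Number of tokens that appear in exactly one of the two sets."""
    return len(x ^ y)


r = {"Samuel", "L.", "Jackson"}
s = {"Samuel", "Jackson"}
print(jaccard(r, s), dice(r, s), cosine(r, s), hamming(r, s))
```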

Talk Outline
Formulation of set-similarity join
→ Hadoop-based solutions
Experiments

Why Hadoop?
Large amounts of data
Data or processing does not fit in one machine
Assumptions:
  Self join: R = S
  Two similar sets share at least 1 token

A naïve solution
Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
Reduce: (a,23), (a,29), (a,50), … → Verify each pair
Too much data to transfer
Too many pairs to verify
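The naïve scheme can be simulated in a single Python process to see where the cost comes from. This is a sketch, not Hadoop code: the shuffle is a dictionary, and the token sets of records 29 and 50 are invented for illustration (the slide only shows that they contain token a).

```python
from collections import defaultdict
from itertools import combinations


def naive_map(rid, tokens):
    """Emit one (token, record id) pair per token of the record."""
    for tok in tokens:
        yield tok, rid


def naive_candidates(records):
    # Simulated shuffle: group record ids by token.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok, r in naive_map(rid, tokens):
            groups[tok].append(r)
    # Simulated reduce: every pair of ids sharing a token must be verified.
    pairs = set()
    for rids in groups.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


records = {23: {"a", "b", "c"}, 29: {"a", "d"}, 50: {"a", "b"}}
print(sorted(naive_candidates(records)))
```

A token that occurs in n records yields n(n-1)/2 candidate pairs in its group, so a single frequent token dominates both the data transferred and the verification work.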

Solving frequency skew: prefix filtering
Prefixes of similar sets should share tokens
Sort tokens by frequency (ascending)
Prefix of a set: least frequent tokens
[Figure: prefixes of sets r1 and r2, tokens sorted by frequency]
Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5

Prefix filtering: example
Each set has 5 tokens
“Similar”: they share at least 4 tokens
Prefix length: 2
[Figure: Record 1 and Record 2]
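The prefix length in this example follows from a counting argument: if two sets of size 5 share at least 4 tokens, at most 1 token of each set is missing from the other, so the first 5 - 4 + 1 = 2 tokens of each set (in the common order) must contain a shared token. A minimal sketch, with hypothetical token names A to F standing in for tokens already ranked from rarest to most frequent:

```python
def prefix_length(set_size, min_overlap):
    """Tokens to keep so that sets sharing min_overlap tokens share a prefix token."""
    return set_size - min_overlap + 1


def prefix(tokens, token_rank, min_overlap):
    """Return the rarest tokens of a set, using the global rank (0 = rarest)."""
    ordered = sorted(tokens, key=lambda tok: token_rank[tok])
    return ordered[:prefix_length(len(ordered), min_overlap)]


token_rank = {tok: i for i, tok in enumerate("ABCDEF")}
record1 = {"B", "C", "D", "E", "F"}
record2 = {"A", "C", "D", "E", "F"}

p1 = prefix(record1, token_rank, min_overlap=4)
p2 = prefix(record2, token_rank, min_overlap=4)
print(p1, p2, set(p1) & set(p2))
```

Pairs whose prefixes share no token cannot reach the required overlap and are never generated as candidates.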

Hadoop Solution: Overview
Stage 1: Order tokens by frequency
Stage 2: Find “similar” id pairs
Stage 3: id pairs → record pairs

Stage 1: Sort tokens by frequency
MapReduce phase 1: compute token frequencies
MapReduce phase 2: sort them
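In a single process, the two phases of Stage 1 amount to a count followed by a sort. The sketch below is not the Hadoop implementation; the function name is hypothetical, and breaking frequency ties alphabetically is an assumption made only so the order is deterministic.

```python
from collections import Counter


def token_order(records):
    """Return all tokens sorted by ascending frequency (rarest first)."""
    # Phase 1: count in how many records each token occurs.
    counts = Counter(tok for tokens in records.values() for tok in tokens)
    # Phase 2: sort by frequency; ties broken alphabetically (assumption).
    return sorted(counts, key=lambda tok: (counts[tok], tok))


records = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "d"}}
print(token_order(records))
```

The resulting order is what later stages use to decide which tokens of a record form its prefix.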

Stage 2: Find “similar” id pairs
Partition using prefixes
Verify similarity
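A single-process sketch of Stage 2 under stated assumptions: Jaccard is the similarity function, the prefix length |x| - ceil(t·|x|) + 1 is the standard bound for a Jaccard threshold t, and the token rank is supplied as input. The function names and the sample records are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(x, y):
    return len(x & y) / len(x | y)


def prefix_tokens(tokens, rank, t):
    """Rarest tokens of a set, long enough for Jaccard threshold t."""
    ordered = sorted(tokens, key=lambda tok: rank[tok])
    # Small epsilon guards against float error making the prefix too short.
    min_overlap = math.ceil(t * len(ordered) - 1e-9)
    return ordered[:len(ordered) - min_overlap + 1]


def similar_id_pairs(records, rank, t):
    # "Map": route each record to one group per prefix token.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok in prefix_tokens(tokens, rank, t):
            groups[tok].append((rid, frozenset(tokens)))
    # "Reduce": verify each pair that met in at least one group.
    result = {}
    for members in groups.values():
        ordered = sorted(members, key=lambda m: m[0])
        for (a, x), (b, y) in combinations(ordered, 2):
            if (a, b) not in result and jaccard(x, y) > t:
                result[(a, b)] = jaccard(x, y)
    return result


order = ["d", "e", "w", "x", "y", "z", "a", "b", "c"]  # rarest first
rank = {tok: i for i, tok in enumerate(order)}
records = {1: {"a", "b", "c", "d"}, 2: {"a", "b", "c", "e"}, 3: {"w", "x", "y", "z"}}
print(similar_id_pairs(records, rank, t=0.5))
```

Record 3 never meets records 1 or 2 in any group, so those pairs are never verified. That pruning is the point of partitioning on prefix tokens rather than on all tokens.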

Stage 3: id pairs → record pairs (phase 1)
Bring records for each id in each pair

Stage 3: id pairs → record pairs (phase 2)
Join two half-filled records
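The two phases of Stage 3 can be pictured as two group-by steps. The sketch below runs in one process and uses hypothetical function names; in Hadoop each phase would be its own MapReduce job.

```python
from collections import defaultdict


def phase1(id_pairs, records):
    """Attach the full record to each id of each pair (half-filled pairs)."""
    pairs_by_rid = defaultdict(list)
    for pair in id_pairs:
        for rid in pair:
            pairs_by_rid[rid].append(pair)
    halves = []
    for rid, pairs in pairs_by_rid.items():
        for pair in pairs:
            halves.append((pair, (rid, records[rid])))
    return halves


def phase2(halves):
    """Group the halves by id pair and join them into full record pairs."""
    grouped = defaultdict(dict)
    for pair, (rid, record) in halves:
        grouped[pair][rid] = record
    return {pair: (recs[pair[0]], recs[pair[1]]) for pair, recs in grouped.items()}


records = {1: "Samuel Jackson", 2: "Samuel L. Jackson", 3: "Keanu Reeves"}
print(phase2(phase1([(1, 2)], records)))
```

Carrying only ids through Stage 2 keeps the shuffled data small; the full records are fetched here, and only for pairs that passed verification.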

Talk Outline
Formulation of set-similarity join
Hadoop-based solutions
→ Experiments

Experimental Setting
Hardware: 10-node IBM x3650 cluster
  Intel Xeon processor E5520 2.26GHz with four cores
  Four 300GB hard disks
  12GB RAM
Software:
  Ubuntu 9.06, 64-bit, server edition OS
  Java 1.6, 64-bit, server
  Hadoop 0.20.1
Datasets: publications (DBLP and CITESEERX)

Running time
[Figure: running time broken down by Stage 1, Stage 2, and Stage 3]

Speedup Breakdown
Stage 2 has good speedup

Additional results
Other methods for the 3 stages
Case: R <> S
Dealing with limited memory

Summary
Set-similarity joins in Hadoop:
  Three-stage approach using Hadoop
  Experimental study

Thank you
Chen Li @ UC Irvine
Source code available at: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
Acknowledgements: NSF, Google, IBM.
