Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li
Joint work with Michael Carey and Rares Vernica

Motivation: Data Cleaning
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Tom Hanks       Toy Story 3     2010  Animation
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime
Find movies starring Tom Hanks

Movies starring S..warz…ne…ger?
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Tom Hanks       Toy Story 3     2010  Animation
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Similarity Search
Find movies with a star “similar to” Schwarrzenger.
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Record linkage
Table R (Star)    Table S (Star)
Keanu Reeves      Keanu Reeves
Samuel Jackson    Samuel L. Jackson
Schwarzenegger    Schwarzenegger
…                 …

Two-step solution
Step 1: Similarity Join (Table R and Table S, on Star)
Step 2: Verification

Focus of this talk
Similarity join for large data sets
Techniques applicable to other domains, e.g.:
  • Finding similar documents
  • Finding customers with similar patterns

Talk Outline
  • Formulation: set-similarity join
  • Hadoop-based solutions
  • Experiments
More results: see SIGMOD 2010 paper

Set-Similarity Join
Find pairs of records whose similarity on their join attributes is greater than a threshold t.

Why this formulation?
Word tokens:
  “Samuel L. Jackson” → {Samuel, L., Jackson}
  “Samuel Jackson” → {Samuel, Jackson}
Gram tokens:
  S c h w a r z e n e g g e r

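The two tokenizations on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the gram length q = 2 are assumptions, since the slide does not state which gram length is used.

```python
def word_tokens(text):
    """Word tokens: split on whitespace, one token per word."""
    return set(text.split())


def gram_tokens(text, q=2):
    """Gram tokens: overlapping substrings of length q (q is an assumed parameter).

    Returning a set drops repeated grams, which is a simplification.
    """
    s = text.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


r = word_tokens("Samuel L. Jackson")   # {'Samuel', 'L.', 'Jackson'}
s = word_tokens("Samuel Jackson")      # {'Samuel', 'Jackson'}
shared_words = r & s                   # the two names share 2 of 3 word tokens

# A misspelled name still shares many grams with the correct one.
shared_grams = gram_tokens("Schwarzenegger") & gram_tokens("Schwarrzenger")
```

Once both strings are token sets, "similar strings" becomes "sets with a large overlap", which is what the join operates on.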
Set-similarity functions
  • Jaccard
  • Dice
  • Cosine
  • Hamming
  • …
All solvable in this framework

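For reference, the usual set-based definitions of these functions look as follows. This is a minimal sketch using the standard textbook formulas on unweighted, non-empty sets; the talk itself does not spell the formulas out, so treat the exact variants as an assumption.

```python
import math


def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|"""
    return len(x & y) / len(x | y)


def dice(x, y):
    """2 |x ∩ y| / (|x| + |y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def cosine(x, y):
    """|x ∩ y| / sqrt(|x| * |y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def hamming(x, y):
    """Set Hamming distance: size of the symmetric difference (lower is more similar)."""
    return len(x ^ y)


r = {"Samuel", "L.", "Jackson"}
s = {"Samuel", "Jackson"}
scores = {
    "jaccard": jaccard(r, s),   # 2 / 3
    "dice": dice(r, s),         # 4 / 5
    "cosine": cosine(r, s),     # 2 / sqrt(6)
    "hamming": hamming(r, s),   # 1
}
```

All four depend only on the set sizes and the size of the overlap, which is why a single overlap-based framework can cover them.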
Talk Outline
  • Formulation of set-similarity join
  → Hadoop-based solutions
  • Experiments

Why Hadoop?
  • Large amounts of data
  • Data or processing does not fit in one machine
Assumptions:
  • Self join: R = S
  • Two similar sets share at least 1 token

A naïve solution
Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
Reduce: (a,23), (a,29), (a,50), … → Verify each pair
Problems:
  • Too much data to transfer
  • Too many pairs to verify

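The naïve scheme can be simulated in memory to see why it generates so much work. The sketch below is plain Python rather than Hadoop code; the record contents for ids 29 and 50 are made up for illustration, and the shuffle is imitated with a dictionary.

```python
from collections import defaultdict
from itertools import combinations


def map_record(rid, tokens):
    """Map: emit one (token, record id) pair per token in the record."""
    for tok in tokens:
        yield tok, rid


def reduce_token(rids):
    """Reduce: every pair of ids that share this token is a candidate to verify."""
    return combinations(sorted(rids), 2)


# Hypothetical input; only record 23 = (a, b, c) appears on the slide.
records = {23: {"a", "b", "c"}, 29: {"a", "d"}, 50: {"a", "b"}}

# Simulated shuffle: group the mapper output by token.
groups = defaultdict(list)
for rid, toks in records.items():
    for tok, r in map_record(rid, toks):
        groups[tok].append(r)

candidates = sorted(pair for rids in groups.values() for pair in reduce_token(rids))
# Pair (23, 50) appears twice because the records share both 'a' and 'b'.
```

Every record is replicated once per token, and a frequent token such as 'a' pairs up every record that contains it, which is the transfer and verification cost the slide points out.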
Solving frequency skew: prefix filtering
  • Prefixes of similar sets should share tokens
  • Sort tokens by frequency (ascending)
  • Prefix of a set: least frequent tokens
[Figure: records r1 and r2 with tokens sorted by frequency, prefix marked]
Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5

Prefix filtering: example
  • Each set has 5 tokens
  • “Similar”: they share at least 4 tokens
  • Prefix length: 2
[Figure: Record 1 and Record 2]

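The prefix length in this example follows from a counting argument: if two sets must share at least k tokens, a set of n tokens can have at most k - 1 shared tokens outside its first n - k + 1 tokens, so at least one shared token falls in the prefix. With n = 5 and k = 4 that gives 5 - 4 + 1 = 2. A small sketch, in which the token names, the global order, and the function name are illustrative assumptions:

```python
def prefix(tokens, rank, min_overlap):
    """Return the first len(tokens) - min_overlap + 1 tokens in global rank order."""
    ranked = sorted(tokens, key=rank.__getitem__)
    return ranked[:max(len(ranked) - min_overlap + 1, 0)]


# Assumed global order: 'a' is the least frequent token, 'h' the most frequent.
rank = {tok: i for i, tok in enumerate("abcdefgh")}

r1 = set("bcdef")
r2 = set("cdefg")   # shares c, d, e, f with r1 (4 tokens)
r3 = set("abegh")   # shares only e, g with r2 (2 tokens)

p1 = prefix(r1, rank, 4)   # ['b', 'c']
p2 = prefix(r2, rank, 4)   # ['c', 'd']
p3 = prefix(r3, rank, 4)   # ['a', 'b']

r1_r2_candidate = bool(set(p1) & set(p2))   # True: prefixes share 'c'
r2_r3_candidate = bool(set(p2) & set(p3))   # False: pair pruned without comparison
r1_r3_candidate = bool(set(p1) & set(p3))   # True, although r1 and r3 share only 2 tokens
```

The last case shows that the filter is a necessary condition only: pairs that pass still need to be verified against the full sets.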
Hadoop Solution: Overview
  • Stage 1: Order tokens by frequency
  • Stage 2: Find “similar” id pairs
  • Stage 3: id pairs → record pairs

Stage 1: Sort tokens by frequency
  • Compute token frequencies (MapReduce phase 1)
  • Sort them (MapReduce phase 2)

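Stage 1 amounts to a word count followed by a sort. The sketch below imitates the two phases in a single Python process; it is not the authors' Hadoop code, and breaking ties alphabetically is an assumption made here so that the order is deterministic.

```python
from collections import Counter


def count_tokens(records):
    """Phase 1: count in how many records each token occurs."""
    return Counter(tok for toks in records.values() for tok in toks)


def order_tokens(counts):
    """Phase 2: sort tokens by ascending frequency (ties broken by token, an assumption)."""
    return sorted(counts, key=lambda tok: (counts[tok], tok))


# Hypothetical records: id -> token set.
records = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "d"}}

counts = count_tokens(records)     # a: 3, b: 2, c: 1, d: 1
order = order_tokens(counts)       # rarest tokens first
rank = {tok: i for i, tok in enumerate(order)}
```

The resulting `rank` table is what later stages would consult to put each record's tokens in the global order before taking prefixes.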
Stage 2: Find “similar” id pairs
  • Partition using prefixes
  • Verify similarity

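A compact way to picture Stage 2 is to route each record to one group per prefix token and verify pairs only inside a group. The sketch below runs in memory and uses a minimum-overlap threshold, as in the prefix-filtering example, rather than a Jaccard threshold; the names and sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def prefix(tokens, rank, min_overlap):
    """First len(tokens) - min_overlap + 1 tokens in global rank order."""
    ranked = sorted(tokens, key=rank.__getitem__)
    return ranked[:max(len(ranked) - min_overlap + 1, 0)]


def similar_id_pairs(records, rank, min_overlap):
    # "Map": partition records by each of their prefix tokens.
    groups = defaultdict(list)
    for rid, toks in records.items():
        for tok in prefix(toks, rank, min_overlap):
            groups[tok].append(rid)
    # "Reduce": verify candidate pairs within each partition.
    result = set()   # a set removes pairs found in more than one partition
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if len(records[a] & records[b]) >= min_overlap:
                result.add((a, b))
    return sorted(result)


rank = {tok: i for i, tok in enumerate("abcdefgh")}   # assumed token order
records = {1: set("bcdef"), 2: set("cdefg"), 3: set("abegh")}
pairs = similar_id_pairs(records, rank, min_overlap=4)
```

Here records 1 and 3 meet in the partition for token 'b' but fail verification, while records 2 and 3 never meet at all, so only the pair (1, 2) is reported.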
Stage 3: id pairs → record pairs (phase 1)
  • Bring records for each id in each pair

Stage 3: id pairs → record pairs (phase 2)
  • Join two half-filled records

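The two phases of Stage 3 can be read as two group-by steps: first group by record id to attach the record to every pair that mentions it, then group by pair to combine the two halves. A minimal in-memory sketch, with hypothetical function names and sample records:

```python
from collections import defaultdict


def phase1(id_pairs, records):
    """Group by record id and emit (pair, (id, record)): a half-filled pair."""
    pairs_by_id = defaultdict(list)
    for pair in id_pairs:
        for rid in pair:
            pairs_by_id[rid].append(pair)
    for rid, pairs in pairs_by_id.items():
        for pair in pairs:
            yield pair, (rid, records[rid])


def phase2(half_filled):
    """Group by pair and join the two halves into a full record pair."""
    halves_by_pair = defaultdict(dict)
    for pair, (rid, record) in half_filled:
        halves_by_pair[pair][rid] = record
    return {pair: (halves[pair[0]], halves[pair[1]])
            for pair, halves in halves_by_pair.items()}


records = {1: "Samuel Jackson", 2: "Samuel L. Jackson", 3: "Keanu Reeves"}
id_pairs = [(1, 2)]   # e.g. the output of Stage 2
record_pairs = phase2(phase1(id_pairs, records))
```

Working with ids in Stage 2 and fetching full records only at the end keeps the earlier shuffles small, at the cost of these two extra passes.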
Talk Outline
  • Formulation of set-similarity join
  • Hadoop-based solutions
  → Experiments

Experimental Setting
Hardware:
  • 10-node IBM x3650 cluster
  • Intel Xeon processor E5520 2.26GHz with four cores
  • Four 300GB hard disks
  • 12GB RAM
Software:
  • Ubuntu 9.06, 64-bit, server edition OS
  • Java 1.6, 64-bit, server
  • Hadoop 0.20.1
Datasets: publications (DBLP and CITESEERX)

Running time
[Chart: running time broken down into Stage 1, Stage 2, and Stage 3]

Speedup
[Chart: speedup]

Speedup Breakdown
Stage 2 has good speedup

Scaleup
Good scaleup

Additional results
  • Other methods for the 3 stages
  • Case: R <> S
  • Dealing with limited memory

Summary
Set-similarity joins in Hadoop:
  • Three-stage approach using Hadoop
  • Experimental study

Thank you
Chen Li @ UC Irvine
Source code available at: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
Acknowledgements: NSF, Google, IBM.

Efficient Parallel Set-Similarity Joins Using Hadoop (Hadoop Summit 2010)
