Fast and Scalable Inequality Joins
-- for Data Cleansing on Scale --
Zuhair Khayyat
PhD Candidate @ InfoCloud group
King Abdullah University of Science and Technology (KAUST)
Big Data Cleansing
● Two customers having the same zip cannot be in different cities
● “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)
● “This is the digital universe. It is growing 40% a year into the next decade” -- EMC2
Name     Zip    City
Winnie   91340  San Francisco
Robbert  91340  New York
Emma     91340  San Francisco
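The zip/city rule above can be checked with a single grouping pass. The following is a minimal sketch, not the system's implementation; the function name `fd_violations` and the tuple layout `(name, zip, city)` are assumptions made for illustration.

```python
from collections import defaultdict


def fd_violations(rows):
    """Return {zip: sorted list of cities} for every zip that maps to more than one city.

    rows is an iterable of (name, zip_code, city) tuples (hypothetical layout).
    """
    cities_by_zip = defaultdict(set)
    for _name, zip_code, city in rows:
        cities_by_zip[zip_code].add(city)
    # A zip with two or more distinct cities violates the rule "same zip, same city".
    return {z: sorted(c) for z, c in cities_by_zip.items() if len(c) > 1}


customers = [
    ("Winnie", "91340", "San Francisco"),
    ("Robbert", "91340", "New York"),
    ("Emma", "91340", "San Francisco"),
]
print(fd_violations(customers))  # {'91340': ['New York', 'San Francisco']}
```

Grouping on the left-hand side of the rule avoids comparing every pair of customers, which is why equality-based rules like this one are comparatively cheap to check.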
Big Data Cleansing System
[Diagram: Data (Dirty) → Quality Rules → Violations → Repair Algorithms → Fixes → Data (Partially Clean)]
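The detect/repair cycle in the diagram can be sketched as a loop that stops once no violations remain. This is only an illustration under stated assumptions: the majority-vote repair, the function names `detect`, `repair`, and `cleanse`, and the `max_rounds` cap are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict


def detect(rows):
    """Quality rule (same zip, same city): return groups of row indices that violate it."""
    groups = defaultdict(list)
    for i, (_name, zip_code, _city) in enumerate(rows):
        groups[zip_code].append(i)
    return [idx for idx in groups.values() if len({rows[i][2] for i in idx}) > 1]


def repair(rows, violation):
    """Assumed repair strategy: propose the most frequent city in the group as the fix."""
    majority = Counter(rows[i][2] for i in violation).most_common(1)[0][0]
    return {i: majority for i in violation if rows[i][2] != majority}


def cleanse(rows, max_rounds=10):
    """Alternate detection and repair until the data is clean or max_rounds is reached."""
    rows = list(rows)
    for _ in range(max_rounds):
        violations = detect(rows)
        if not violations:
            break
        for violation in violations:
            for i, city in repair(rows, violation).items():
                name, zip_code, _old_city = rows[i]
                rows[i] = (name, zip_code, city)
    return rows


dirty = [
    ("Winnie", "91340", "San Francisco"),
    ("Robbert", "91340", "New York"),
    ("Emma", "91340", "San Francisco"),
]
print(cleanse(dirty)[1])  # ('Robbert', '91340', 'San Francisco')
```

A repair can in general introduce new violations of other rules, which is why the sketch re-runs detection after each round rather than assuming one pass is enough.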
BigDansing: A System for Big Data Cleansing
In SIGMOD 2015
[Diagram: Dirty datasets, Quality rules, Repair Algorithms → BigDansing → Clean datasets]
BigDansing: A System for Big Data Cleansing
● Functional dependencies
● Inclusion dependencies
● Denial constraints
● Entity resolution
● Domain Specific Language
● Optimized execution plan
BigDansing: A System for Big Data Cleansing
[Diagram: Violations split into S1, S2, S3, each with its own Repair Algorithm]
BigDansing: A System for Big Data Cleansing
[Figure: Time (sec) vs. dataset size (10M, 20M, 40M) for BigDansing on Spark and Spark SQL; rule: two customers having the same zip cannot be in different cities (FD: Zip → City)]
Complex Quality rule: Inequality joins
● If a person has a higher salary, they must pay more taxes than people with lower salaries
● SELECT * FROM D t1 JOIN D t2 ON
t1.Salary > t2.Salary AND
t1.Tax < t2.Tax;
● Processed as a Cartesian product: O(n²)
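The Cartesian-product evaluation of this query can be sketched as a nested loop over all pairs. This is a minimal illustration, assuming rows are `(name, salary, tax)` tuples and using the four example tuples that appear in the IEJoin slides; the function name `naive_inequality_join` is hypothetical.

```python
def naive_inequality_join(rows):
    """Compare every pair (a, b): n * n comparisons for n rows.

    Returns (a, b) name pairs where a earns more than b but pays less tax.
    """
    return [
        (a[0], b[0])
        for a in rows
        for b in rows
        if a[1] > b[1] and a[2] < b[2]
    ]


tuples = [("t1", 100, 5), ("t2", 90, 9), ("t3", 150, 15), ("t4", 120, 10)]
print(naive_inequality_join(tuples))  # [('t1', 't2')]
```

Unlike the zip/city rule, neither predicate is an equality, so there is no key to group or hash on, and the pairwise comparison is what makes this rule expensive at scale.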
Lightning Fast and Space Efficient Inequality Joins
In VLDB 2015
Tuple  Salary  Tax
t1     100     5
t2     90      9
t3     150     15
t4     120     10
● Sort on Salary: t3(150) t4(120) t1(100) t2(90)
● Sort on Tax: t3(15) t4(10) t2(9) t1(5)
● Permutation array: 0 1 3 2 (index positions: 0 1 2 3)
● Bit-array: 0 0 0 0
O(n log n)
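The sort, permutation, and bit-array idea can be sketched as follows. This is a simplified single-machine illustration, not the published IEJoin algorithm: it sorts in ascending order (the slide shows descending order), handles ties by grouping equal tax values, and the name `inequality_join` and the `(name, salary, tax)` layout are assumptions. Only the two sorts are O(n log n) here; the cost of the plain-Python bit-array scan is not analyzed.

```python
from bisect import bisect_right


def inequality_join(rows):
    """Return (a, b) name pairs with a.salary > b.salary and a.tax < b.tax."""
    n = len(rows)
    # Sort on salary; pos[i] is the salary rank of row i (the permutation).
    by_salary = sorted(range(n), key=lambda i: rows[i][1])
    salaries = [rows[i][1] for i in by_salary]
    pos = [0] * n
    for rank, i in enumerate(by_salary):
        pos[i] = rank
    # Sort on tax; rows are visited from the lowest tax upwards.
    by_tax = sorted(range(n), key=lambda i: rows[i][2])
    bits = [0] * n  # bits[rank] == 1 once that row has a strictly lower tax
    result = []
    k = 0
    while k < n:
        # Collect the run of rows sharing the same tax value.
        end = k
        while end < n and rows[by_tax[end]][2] == rows[by_tax[k]][2]:
            end += 1
        for b in by_tax[k:end]:
            # First salary rank strictly greater than b's salary.
            start = bisect_right(salaries, rows[b][1])
            for rank in range(start, n):
                if bits[rank]:
                    result.append((rows[by_salary[rank]][0], rows[b][0]))
        # Mark the whole run as visited only after querying, so equal taxes never match.
        for b in by_tax[k:end]:
            bits[pos[b]] = 1
        k = end
    return result


tuples = [("t1", 100, 5), ("t2", 90, 9), ("t3", 150, 15), ("t4", 120, 10)]
print(inequality_join(tuples))  # [('t1', 't2')]
```

The set bits to the right of a row's salary rank are exactly the rows with a higher salary and a lower tax, so each match is found by reading the bit array instead of re-evaluating both predicates on every pair.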
IEJoin vs. DBMS (Single machine)
[Figure: Runtime (sec, log scale) vs. dataset size (10K, 50K, 100K) for PG-IEJoin, Postgres, MonetDB, DBMS-X]
IEJoin vs. Spark SQL (Distributed)
[Figure: Runtime (hours) for IEJoin on Spark SQL and Spark SQL on the Salary-Tax and Range Intersection queries; 100M rows on 6 machines]
IEJoin on 8B Rows
● A cluster of 16 workers
● 8B rows, 287 GB on Disk
● Runtime: 13 hours
● Close to the ideal speedup
[Figure: Hours vs. cluster size for IEJoin on Spark SQL, compared with the ideal speedup]
Visit us!
● Zuhair Khayyat
– cloud.kaust.edu.sa
● SIGMOD 15 – BigDansing paper
● VLDB 15 – IEJoin paper (to be presented in VLDB 16)
● SIGMOD 16 – Demo paper