This document summarizes Zuhair Khayyat's work on fast and scalable inequality joins for data cleansing. It describes an approach called IEJoin, which sorts the data by the key columns and uses a bit array to efficiently identify violations of inequality rules. An example is the rule that people with higher salaries must pay higher taxes: any pair in which one person earns more than another but pays less tax is a violation. The approach runs in O(n log n) time and scales well in distributed systems. Experimental results show that IEJoin outperforms database systems on inequality joins and can process billions of rows efficiently on a cluster. The work was presented at VLDB 2015 and will be presented at VLDB 2016.
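The sort-plus-bit-array idea can be illustrated with a small Python sketch for the salary/tax rule. This is a simplified illustration under stated assumptions, not the IEJoin implementation from the work summarized here: the function name `inequality_violations` and the `(salary, tax)` tuple layout are hypothetical, a `bytearray` stands in for a packed bit array, and the scan over the array is a plain loop with none of the optimizations a production version would need.

```python
from bisect import bisect_right


def inequality_violations(rows):
    """Return index pairs (i, j) where rows[i] earns strictly more than
    rows[j] but pays strictly less tax. rows is a list of (salary, tax)."""
    n = len(rows)

    # Sort row indices by salary and remember each row's position there.
    by_salary = sorted(range(n), key=lambda i: rows[i][0])
    salaries = [rows[i][0] for i in by_salary]
    pos = [0] * n
    for p, i in enumerate(by_salary):
        pos[i] = p

    # Visit rows in ascending tax order.
    by_tax = sorted(range(n), key=lambda i: rows[i][1])

    # seen[p] == 1 means the row at salary position p has already been
    # visited, i.e. it pays strictly less tax than the current row.
    seen = bytearray(n)  # assumption: one byte per flag, for simplicity

    violations = []
    k = 0
    while k < n:
        # Handle rows with equal tax as one group so that ties are not
        # reported as violations.
        end = k
        while end < n and rows[by_tax[end]][1] == rows[by_tax[k]][1]:
            end += 1
        group = by_tax[k:end]

        for j in group:
            # First salary position strictly greater than row j's salary.
            start = bisect_right(salaries, rows[j][0])
            for p in range(start, n):
                if seen[p]:
                    violations.append((by_salary[p], j))

        # Mark the group as visited only after all its members were checked.
        for j in group:
            seen[pos[j]] = 1
        k = end

    return violations


rows = [(100, 10), (200, 5), (300, 30), (200, 10), (50, 20)]
print(sorted(inequality_violations(rows)))
```

Each reported pair `(i, j)` means that row `i` has the higher salary and the lower tax. Sorting dominates the setup cost, while the inner scan in this sketch is deliberately naive, so it should be read as a demonstration of the data layout rather than of the reported performance.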