De-dup on Hadoop
Neeta Pande
@intuit
Context
• Master Data Management or Entity De-duplication
– Seems new? No, it seems very familiar
– Wikipedia Definition: MDM comprises the processes, standards and
tools that consistently define and manage the critical data of an
organization to provide a single point of reference
– Customer MDM is the most common form across enterprises
• Traditional approach
– Enterprise data stored in RDBMS
– Tools from leading vendors (IBM, Informatica, SAP…)
– Used to provide transactional and analytical value
A solved problem in the RDBMS world, and not seen as a challenging problem by the developer community
Why MDM in Big Data World?
• Large Scale
– Web scale: huge customer base (visitors, trial users, subscribers) vs paid users
– Clickstream, transaction data also used for
– Master data at ecosystem level
– Social data available to leverage
• Real time
– Use Master data for better user experience
– De-dup in real time, i.e. before data enters the transactional systems
• All data available in Hadoop
– Very common today to collect all of the organization’s data in a central location
Leverage data for better user experience and to innovate new capabilities or offerings
Matching and Mastering Components
• Data sources
– Clickstream and other usage sources
– Offerings data from web, mobile, desktop
– Enterprise data (Billing, CRM…)
– Social data (and other external datasets)
• Pipeline stages and their components
– 1 Data Cleansing and Standardization: Standardization library and Web Services
– 2 Dedup (Matching): Matching framework, library and Web Services
– 3 Mastering (Golden Records Creation): Reconciliation library and Web Service
– 4 Serve master (real time lookup, search, batch extraction): Search and real time lookup Web Services
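The four stages can be pictured as a small pipeline. The following is only a minimal sketch in plain Python, not the libraries or web services from the talk: the field names, the exact-key matching rule and the "longest value wins" survivorship rule are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def standardize(record):
    """Stage 1: lowercase, collapse whitespace, keep digits only for phone."""
    clean = {}
    for key, value in record.items():
        text = re.sub(r"\s+", " ", str(value)).strip().lower()
        if key == "phone":
            text = re.sub(r"\D", "", text)
        clean[key] = text
    return clean

def match_key(record):
    """Stage 2 (placeholder rule): same email and phone means same entity."""
    return (record["email"], record["phone"])

def master(group):
    """Stage 3: golden record keeping the longest value seen per field
    (a hypothetical survivorship rule)."""
    golden = {}
    for record in group:
        for key, value in record.items():
            if len(value) > len(golden.get(key, "")):
                golden[key] = value
    return golden

def build_master_index(raw_records):
    """Run stages 1-3; the returned dict stands in for the stage 4 lookup store."""
    groups = defaultdict(list)
    for raw in raw_records:
        record = standardize(raw)
        groups[match_key(record)].append(record)
    return {key: master(group) for key, group in groups.items()}

records = [
    {"name": "Jane  Doe", "email": "Jane@Example.com", "phone": "(555) 010-0000"},
    {"name": "Jane A. Doe", "email": "jane@example.com ", "phone": "555 010 0000"},
    {"name": "Bob Roe", "email": "bob@example.com", "phone": "555-010-1111"},
]
index = build_master_index(records)
print(index[("jane@example.com", "5550100000")])
```

Here the two Jane records collapse into one golden record once standardization has removed the formatting differences, which is why cleansing sits ahead of matching in the flow.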
Matching and Mastering components in Hadoop
• Cleansing and Standardization
– Custom-built cleansing and standardization Java libraries
– Open source libraries and data: OpenNLP, OpenStreetMap, postal reference data, etc.
– REST-based service for real time use
– Batch cleansing and standardization using Pig
• Matching
– Custom-built probabilistic matching algorithm based on heuristics
– Configurable framework: incoming data, thresholds and algorithm selection
– Framework development kept separate from algorithm enrichment (engineering vs. science)
– Deployed as a library, UDFs and web services
• Mastering
– Mastering in Pig, with results stored in HBase
– HBase supports sparse data, versioning, a wide set of attributes, etc.
– Pig-HBase integration and HFile utilities
• Serving
– HBase for real time serving of the master data
– Solr on HBase for real time search and serving of master data
– Hive-HBase integration helps with batch operations on the master data
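A configurable matching framework of the kind described above (incoming fields, thresholds and algorithm selection driven by configuration) might look roughly like the sketch below. It is plain Python rather than a Pig UDF, and the field names, weights, thresholds and the two comparators are assumptions for illustration, not the algorithm from the talk.

```python
from difflib import SequenceMatcher

def exact(a, b):
    """1.0 when the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0

def fuzzy(a, b):
    """Similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

# Algorithm selection is by name, so new comparators can be added
# without touching the scoring framework.
COMPARATORS = {"exact": exact, "fuzzy": fuzzy}

# Hypothetical configuration: per-field algorithm and heuristic weight.
CONFIG = {
    "fields": {
        "email": {"algorithm": "exact", "weight": 0.5},
        "name": {"algorithm": "fuzzy", "weight": 0.3},
        "zip": {"algorithm": "exact", "weight": 0.2},
    },
    "match_threshold": 0.8,
    "review_threshold": 0.4,
}

def score(a, b, config=CONFIG):
    """Weighted sum of per-field similarities; missing values add nothing."""
    total = 0.0
    for field, rule in config["fields"].items():
        left, right = a.get(field, ""), b.get(field, "")
        if left and right:
            total += rule["weight"] * COMPARATORS[rule["algorithm"]](left, right)
    return total

def classify(a, b, config=CONFIG):
    """Map the score onto match / review / no-match using the thresholds."""
    s = score(a, b, config)
    if s >= config["match_threshold"]:
        return "match"
    if s >= config["review_threshold"]:
        return "review"
    return "no-match"

jane = {"email": "jane@example.com", "name": "jane doe", "zip": "94043"}
jane2 = {"email": "jane@example.com", "name": "jane a doe", "zip": "94043"}
jane3 = {"email": "", "name": "jane doe", "zip": "94043"}
bob = {"email": "bob@example.com", "name": "bob roe", "zip": "10001"}
print(classify(jane, jane2), classify(jane, jane3), classify(jane, bob))
```

Keeping the comparators and weights in configuration is one way to let the scoring rules evolve independently of the framework code, in the spirit of the engineering vs. science split noted above.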
More on Matching Techniques
• Simple Probabilistic matching based on Heuristics
– Easy to implement yet powerful, got us going right away
– Well suited to the MapReduce paradigm
– Can be enhanced using Linear Regression techniques with new data
– Domain specific, difficult to generalize
• Clustering techniques for Matching
– Explored Canopy clustering technique in Mahout
– Vectorized strings and distance measures like cosine similarity
– Builds and provides index for real time matching lookup
– Complex; needs more development time invested upfront
– Efficient, more accurate and scalable
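To make the clustering approach concrete, here is a small canopy clustering sketch over vectorized strings with cosine similarity. It is written in plain Python and does not use Mahout; the character-bigram vectorization and the `loose` and `tight` similarity thresholds are assumptions picked for the example.

```python
import math
from collections import Counter

def ngram_vector(text, n=2):
    """Vectorize a string as counts of its character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def canopy_clusters(names, loose=0.3, tight=0.6):
    """Canopy clustering expressed with similarities (loose < tight).

    Every name at least `loose`-similar to the center joins its canopy;
    names at least `tight`-similar are also removed from further
    consideration, so they cannot start a canopy of their own.
    """
    vectors = {name: ngram_vector(name) for name in names}
    remaining = list(names)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        canopy = [center]
        still_remaining = []
        for name in remaining:
            sim = cosine_similarity(vectors[center], vectors[name])
            if sim >= loose:
                canopy.append(name)
            if sim < tight:
                still_remaining.append(name)
        remaining = still_remaining
        canopies.append(canopy)
    return canopies

names = ["intuit inc", "intuit inc.", "acme corp",
         "intuit incorporated", "acme corporation"]
canopies = canopy_clusters(names)
for canopy in canopies:
    print(canopy)
```

Canopies are cheap, possibly overlapping groups, so a more expensive pairwise matcher only needs to compare records that share a canopy. That is where the efficiency gain over comparing every pair comes from.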
Summary
• Master data leveraged today for recommendations or
innovation of new offerings/capabilities
• High level view of capability/patterns on Hadoop platform
• Building an MDM solution involves data engineering,
analysis and data science
• De-dup is at the core; several techniques exist and are
being researched for Big Data
Q&A
