Dedup with Hadoop

De-dup on Hadoop
Neeta Pande
@intuit

Context
• Master Data Management or Entity De-duplication
– Seems new? No, it seems very familiar
– Wikipedia Definition: MDM comprises the processes, standards and
tools that consistently define and manage the critical data of an
organization to provide a single point of reference
– Customer MDM is the most common in all enterprises.
• Traditional approach
– Enterprise data stored in RDBMS
– Tools from leading vendors (IBM, Informatica, SAP, …)
– Used to provide transactional and analytical value
A solved problem in the RDBMS world, and not seen as a challenging problem by the developer community

Why MDM in Big Data World?
• Large Scale
– Web scale: a huge customer base (visitors, trial users, subscribers) vs. paid users
– Clickstream, transaction data also used for
– Master data at ecosystem level
– Social data available to leverage
• Real time
– Use Master data for better user experience
– De-dup in real time, i.e. before data enters the transactional systems
• All data available in Hadoop
– Very common today to collect all of the organization’s data in a central location
Leverage data for better user experience and to innovate new capabilities or offerings

Matching and Mastering Components
• Data sources
– Clickstream and other usage sources
– Offerings data from web, mobile, desktop
– Enterprise data (Billing, CRM….)
– Social data (and other external datasets)
• Processing stages (numbered as in the diagram)
– 1 Data Cleansing and Standardization: standardization library and web services
– 2 Dedup (Matching): matching framework, library and web services
– 3 Mastering (Golden Records Creation): reconciliation library and web service
– 4 Serve master (real time lookup, search, batch extraction): search and real time lookup web services
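
The flow above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the implementation described in the deck: the field names (`name`, `email`, `phone`, `updated`), the exact-key match on email, and the "newest non-empty value wins" survivorship rule are all hypothetical choices made for the example.

```python
from collections import defaultdict

def standardize(record):
    """Stage 1: collapse whitespace and lowercase text fields (illustrative rules)."""
    clean = dict(record)
    clean["name"] = " ".join(record.get("name", "").split()).lower()
    clean["email"] = record.get("email", "").strip().lower()
    return clean

def match(records):
    """Stage 2: group records that share a key. An exact email match stands in
    here for the probabilistic matching the deck describes."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["email"]].append(rec)
    return list(groups.values())

def master(group):
    """Stage 3: build one golden record; the newest non-empty value wins."""
    golden = {}
    for rec in sorted(group, key=lambda r: r["updated"]):
        for field, value in rec.items():
            # An empty value only fills a field that has no value yet.
            if value != "" or field not in golden:
                golden[field] = value
    return golden

def build_master(records):
    """Run stages 1-3 and index golden records by email for lookup (stage 4)."""
    cleaned = [standardize(r) for r in records]
    return {g["email"]: g for g in (master(grp) for grp in match(cleaned))}

raw = [
    {"name": "  Jane   DOE ", "email": "Jane@Example.com", "phone": "", "updated": 1},
    {"name": "jane doe", "email": "jane@example.com ", "phone": "555-0100", "updated": 2},
    {"name": "Raj Patel", "email": "raj@example.com", "phone": "", "updated": 1},
]
master_index = build_master(raw)
```

In this sketch the two Jane records collapse into one golden record that keeps the phone number from the newer record, while `master_index` plays the role of the real time lookup store.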

Matching and Mastering components in Hadoop
• Custom built Cleansing and Standardization Java libraries
– Open source libraries: OpenNLP, OpenStreetMap, postal reference data etc.
– REST based service for real time use
– Batch Cleansing and Standardization using Pig
• Custom built probabilistic matching algorithm based on heuristics
– Configurable framework: incoming data, thresholds and algorithm selection
– Separate framework development from algorithm enrichment (engineering vs. sciences)
– Library, UDFs and web services deployment
• Mastering in Pig and store into HBase
– HBase support for sparse data, versioning, wide set of attributes etc.
– Pig HBase integration and HFile utilities
• HBase for real time serving of the master data
– Solr on HBase for real time search and serving of master data
– Hive HBase integration helps in batch operations on the master data
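
The idea of a configurable matching framework, with field handling, thresholds and algorithm selection kept apart from the algorithms themselves, can be illustrated with a small sketch. Everything named here is hypothetical: the config keys, field weights, threshold values and the two similarity functions are assumptions chosen for the example, not the deck's actual configuration or algorithms.

```python
from difflib import SequenceMatcher

# Algorithm registry: each similarity function returns a score in [0, 1].
def exact(a, b):
    return 1.0 if a == b else 0.0

def fuzzy(a, b):
    return SequenceMatcher(None, a, b).ratio()

ALGORITHMS = {"exact": exact, "fuzzy": fuzzy}

# Hypothetical configuration: which fields to compare, with which algorithm,
# how much each field counts, and where the decision thresholds sit.
CONFIG = {
    "fields": {
        "email": {"algorithm": "exact", "weight": 0.5},
        "name": {"algorithm": "fuzzy", "weight": 0.3},
        "zip": {"algorithm": "exact", "weight": 0.2},
    },
    "match_threshold": 0.8,
    "review_threshold": 0.5,
}

def score(rec_a, rec_b, config=CONFIG):
    """Weighted average of per-field similarities; fields missing on either side are skipped."""
    total, weight_sum = 0.0, 0.0
    for field, spec in config["fields"].items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue
        total += spec["weight"] * ALGORITHMS[spec["algorithm"]](a, b)
        weight_sum += spec["weight"]
    return total / weight_sum if weight_sum else 0.0

def classify(rec_a, rec_b, config=CONFIG):
    """Map the score onto match / review / no_match using the configured thresholds."""
    s = score(rec_a, rec_b, config)
    if s >= config["match_threshold"]:
        return "match"
    if s >= config["review_threshold"]:
        return "review"
    return "no_match"
```

With this split, a new similarity function is added by registering it in `ALGORITHMS`, and tuning happens in `CONFIG` without touching the scoring code. The same `score` function could be wrapped as a Pig UDF or a web service, which is the deployment pattern the slide lists.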

More on Matching Techniques
• Simple Probabilistic matching based on Heuristics
– Easy to implement yet powerful, got us going right away
– Well suited for Map Reduce Paradigm
– Can be enhanced using Linear Regression techniques with new data
– Domain specific, difficult to generalize
• Clustering techniques for Matching
– Explored Canopy clustering technique in Mahout
– Vectorized strings and distance measures like cosine similarity
– Builds and provides index for real time matching lookup
– Complex; requires more development time upfront
– Efficient, better accuracy and scalable
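
The clustering approach can be sketched in plain Python. This is not Mahout's implementation: it is a minimal single-machine illustration of canopy clustering over strings vectorized as character bigram counts, with cosine distance as the measure. The function names and the threshold values `t1` and `t2` are assumptions made for the example.

```python
import math
from collections import Counter

def bigram_vector(text):
    """Vectorize a string as counts of its character bigrams (padded with spaces)."""
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0

def canopy_cluster(names, t1=0.6, t2=0.3):
    """Canopy clustering with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Each pass takes the first remaining item as a canopy center. Items closer
    than t1 join the canopy; items closer than t2 are also removed from the
    candidate list, so they cannot start or join a later canopy.
    """
    vectors = {name: bigram_vector(name) for name in names}
    remaining = list(names)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        canopy = [center]
        still_remaining = []
        for name in remaining:
            d = cosine_distance(vectors[center], vectors[name])
            if d < t1:
                canopy.append(name)
            if d >= t2:
                still_remaining.append(name)
        remaining = still_remaining
        canopies.append(canopy)
    return canopies

names = ["john smith", "jon smith", "maria garcia", "john smyth", "mariah garcia"]
canopies = canopy_cluster(names)
```

The canopies act as a coarse index: detailed pairwise matching only needs to run between records that share a canopy, which is what makes the technique attractive at scale.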

Summary
• Master data leveraged today for recommendations or
innovation of new offerings/capabilities
• High level view of capability/patterns on Hadoop platform
• Building an MDM solution involves data engineering,
analysis and data science
• De-dup is at the core and several techniques exist and are
being researched for Big Data
