This document discusses building a master data management (MDM) solution on Hadoop to enable entity de-duplication at scale. It presents the key components of an MDM solution for Hadoop, including data cleansing, standardization, matching/deduplication, and mastering. It describes using probabilistic and clustering techniques to match records across large and diverse data sources in both real-time and batch modes. The solution architecture stores mastered data in HBase for serving and analytics, and integrates Hadoop ecosystem tools such as Pig, Hive, and Solr.
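To make the matching/deduplication and mastering steps concrete, the following is a minimal, illustrative sketch of the general idea: score candidate record pairs with a weighted field-similarity measure (a simple stand-in for a probabilistic match model), then cluster records whose score clears a threshold so each cluster becomes one mastered entity. The field names, weights, and threshold are assumptions for demonstration only and are not taken from the document or tied to its Hadoop/HBase implementation.

```python
# Toy sketch of probabilistic matching + clustering for entity de-duplication.
# All record fields, weights, and the 0.85 threshold are illustrative assumptions.
from difflib import SequenceMatcher
from itertools import combinations

# Toy "customer" records, e.g. from two source systems.
records = [
    {"id": 1, "name": "Jon Smith",  "addr": "12 Main St"},
    {"id": 2, "name": "John Smith", "addr": "12 Main Street"},
    {"id": 3, "name": "Jane Doe",   "addr": "99 Oak Ave"},
]

WEIGHTS = {"name": 0.6, "addr": 0.4}
THRESHOLD = 0.85

def field_sim(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2):
    """Weighted combination of per-field similarities (a simplified
    stand-in for a trained probabilistic match model)."""
    return sum(w * field_sim(r1[f], r2[f]) for f, w in WEIGHTS.items())

# Union-find to cluster records whose pairwise score clears the threshold.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for r1, r2 in combinations(records, 2):
    if match_score(r1, r2) >= THRESHOLD:
        union(r1["id"], r2["id"])

# Each cluster of source record IDs becomes one "mastered" entity.
clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])
print(clusters)  # e.g. {2: [1, 2], 3: [3]}
```

In a production pipeline this pairwise comparison would be blocked (e.g., by standardized name or postal code) to avoid comparing every record against every other, and the resulting mastered entities would be written to a serving store such as HBase.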