This document discusses migrating complex data aggregations from Hadoop to Spark. It outlines PubMatic's use cases, which involve large-scale data and complex data flows. PubMatic developed an industry-first real-time analytics solution and has to cope with ever-increasing data scale and complexity. The team faced challenges with hardware costs, complex data flows, and cardinality estimation for billions of users. Three use cases are presented in which Spark was faster than Hive: by 85% for multi-stage workflows, by 25-30% for cardinality estimation, and by 150% for grouping sets. Spark configuration tuning and the challenges encountered are also discussed.