Migrating Complex Data Aggregations from Hadoop to Spark
puneet.kumar@pubmatic.com
ashish.singh@pubmatic.com
Agenda
• Aggregation and Scale @PubMatic
• Problem Statement: Why Spark when we run Hadoop
• 3 Use Cases
• Configuration Tuning
• Challenges & Learnings
Who are we?
• Marketing Automation Software Company
• Developed the industry's first Real Time Analytics Solution
PubMatic Analytics
Data : Scale & Complexity
Challenges on Current Stack
• Ever-increasing hardware costs
• Complex data flows
• Cardinality estimation: estimating distinct users at billion scale
• Multiple grouping sets
• Different flows for real-time and batch analytics
[Figure: Data Flow Diagram: Batch]
3 Use Cases
• Cardinality Estimation
• Multi Stage Workflows
• Grouping Sets
Why Spark ?
• Efficient for dependent Data Flows
• Memory : Cheaper (Moore’s Law)
• Optimized Hardware Usage
• Unified stack for Real Time & Batch
• Awesome Scala APIs
Case 1: Cardinality Estimation
[Chart: run time (sec) vs. input size]
Spark is ~25-30% faster than Hive on MR
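The slides do not show how the distinct-user estimates are computed, though the conclusions mention HLL (HyperLogLog). As a rough illustration of that technique only, here is a minimal pure-Python sketch. The class name `HyperLogLog`, the SHA-1 hash, and the 2^12 register count are assumptions made for this example, not details from the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not the talk's implementation)."""

    def __init__(self, p=12):
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Sketches with the same p combine by taking the per-register maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


# Estimate the number of distinct (hypothetical) user ids.
hll = HyperLogLog()
for user_id in range(100_000):
    hll.add(f"user-{user_id}")
estimate = hll.count()
```

Because two sketches merge by a per-register maximum, partial sketches built on separate partitions can be combined without revisiting the raw data, which is what makes this approach a natural fit for distributed aggregation.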
Case 2: Multi Stage Data Flow
[Chart: run time (sec) vs. input size]
Spark is ~85% faster than Hive on MR
Case 3: Grouping Sets
[Chart: run time (sec, axis 0-800) of Spark and Hive queries at input sizes of 192 GB, 384 GB, and 768 GB]
Spark is ~150% faster than Hive on MR
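For readers unfamiliar with grouping sets: the idea is to compute several GROUP BY aggregations over the same input in one pass, which SQL engines express with a GROUP BY ... GROUPING SETS clause. The sketch below mimics that in plain Python. The rows, the column names, and the helper `aggregate_grouping_sets` are hypothetical and only illustrate the concept; they are not the queries benchmarked in the talk.

```python
from collections import defaultdict

# Hypothetical ad-impression rows; field names are illustrative only.
rows = [
    {"publisher": "p1", "country": "US", "impressions": 10},
    {"publisher": "p1", "country": "IN", "impressions": 5},
    {"publisher": "p2", "country": "US", "impressions": 7},
    {"publisher": "p2", "country": "US", "impressions": 3},
]


def aggregate_grouping_sets(rows, grouping_sets, measure):
    """Sum `measure` for every grouping set in a single pass over `rows`."""
    totals = {gs: defaultdict(int) for gs in grouping_sets}
    for row in rows:
        for gs in grouping_sets:
            # The key is the tuple of this row's values for the set's columns.
            key = tuple(row[col] for col in gs)
            totals[gs][key] += row[measure]
    return {gs: dict(t) for gs, t in totals.items()}


# Three grouping sets: per publisher, per publisher and country, grand total.
grouping_sets = [("publisher",), ("publisher", "country"), ()]
result = aggregate_grouping_sets(rows, grouping_sets, "impressions")
```

The input is read once regardless of how many grouping sets are requested, whereas running one GROUP BY query per set would scan it repeatedly.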
Challenges faced
• Spark on YARN: executors did not use their full memory
• Reading nested Avro schemas was tedious up to Spark 1.2
• Had to rewrite code to leverage Spark-Avro with Spark 1.3 (DataFrames)
• Joins and deduplication were slower on Spark than on Hive
Important Performance Params
• SET spark.default.parallelism
• SET spark.serializer: Kryo serialization improved the runtime.
• SET spark.sql.inMemoryColumnarStorage.compressed: Snappy compression, set to true
• SET spark.sql.inMemoryColumnarStorage.batchSize: increased to a higher, tuned value
• SET spark.shuffle.memorySize
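One possible way to apply such settings is to pass them to `spark-submit` as `--conf key=value` pairs. The helper below, `spark_submit_conf_args`, is a hypothetical convenience function that renders a dictionary of properties into that form. The values shown are placeholders rather than the settings used in the talk, and only the four properties whose names are confirmed Spark settings are included.

```python
def spark_submit_conf_args(conf):
    """Render a dict of Spark properties as spark-submit `--conf` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Placeholder values (assumptions); tune them for the actual workload.
tuning = {
    "spark.default.parallelism": 200,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.inMemoryColumnarStorage.compressed": "true",
    "spark.sql.inMemoryColumnarStorage.batchSize": 10000,
}

args = spark_submit_conf_args(tuning)
command = ["spark-submit"] + args  # append the application jar or script here
```

Keeping the settings in one dictionary makes it easy to version them alongside the job and to vary a single parameter between benchmark runs.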
Memory Based Architecture
[Diagram: Flow 1, Flow 2, Flow 3; In Memory Distributed Store; HDFS, S3]
Conclusions:
• Spark multi-stage workflows were 85% faster than Hive on MR
• Single-stage workflows did not see huge benefits
• HLL mask generation and heavy jobs finished 20-30% faster
• Use in-memory distributed storage with Spark for multiple jobs on the same input
• Overall hardware cost is expected to decrease by ~35% due to Spark usage (more memory, fewer nodes)
THANK YOU!

Migrating Complex Data Aggregation from Hadoop to Spark (Ashish Singh and Puneet Kumar, PubMatic)
