Democratizing Data Science using Apache Spark, Hive and Druid
● Pushkar Priyadarshi
● Igor Yurinok
● Michael Dreibelbis
Intro
● Game studio produces massive mobile games that break
down linguistic and geographic barriers by uniting an
unprecedented number of global players in one gaming
world. Games are played in 180+ countries.
● Performance marketing platform Cognant enables marketing
for our internal games as well as external businesses across
250+ channels. It merges extensive mobile ad buying
expertise with a live data platform to not only deliver true ROI
on mobile marketing spend but also eliminate endless fraud and
tiresome make-goods in the process.
Machine Zone (mz.com)
● 40 billion messages/day
● Kafka cluster handling 250+ topics over 4k partitions
● 3 Hadoop clusters, the largest spanning 300 nodes
● 5 PB of unreplicated data in the Hadoop ecosystem
● Ads published on 100k apps in nearly 200 countries, serving an average of
750 million impressions a day and peaking at 1B/day
● Data from 300 distinct sources
● Druid cluster containing 30+ data sources holding 50 TB of data
Data @ MZ
● Data Ingestion
○ Ingest raw data from external entities
● Data Normalization
○ Normalize data using transformation framework
● Model Generation
○ Create Model using model generation framework
● Generate predictions
● Second layer of Intelligence
○ Campaign Initialization
○ Campaign Optimization
● Data Service Framework
Overview
Data Ingestion
[Diagram: S3, FTP, REST, and Email sources; Reader; Delegator; Writer; RAW Store]
Data Ingestion (cont’d)
● DataReaders extract data from various types of sources
○ S3 - Reporting data accessed from an Amazon S3 bucket
○ REST - Reporting data from an HTTP endpoint
○ FTP - Similar to S3, loads from a file system
○ Email - Scan inbox and extract valid reports
● DataWriters output data to HDFS
○ HIVE external tables
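The reader/writer split above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the framework from the talk: the `DataReader`, `DataWriter`, and `Delegator` names mirror the slide labels, but their interfaces, the in-memory stand-ins, and the record format are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Protocol

Record = Dict[str, str]


class DataReader(Protocol):
    """Hypothetical interface: pulls raw report rows from one source type."""
    def read(self) -> Iterable[Record]: ...


class DataWriter(Protocol):
    """Hypothetical interface: persists rows to the raw store."""
    def write(self, source: str, records: Iterable[Record]) -> int: ...


@dataclass
class InMemoryReader:
    """Stand-in for an S3/REST/FTP/Email reader; returns canned rows."""
    rows: List[Record]

    def read(self) -> Iterable[Record]:
        return iter(self.rows)


@dataclass
class InMemoryWriter:
    """Stand-in for an HDFS writer backing a Hive external table."""
    store: Dict[str, List[Record]] = field(default_factory=dict)

    def write(self, source: str, records: Iterable[Record]) -> int:
        batch = list(records)
        self.store.setdefault(source, []).extend(batch)
        return len(batch)


@dataclass
class Delegator:
    """Routes each registered reader's output to the writer."""
    writer: DataWriter
    readers: Dict[str, DataReader] = field(default_factory=dict)

    def register(self, source: str, reader: DataReader) -> None:
        self.readers[source] = reader

    def run(self) -> Dict[str, int]:
        # Returns the number of rows written per source.
        return {name: self.writer.write(name, reader.read())
                for name, reader in self.readers.items()}


writer = InMemoryWriter()
delegator = Delegator(writer)
delegator.register("s3", InMemoryReader([{"campaign": "a", "spend": "10"}]))
delegator.register("rest", InMemoryReader([{"campaign": "b", "spend": "7"},
                                           {"campaign": "c", "spend": "3"}]))
counts = delegator.run()
```

Keeping readers and writers behind small interfaces means a new source type only needs a new reader; the delegator and the raw store layout stay unchanged.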
Data Normalization
[Diagram: Rule Based Transformation Engine. RT and RAW inputs; Rules Store, HDFS; Rules Loader; Rules Parser; Rules Applier; Druid]
● Streaming: real-time data source
○ Kafka + Spark Streaming => Tranquility => Druid
● Batch: historical backfill from the raw data source
○ Spark => Druid
● Rule based transformation engine (normalizer)
○ Built using Apache Spark
○ Custom DDL for defining column transformation rules
Data Normalization (cont’d)
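To make the idea of rule-driven column transformation concrete, here is a small sketch in plain Python. The talk does not show its custom DDL, so the `target = op(source)` rule syntax, the `OPS` table, and the function names below are invented for illustration only; the real engine runs on Apache Spark rather than on Python lists.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

# Hypothetical operations a rule may name.
OPS: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "lower": str.lower,
    "trim": str.strip,
}


def parse_rule(line: str) -> Tuple[str, str, str]:
    """Parse 'target = op(source)' into (target, op, source)."""
    target, expr = (part.strip() for part in line.split("=", 1))
    op, _, rest = expr.partition("(")
    op = op.strip()
    source = rest.rstrip(")").strip()
    if op not in OPS:
        raise ValueError(f"unknown operation: {op!r}")
    return target, op, source


def apply_rules(rows: List[Row], rules: List[str]) -> List[Row]:
    """Apply every rule to every row, producing normalized rows."""
    parsed = [parse_rule(rule) for rule in rules]
    return [
        {target: OPS[op](row[source]) for target, op, source in parsed}
        for row in rows
    ]


rules = ["country = upper(geo)", "partner = trim(partner_name)"]
raw = [{"geo": "us", "partner_name": "  AcmeAds "}]
normalized = apply_rules(raw, rules)
```

Because the rules are data rather than code, the same applier can serve both the streaming and the batch path; only the rule set differs per source.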
● Machine Learning Pipeline based on Apache Spark ML
○ Feature Engineering
○ Model Training
○ Predictions
○ Model Testing/Tuning
○ Model Deployment
MLPlatform
● Feature Engineering extensions
○ Aggregator => NumericAggregator
● Perform aggregate transformation on input Dataset
MLPlatform (cont’d)
● Feature Engineering extensions
○ ParallelCountVectorizer
■ Compute CountVectorizer per input column
○ ParallelIDF
■ Compute IDF per input column
MLPlatform (cont’d)
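The per-column idea behind ParallelCountVectorizer and ParallelIDF can be illustrated without Spark. The sketch below is plain Python and is not the extension from the talk: the function names and sample rows are assumptions, and the smoothed formula log((n + 1) / (df + 1)) is the one documented for Spark ML's IDF.

```python
import math
from collections import Counter
from typing import Dict, List

Row = Dict[str, List[str]]  # column name -> list of tokens


def count_vectors(rows: List[Row], column: str) -> List[Counter]:
    """Term counts per row for a single column."""
    return [Counter(row[column]) for row in rows]


def idf(vectors: List[Counter]) -> Dict[str, float]:
    """Smoothed IDF: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(vectors)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for vec in vectors for term in vec)
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def per_column_idf(rows: List[Row], columns: List[str]) -> Dict[str, Dict[str, float]]:
    """Fit counts and IDF independently for each input column."""
    return {col: idf(count_vectors(rows, col)) for col in columns}


rows = [
    {"title": ["war", "game"], "tags": ["mobile"]},
    {"title": ["game"], "tags": ["mobile", "strategy"]},
]
weights = per_column_idf(rows, ["title", "tags"])
```

Each column gets its own vocabulary and weights, so a term that is common in one column does not dilute its weight in another.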
● Feature Engineering extensions
○ DAGPipeline
■ Support multi-input dataset DAG based feature extraction
MLPlatform (cont’d)
[Diagram: DAGModel generated, with nodes n1, n2, n3, n4]
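A DAG-based pipeline evaluates each node only after its upstream nodes. The sketch below shows that ordering with Python's standard `graphlib`; it is not the DAGPipeline implementation. The node names n1 to n4 come from the slide, but the edges (n1 and n2 feeding n3, which feeds n4) and the toy datasets are assumptions.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# node name -> (upstream node names, function taking upstream outputs in order)
Dag = Dict[str, Tuple[List[str], Callable[..., Any]]]


def run_dag(dag: Dag) -> Dict[str, Any]:
    """Evaluate every node after all of its upstream nodes."""
    sorter = TopologicalSorter({name: deps for name, (deps, _) in dag.items()})
    results: Dict[str, Any] = {}
    for name in sorter.static_order():
        deps, fn = dag[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


# Two input datasets (n1, n2) are joined in n3, then turned into a feature in n4.
dag: Dag = {
    "n1": ([], lambda: {"u1": 3, "u2": 5}),        # hypothetical installs per user
    "n2": ([], lambda: {"u1": 1.5, "u2": 0.5}),    # hypothetical spend per user
    "n3": (["n1", "n2"], lambda a, b: {k: (a[k], b[k]) for k in a}),
    "n4": (["n3"], lambda joined: {k: i * s for k, (i, s) in joined.items()}),
}
features = run_dag(dag)["n4"]
```

A linear pipeline is the special case where every node has exactly one upstream node; the DAG form is what allows several input datasets to meet at a join.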
● Model Testing/Tuning
○ Feature Store
■ Rapid iterative model testing
○ Configurable Split-Testing
○ Model Store
■ Based on SparkML MLWritable
● Predictions
○ Can be generated using any version of model
○ Compared across model implementations
MLPlatform (cont’d)
● Predictions using Apache Zeppelin based visualization layer
○ Notebooks allow for rapid testing and model iteration
○ Graphing library allows for instant visual feedback
MLPlatform (cont’d)
What is output from ML Models?
● Predictions
What is business value of it?
● Not much
What does business need?
● Translate predictions into ad partner instructions
Second Layer of Intelligence
A partner instruction is a command that a partner can/should execute:
● Create a new campaign
● Update Budget
● Update Bid
● Update Targeting
● Update Creative Asset
What are Partner Instructions?
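One way to represent such an instruction is as a small, serializable record. The sketch below is an assumption about shape, not the format used in the talk: the `PartnerInstruction` fields, the `Action` values, and the sample partner and campaign identifiers are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict


class Action(Enum):
    """The five instruction types listed on the slide."""
    CREATE_CAMPAIGN = "create_campaign"
    UPDATE_BUDGET = "update_budget"
    UPDATE_BID = "update_bid"
    UPDATE_TARGETING = "update_targeting"
    UPDATE_CREATIVE = "update_creative"


@dataclass(frozen=True)
class PartnerInstruction:
    partner: str
    campaign_id: str
    action: Action
    params: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize for delivery, for example over a REST API."""
        data = asdict(self)
        data["action"] = self.action.value  # enums are not JSON-serializable
        return json.dumps(data, sort_keys=True)


instruction = PartnerInstruction(
    partner="example-network",
    campaign_id="c-123",
    action=Action.UPDATE_BID,
    params={"bid_usd": 1.25},
)
payload = instruction.to_json()
```

Restricting `action` to an enum keeps the set of commands closed, so a consumer can reject anything a partner cannot execute.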
Campaign Initialization:
● Bid
○ Finds the best possible bid to create campaigns
● Budget
○ Splits total budget between partners
● Targeting
○ Generates sets of possible targeting groups (Gender, Age, GEO)
● Creative
○ Generates and assigns creatives
Campaign Initialization Process
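As a toy illustration of the budget step, the sketch below splits a total budget across partners in proportion to a predicted score. The talk does not describe its allocation method, so the proportional rule, the even-split fallback, the `split_budget` name, and the sample numbers are all assumptions.

```python
from typing import Dict


def split_budget(total: float, predicted_roi: Dict[str, float]) -> Dict[str, float]:
    """Split `total` across partners in proportion to predicted ROI."""
    if not predicted_roi:
        return {}
    # Negative predictions get no budget rather than a negative share.
    weights = {p: max(r, 0.0) for p, r in predicted_roi.items()}
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        # No partner is predicted to perform: fall back to an even split.
        return {p: total / len(weights) for p in weights}
    return {p: total * w / weight_sum for p, w in weights.items()}


budgets = split_budget(
    1000.0, {"partner_a": 3.0, "partner_b": 1.0, "partner_c": 0.0}
)
```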
Campaign Optimization:
● Bid
○ Increase or decrease bids per campaign based on performance prediction
● Budget
○ Increase, decrease, and reshuffle budget across partners/campaigns
● Targeting
○ Update targeting based on performance
● Creative
○ Reassign creatives based on performance
Campaign Optimization Process
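The bid step can be pictured as a bounded adjustment driven by a prediction. The sketch below is one plausible rule, not the optimizer from the talk: scaling by predicted over target ROI, the 20% step limit, the minimum bid, and the `adjust_bid` name are all assumptions.

```python
def adjust_bid(current_bid: float, predicted_roi: float, target_roi: float,
               max_step: float = 0.2, min_bid: float = 0.01) -> float:
    """Scale the bid by predicted/target ROI, limited to +/- max_step per run."""
    if target_roi <= 0:
        raise ValueError("target_roi must be positive")
    ratio = predicted_roi / target_roi
    # Clamp the multiplier so a single run cannot move the bid too far.
    factor = min(1.0 + max_step, max(1.0 - max_step, ratio))
    return max(min_bid, round(current_bid * factor, 4))


raised = adjust_bid(1.00, predicted_roi=1.5, target_roi=1.0)   # capped at +20%
lowered = adjust_bid(1.00, predicted_roi=0.9, target_roi=1.0)  # 10% decrease
```

Capping the step size trades reaction speed for stability: a single noisy prediction cannot swing a campaign's bid by more than the configured fraction.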
Campaign Initialization/Optimization Process
[Diagram: ML Output (Predictions) and Historical Data; Initializer; Optimizer; Ad Partner Instructions; Ad Partner]
Where to store metadata for Data Pipelines?
Where to store Ad Partner Instructions?
How to deliver Ad Partner Instructions?
Data Service Framework
Possible Microservices:
● Ad Partner Data Service
● Campaign Data Service
● ASP Data Service
● Ad Partner Instruction Service
Data Service Framework (cont’d)
Technologies:
● REST API
● Spring Boot
● OpenShift / Kubernetes
● Gradle + Jenkins Pipelines for CI/CD
Data Service Framework (cont’d)
Connect All Components Together
[Diagram: Data Ingestion; Data Normalization; MLPlatform; Ad Partner; Data Services]
Questions???
