Democratizing Data Science using Apache Spark, Hive and Druid

Intro
● Pushkar Priyadarshi
● Igor Yurinok
● Michael Dreibelbis

Machine Zone (mz.com)
● The game studio produces massive mobile games that break down linguistic and geographic barriers by uniting an unprecedented number of global players in one gaming world. Games are played in 180+ countries.
● The performance marketing platform, Cognant, enables marketing for our internal games as well as external businesses over 250+ channels. It merges extensive mobile ad buying expertise with a live data platform to not only deliver true ROI on mobile marketing spend but also eliminate endless fraud and tiresome make-goods in the process.

Data @ MZ
● 40 billion messages/day
● Kafka cluster handling 250+ topics over 4k partitions
● 3 Hadoop clusters, the largest spanning 300 nodes
● 5 PB of unreplicated data in the Hadoop ecosystem
● Ads published on 100k apps in nearly 200 countries, serving an average of 750 million impressions a day and peaking at 1B/day
● Data from 300 distinct sources
● Druid cluster containing 30+ data sources holding 50 TB of data

Overview
● Data Ingestion
○ Ingest raw data from external entities
● Data Normalization
○ Normalize data using the transformation framework
● Model Generation
○ Create models using the model generation framework
● Generate predictions
● Second layer of Intelligence
○ Campaign Initialization
○ Campaign Optimization
● Data Service Framework

Data Ingestion
[Diagram labels: S3, FTP, REST and Email sources; Readers; Delegator; Writers; RAW Store]

Data Ingestion (cont’d)
● DataReaders extract data from various types of sources
○ S3 - reporting data read from Amazon S3 buckets
○ REST - reporting data fetched from HTTP endpoints
○ FTP - similar to S3, loads from a file system
○ Email - scans an inbox and extracts valid reports
● DataWriters output data to HDFS
○ Hive external tables
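
The reader, delegator and writer roles above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `Delegator` and `InMemoryWriter`, the fake readers and the record fields are all hypothetical, and the writer keeps rows in memory where a real DataWriter would write to HDFS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical reader type: a callable that yields raw records from one source.
Reader = Callable[[], Iterable[Record]]


@dataclass
class InMemoryWriter:
    """Stand-in for a DataWriter; a real one would write files to HDFS."""
    rows: List[Record] = field(default_factory=list)

    def write(self, source: str, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            # Tag every row with the source it came from.
            self.rows.append({"source": source, **record})
            count += 1
        return count


class Delegator:
    """Runs every registered reader and hands its records to the writer."""

    def __init__(self, writer: InMemoryWriter) -> None:
        self.writer = writer
        self.readers: Dict[str, Reader] = {}

    def register(self, source: str, reader: Reader) -> None:
        self.readers[source] = reader

    def run(self) -> Dict[str, int]:
        # Returns the number of records written per source.
        return {name: self.writer.write(name, reader())
                for name, reader in self.readers.items()}


def fake_s3_reader() -> Iterable[Record]:
    # Placeholder data; a real reader would list and parse objects in a bucket.
    yield {"campaign": "c1", "spend": "10.5"}
    yield {"campaign": "c2", "spend": "3.0"}


def fake_rest_reader() -> Iterable[Record]:
    # Placeholder data; a real reader would call an HTTP reporting endpoint.
    yield {"campaign": "c1", "installs": "42"}


writer = InMemoryWriter()
delegator = Delegator(writer)
delegator.register("s3", fake_s3_reader)
delegator.register("rest", fake_rest_reader)
counts = delegator.run()
```

Keeping readers as plain callables means a new source type only needs one more `register` call.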

Data Normalization
[Diagram: Rule Based Transformation Engine. Labels: RT, RAW, Rules Loader, Rules Store, HDFS, Rules Parser, Rules Applier, Druid]

Data Normalization (cont’d)
● Streaming: real-time data source
○ Kafka + Spark Streaming => Tranquility => Druid
● Batch: historical backfill from the raw data source
○ Spark => Druid
● Rule-based transformation engine (normalizer)
○ Built using Apache Spark
○ Custom DDL for defining column transformation rules
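
The slides do not show the custom DDL, so the following is only a sketch of the general idea of a rule-based column normalizer. The rule syntax `target = op(source)`, the operation names and the column names are all invented for illustration, and the rules are applied to a plain dictionary rather than a Spark Dataset.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    target: str   # output column
    op: str       # name of the transformation
    source: str   # input column


# Hypothetical operations; the real engine's rule vocabulary is not shown in the slides.
OPS: Dict[str, Callable[[str], object]] = {
    "upper": str.upper,
    "trim": str.strip,
    "to_float": float,
}

# Hypothetical rule syntax: "<target> = <op>(<source>)"
RULE_RE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\(\s*(\w+)\s*\)\s*$")


def parse_rules(text: str) -> List[Rule]:
    """Parse one rule per non-empty line."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = RULE_RE.match(line)
        if match is None:
            raise ValueError(f"cannot parse rule: {line!r}")
        target, op, source = match.groups()
        if op not in OPS:
            raise ValueError(f"unknown operation: {op}")
        rules.append(Rule(target, op, source))
    return rules


def apply_rules(rules: List[Rule], row: Dict[str, str]) -> Dict[str, object]:
    """Build a normalized row by applying each rule to its source column."""
    return {rule.target: OPS[rule.op](row[rule.source]) for rule in rules}


rules = parse_rules("""
country = upper(geo)
spend_usd = to_float(spend)
partner = trim(partner_name)
""")
normalized = apply_rules(rules, {"geo": "us", "spend": "10.50", "partner_name": " acme "})
```

Parsing rules separately from applying them mirrors the parser and applier stages named on the previous slide. The same parsed rules could then serve both the streaming and the batch path.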

MLPlatform
● Machine learning pipeline based on Apache Spark ML
○ Feature Engineering
○ Model Training
○ Predictions
○ Model Testing/Tuning
○ Model Deployment

MLPlatform (cont’d)
● Feature Engineering extensions
○ Aggregator => NumericAggregator
■ Performs an aggregate transformation on the input Dataset
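
The slides give no detail on NumericAggregator beyond its name. As a rough illustration of an aggregate transformation, the sketch below groups rows by a key and applies named numeric aggregates; the function `numeric_aggregate`, its parameters and the sample columns are assumptions, and plain Python lists stand in for a Spark Dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Mapping


def numeric_aggregate(
    rows: Iterable[Mapping[str, object]],
    key: str,
    value: str,
    funcs: Mapping[str, Callable[[List[float]], float]],
) -> Dict[object, Dict[str, float]]:
    """Group rows by `key` and apply every aggregate in `funcs` to `value`."""
    groups: Dict[object, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {
        group: {name: fn(values) for name, fn in funcs.items()}
        for group, values in groups.items()
    }


# Hypothetical input: one row per purchase event.
events = [
    {"user": "u1", "amount": 2.0},
    {"user": "u1", "amount": 4.0},
    {"user": "u2", "amount": 5.0},
]
features = numeric_aggregate(events, "user", "amount",
                             {"sum": sum, "mean": mean, "max": max})
```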

○ ParallelCountVectorizer
■ Compute CountVectorizer per input column
○ ParallelIDF
■ Compute IDF per input column
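
To make "per input column" concrete, here is a small pure-Python sketch that fits a vocabulary, count vectors and IDF weights independently for each token column. It is not the ParallelCountVectorizer or ParallelIDF code. The IDF formula log((m + 1) / (df + 1)) is the one documented for Spark MLlib. The alphabetical vocabulary order, the helper names and the sample columns are assumptions made for the example, and the columns are processed sequentially here.

```python
import math
from collections import Counter
from typing import Dict, List

Docs = List[List[str]]  # one token list per row


def fit_vocab(docs: Docs) -> List[str]:
    # Alphabetical order is chosen only to make the output deterministic.
    return sorted({token for doc in docs for token in doc})


def count_vectors(docs: Docs, vocab: List[str]) -> List[List[int]]:
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vectors


def idf_weights(docs: Docs, vocab: List[str]) -> List[float]:
    m = len(docs)
    weights = []
    for term in vocab:
        df = sum(1 for doc in docs if term in doc)  # document frequency
        weights.append(math.log((m + 1) / (df + 1)))
    return weights


def fit_columns(columns: Dict[str, Docs]) -> Dict[str, dict]:
    """Fit each column independently; columns share no state."""
    result = {}
    for name, docs in columns.items():
        vocab = fit_vocab(docs)
        result[name] = {
            "vocab": vocab,
            "vectors": count_vectors(docs, vocab),
            "idf": idf_weights(docs, vocab),
        }
    return result


fitted = fit_columns({
    "title": [["war", "game"], ["game", "game"]],
    "tags": [["rpg"], ["rpg", "mmo"]],
})
```

Because no column depends on another, the per-column work can be distributed, which is presumably what the "Parallel" prefix refers to.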

○ DAGPipeline
■ Support multi-input dataset DAG based feature extraction
[Diagram: example DAG with nodes n1, n2, n3 and n4, and the DAGModel generated from it]
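
A DAG-based pipeline with several input datasets can be sketched as a topological walk over stage dependencies. This is an illustration only, not the DAGPipeline API: the edges (n1 and n2 feeding n3, n3 feeding n4), the stage functions and the list-based datasets are assumptions, since the slide's diagram did not survive extraction. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_dag(
    stages: Dict[str, Callable[..., List[float]]],
    deps: Dict[str, List[str]],
    inputs: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Run every stage after its dependencies; `inputs` holds the source datasets."""
    results = dict(inputs)
    for node in TopologicalSorter(deps).static_order():
        if node in results:
            continue  # an input dataset, nothing to compute
        args = [results[dep] for dep in deps.get(node, [])]
        results[node] = stages[node](*args)
    return results


# Assumed shape: n1 and n2 are input datasets, n3 joins them, n4 transforms n3.
deps = {"n3": ["n1", "n2"], "n4": ["n3"]}
stages = {
    "n3": lambda a, b: [x + y for x, y in zip(a, b)],
    "n4": lambda c: [2 * x for x in c],
}
outputs = run_dag(stages, deps, {"n1": [1, 2, 3], "n2": [10, 20, 30]})
```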

● Model Testing/Tuning
○ Feature Store
■ Rapid iterative model testing
○ Configurable Split-Testing
○ Model Store
■ Based on SparkML MLWritable
● Predictions
○ Can be generated using any version of model
○ Compared across model implementations
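
The idea of generating predictions from any stored model version and comparing them can be shown with a toy in-memory store. This is an assumption-laden sketch: the real Model Store is described as being based on SparkML MLWritable, whereas `ModelStore` here is a hypothetical class that keeps plain Python callables in a dictionary, and the model name and functions are made up.

```python
from typing import Callable, Dict, List

Model = Callable[[float], float]


class ModelStore:
    """Hypothetical versioned store; versions start at 1 and only grow."""

    def __init__(self) -> None:
        self._models: Dict[str, List[Model]] = {}

    def save(self, name: str, model: Model) -> int:
        versions = self._models.setdefault(name, [])
        versions.append(model)
        return len(versions)  # the version number just assigned

    def load(self, name: str, version: int) -> Model:
        return self._models[name][version - 1]


def predict(store: ModelStore, name: str, version: int, features: List[float]) -> List[float]:
    model = store.load(name, version)
    return [model(x) for x in features]


store = ModelStore()
v1 = store.save("ltv", lambda x: 2 * x)
v2 = store.save("ltv", lambda x: 2 * x + 1)

features = [1.0, 2.0, 3.0]
preds_v1 = predict(store, "ltv", v1, features)
preds_v2 = predict(store, "ltv", v2, features)

# Compare two versions on the same features.
mean_abs_diff = sum(abs(a - b) for a, b in zip(preds_v1, preds_v2)) / len(features)
```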

● Predictions using Apache Zeppelin based visualization layer
○ Notebooks allow for rapid testing and model iteration
○ Graphing library allows for instant visual feedback

Second Layer of Intelligence
What is the output from ML models?
● Predictions
What is the business value of it?
● Not much
What does the business need?
● Translate predictions into ad partner instructions

What are Partner Instructions?
A partner instruction is a command that a partner can or should execute:
● Create a new campaign
● Update Budget
● Update Bid
● Update Targeting
● Update Creative Asset
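
One way to picture a partner instruction is as a small typed command that can be serialized for delivery. The five action names below come from the list above. The field names, the JSON layout and the `PartnerInstruction` class are hypothetical and are not taken from the slides.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Action(Enum):
    CREATE_CAMPAIGN = "create_campaign"
    UPDATE_BUDGET = "update_budget"
    UPDATE_BID = "update_bid"
    UPDATE_TARGETING = "update_targeting"
    UPDATE_CREATIVE = "update_creative"


@dataclass(frozen=True)
class PartnerInstruction:
    partner: str        # hypothetical field names
    campaign_id: str
    action: Action
    payload: dict       # action-specific parameters, e.g. the new bid


def to_json(instruction: PartnerInstruction) -> str:
    body = asdict(instruction)
    body["action"] = instruction.action.value  # enums are not JSON serializable
    return json.dumps(body, sort_keys=True)


instruction = PartnerInstruction(
    partner="partner_a",
    campaign_id="c1",
    action=Action.UPDATE_BID,
    payload={"bid": 1.25},
)
encoded = to_json(instruction)
```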

Campaign Initialization Process
Campaign Initialization:
● Bid
○ Finds the best possible bid for creating campaigns
● Budget
○ Splits the total budget between partners
● Targeting
○ Generates sets of possible targeting groups (Gender, Age, GEO)
● Creative
○ Generates and assigns creatives
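
Two of the initialization steps, splitting a budget between partners and generating targeting groups, lend themselves to a short sketch. The slides do not say how the split is computed, so the proportional-to-score rule here is an assumed stand-in, and the function names, scores and attribute values are illustrative only.

```python
from itertools import product
from typing import Dict, List


def split_budget(total: float, scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: each partner gets a share proportional to its score."""
    weight_sum = sum(scores.values())
    if weight_sum <= 0:
        raise ValueError("scores must sum to a positive number")
    return {partner: round(total * score / weight_sum, 2)
            for partner, score in scores.items()}


def targeting_groups(genders: List[str], ages: List[str], geos: List[str]) -> List[Dict[str, str]]:
    """Every combination of gender, age bracket and geo."""
    return [{"gender": g, "age": a, "geo": c}
            for g, a, c in product(genders, ages, geos)]


budgets = split_budget(1000.0, {"partner_a": 3.0, "partner_b": 1.0})
groups = targeting_groups(["F", "M"], ["18-24", "25-34"], ["US"])
```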

Campaign Optimization Process
Campaign Optimization:
● Bid
○ Increases or decreases bids per campaign based on performance predictions
● Budget
○ Increases, decreases and reshuffles budget across partners/campaigns
● Targeting
○ Updates targeting based on performance
● Creative
○ Reassigns creatives based on performance
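
As a sketch of prediction-driven bid updates, the function below nudges a bid up or down depending on whether the predicted return beats a target. The decision rule, the 10% step, the ROI target and the bid floor are all assumptions chosen for illustration. The slides state only that bids are increased or decreased based on performance predictions.

```python
from typing import Dict


def adjust_bid(
    current_bid: float,
    predicted_roi: float,
    target_roi: float = 1.0,   # assumed break-even target
    step: float = 0.10,        # assumed 10% adjustment per run
    floor: float = 0.01,       # assumed minimum bid
) -> float:
    """Raise the bid when the prediction beats the target, lower it otherwise."""
    if predicted_roi > target_roi:
        new_bid = current_bid * (1 + step)
    elif predicted_roi < target_roi:
        new_bid = current_bid * (1 - step)
    else:
        new_bid = current_bid
    return round(max(new_bid, floor), 2)


def optimize(bids: Dict[str, float], predictions: Dict[str, float]) -> Dict[str, float]:
    """Apply the adjustment to every campaign that has a prediction."""
    return {campaign: adjust_bid(bid, predictions[campaign])
            for campaign, bid in bids.items()
            if campaign in predictions}


new_bids = optimize(
    {"c1": 1.00, "c2": 2.00, "c3": 0.01},
    {"c1": 1.4, "c2": 0.5, "c3": 0.2},
)
```

Each changed bid would then be emitted as an "Update Bid" partner instruction.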

Campaign Initialization/Optimization Process
[Diagram labels: ML Output (Predictions), Historical Data, Initializer, Optimizer, Ad Partner Instructions, Ad Partner]

Data Service Framework
● Where to store metadata for Data Pipelines?
● Where to store Ad Partner Instructions?
● How to deliver Ad Partner Instructions?

Data Service Framework (cont’d)
Possible Microservices:
● Ad Partner Data Service
● Campaign Data Service
● ASP Data Service
● Ad Partner Instruction Service

Data Service Framework (cont’d)
Technologies:
● REST API
● Spring Boot
● OpenShift / Kubernetes
● Gradle + Jenkins Pipelines for CI/CD

Connect All Components Together
[Diagram labels: Data Ingestion, Data Normalization, MLPlatform, Ad Partner Data Services]
