FOUNDATIONS FOR A
DATA-DRIVEN MARKETING
ENGINE
Michael Dreibelbis
INTRO
•MZ is changing the digital marketing landscape by building the most
sophisticated Performance Marketing Platform using data from a vast
network of advertising channels.
•MZ’s Cognant Division manages marketing campaigns for internal
games as well as external clients.
OVERVIEW
•Problem Statement
•Choosing Gobblin
•Gobblin @MZ
•Customization
•Questions
•Contact
PROBLEM STATEMENT
•Ingest Data from over 300 Ad network channels
•Support REST, Email, S3, Kafka
•Support schema migrations and changes from input data
•Normalize disparate data sets
•Support partial data set merge
•Scale Horizontally
CHOOSING GOBBLIN
• The Good
• Familiarity with Camus (Gobblin's predecessor for Kafka -> HDFS ingestion)
• Support for stream + batch
• Minimal Learning curve
• Configuration Driven
• The Bad
• No higher-order API (compare Flink / Beam / Spark)
• Minimal community support (at start)
• No GUI
GOBBLIN OVERVIEW
• State
• Source
• Extractor
• Converter
• QualityChecker
• Partitioner + Writer
• Publisher
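As a rough illustration of how these constructs fit together, the sketch below pushes records through extract, convert, quality-check, write, and publish steps. It is plain Python rather than Gobblin's Java API; every function name is hypothetical, and the State and Partitioner constructs are left out for brevity.

```python
from typing import Callable, Iterable, Iterator

Record = dict

def run_task(
    extract: Callable[[], Iterable[Record]],
    convert: Callable[[Record], Iterator[Record]],
    passes_quality: Callable[[Record], bool],
    write: Callable[[Record], None],
) -> int:
    """Run one unit of work; return the number of records written."""
    written = 0
    for raw in extract():
        for record in convert(raw):      # a converter may emit 0..n records
            if passes_quality(record):   # row-level check before writing
                write(record)
                written += 1
    return written

# Toy wiring: staged output is only "published" once the task has finished.
staging: list[Record] = []
published: list[Record] = []

def extract():
    return [{"clicks": "3"}, {"clicks": ""}, {"clicks": "7"}]

def convert(raw):
    if raw["clicks"]:                    # drop rows with no value
        yield {"clicks": int(raw["clicks"])}

count = run_task(extract, convert, lambda r: r["clicks"] > 5, staging.append)
published.extend(staging)                # publish step: commit staged records
```

Keeping the publish step separate from the write step means a failed task leaves nothing half-written in the final location.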
GOBBLIN @MZ (Implementation)
•Started using Gobblin 0.6.0 -> 0.9.0
• POC to production in < 3 months with 3 engineers
• Replaced existing home grown ingestion framework
• Facebook 30-day backfill: from never completing to 45 minutes
• CognantDSP: from 5 hours to 20 minutes
GOBBLIN @MZ (Deployment)
• Local build (Standalone) and multi-remote (MapReduce)
• Local mode allows easy testing while developing new integrations
• Deployment to Hadoop clusters through Jenkins CI
• Azkaban Scheduler manages ~225 ingestion jobs
• Separate job files from runtime jars
GOBBLIN @MZ (Customization)
•Source
• S3FileInputSource (Extension of HadoopFileInputSource)
• EmailFileSource
• CSV extraction + header validation on ingestion
•Extractor
• RestApiExtractor
• Handle async polling + pagination
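The "async polling + pagination" pattern can be sketched as follows, assuming an API that returns a job id, reports readiness through a status call, and serves results in cursor-based pages. `FakeReportApi` and all of its methods are hypothetical stand-ins and do not reflect the actual RestApiExtractor or any real ad network API.

```python
import time

class FakeReportApi:
    """In-memory stand-in for an ad network's reporting API (hypothetical)."""

    def __init__(self, rows, page_size=2, polls_until_ready=2):
        self._rows = rows
        self._page_size = page_size
        self._polls_left = polls_until_ready

    def request_report(self):
        return "job-1"  # a real API would return a server-side job id

    def status(self, job_id):
        self._polls_left -= 1
        return "DONE" if self._polls_left <= 0 else "RUNNING"

    def fetch_page(self, job_id, cursor):
        end = cursor + self._page_size
        next_cursor = end if end < len(self._rows) else None
        return self._rows[cursor:end], next_cursor


def extract_report(api, poll_interval=0.0, max_polls=10):
    """Yield every row of an asynchronously generated, paginated report."""
    job_id = api.request_report()
    for _ in range(max_polls):           # async polling with an upper bound
        if api.status(job_id) == "DONE":
            break
        time.sleep(poll_interval)
    else:
        raise TimeoutError(f"report {job_id} not ready after {max_polls} polls")

    cursor = 0
    while cursor is not None:            # pagination: follow cursors to the end
        rows, cursor = api.fetch_page(job_id, cursor)
        yield from rows


rows = list(extract_report(FakeReportApi(rows=[1, 2, 3, 4, 5])))
```

Bounding the number of polls keeps a stuck remote report from blocking the whole ingestion job.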
GOBBLIN @MZ (Customization)
•Converter
• Rule-based Converter
• Translate input records to output records with any number of custom DDL rules, which are supplied as RuleSets
• Example rules: NUMERIC_CAST, TIMESTAMP, DATE, MATH, SUBSTRING
•QualityChecker
• RequiredFieldsPolicy
• Ensure required columns are populated (RowLevelPolicy)
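A minimal sketch of what a rule-driven conversion and a required-fields check could look like. The slides do not show the RuleSet format, so the rule dictionaries (`type`, `source`, `target`, and so on), the handler table, and the sample column names are assumptions made for illustration only; just three of the listed rule types are covered.

```python
from datetime import datetime

# Each rule handler takes (record, rule) and returns the output value.
RULE_HANDLERS = {
    "NUMERIC_CAST": lambda rec, rule: float(rec[rule["source"]]),
    "SUBSTRING": lambda rec, rule: rec[rule["source"]][rule["start"]:rule["end"]],
    "DATE": lambda rec, rule: datetime.strptime(
        rec[rule["source"]], rule["format"]
    ).date().isoformat(),
}

def apply_ruleset(record, ruleset):
    """Build an output record by applying each rule to the input record."""
    return {
        rule["target"]: RULE_HANDLERS[rule["type"]](record, rule)
        for rule in ruleset
    }

def has_required_fields(record, required):
    """Row-level check: every required column is present and non-empty."""
    return all(record.get(name) not in (None, "") for name in required)

# Hypothetical RuleSet for one ad network's report format.
ruleset = [
    {"type": "NUMERIC_CAST", "source": "Spend", "target": "spend"},
    {"type": "SUBSTRING", "source": "Campaign", "target": "campaign_id",
     "start": 0, "end": 4},
    {"type": "DATE", "source": "Day", "target": "date", "format": "%m/%d/%Y"},
]
raw = {"Spend": "12.50", "Campaign": "1234-summer", "Day": "03/09/2018"}
converted = apply_ruleset(raw, ruleset)
```

Expressing the mapping as data rather than code is what lets a new integration be added through configuration alone.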
GOBBLIN @MZ (Customization)
•Partitioner
• FieldAndTimeBasedWriterPartitioner
• Existing Implementation allowed for /custom/column/YYYY/MM/DD/data.json
• New Implementation allowed for /custom/column/YYYY/MM/DD/custom/column/data.json
•Writer
• AvroOrcWriter
• Write `GenericRecord`s to ORC files.
• AvroParquetWriter
• Write `GenericRecord`s to Parquet files.
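The difference between the two partition layouts can be sketched as a path builder that places some columns before the date and some after it. This is not the FieldAndTimeBasedWriterPartitioner code; the function, its parameters, and the assumption that the timestamp is in epoch seconds (UTC) are all illustrative.

```python
from datetime import datetime, timezone

def partition_path(record, prefix_fields, suffix_fields, timestamp_field):
    """Build <prefix cols>/YYYY/MM/DD/<suffix cols> from a record's fields."""
    ts = datetime.fromtimestamp(record[timestamp_field], tz=timezone.utc)
    parts = [str(record[f]) for f in prefix_fields]
    parts += [ts.strftime("%Y"), ts.strftime("%m"), ts.strftime("%d")]
    parts += [str(record[f]) for f in suffix_fields]
    return "/".join(parts)

record = {"network": "facebook", "report": "ads", "ts": 1520553600}

# Existing layout: columns only before the date.
old_path = partition_path(record, ["network"], [], "ts")
# New layout: columns on both sides of the date.
new_path = partition_path(record, ["network"], ["report"], "ts")
```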
GOBBLIN @MZ (Architecture)
GOBBLIN @MZ (Learnings)
•Job Properties - templates are your friend
• WorkUnit/Extract/State Management
• Materializes/Serializes entire Job + WorkUnitState objects on driver
• Monitoring is per task (not at the job level)
• Metrics flooded our Graphite instance
• Compaction
• Works great if you’re using a Hive table with a single primary key
• JSON key ordering matters for compaction on output strings
• Had to write our own custom compaction application
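One plausible reading of the key-ordering point is that two JSON strings holding the same record compare as different when their keys are serialized in a different order, so string-based deduplication misses them. The sketch below shows that effect and a canonical-form workaround; it is not MZ's compaction application, and the field names are made up.

```python
import json

def canonical(record):
    """Serialize with sorted keys so equal records give equal strings."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))

def compact(lines):
    """Deduplicate JSON lines, keeping the first occurrence of each record."""
    seen = set()
    out = []
    for line in lines:
        key = canonical(json.loads(line))
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

lines = ['{"ad_id": 1, "clicks": 5}', '{"clicks": 5, "ad_id": 1}']
naive = set(lines)          # string comparison sees two different records
compacted = compact(lines)  # canonical form sees one
```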
GOBBLIN @MZ (Stats)
• 125M ad performance records ingested per day
• >500K campaigns
• >5M ad groups
• >20M ads
• >7M records for largest job
• >250 active ad network integrations
Questions?
Contact
email: mdreibelbis@mz.com
in: https://linkedin.com/in/belbis
END
Apache Gobblin at MZ




