SMART VIDEO ADVERTISING
Processing Complex Workflows
in Advertising Using Hadoop
June 3rd, 2014
Who we are
Rahul Ravindran, Data Team, rahul@brightroll.com
Bernardo de Seabra, Data Team, bernardo@brightroll.com, @bseabra
Agenda
• Introduction to BrightRoll
• Data Consumer Requirements
• Motivation
• Design
– Streaming log data into HDFS
– Anatomy of an event
– Event de-duplication
– Configuration driven processing
– Auditing
• Future
Introduction: BrightRoll
• Largest Online Video Advertisement Platform
• BrightRoll builds technology that improves
and automates video advertising globally
• Reaching 53.9% of US audience, 168MM
unique viewers
• 3+ Billion video ads / month
• 20+ Billion events processed / day
Data Consumer Requirements
• Processing results
– Campaign delivery
– Analytics
– Telemetry
• Consumers of processed data
– Delivery algorithms to augment decision behavior
– Campaign managers to monitor/tweak campaigns
– Billing system
– Forecasting/planning tools
– Business Analysts: long/short term analysis
Motivation – legacy data pipeline
• Not linearly scalable
• Unit of processing was single campaign
• Not HA
• Lots of moving parts, no centralized control
and monitoring
• Failure recovery was time consuming
Motivation – legacy data pipeline
• Lots of boilerplate code
– hard to onboard new data/computations
• Interval based processing
– 2 hour sliding window
– Inherited delay
– Inefficient use of resources
• All data must be retrieved prior to processing
Performance requirements
• Low end-to-end latency for delivery of aggregated metrics
– Feedback loop into delivery algorithm
– Campaign managers can react faster to their campaign performance
• Linearly scalable
Design decisions
• Streaming model
– Data is continuously being written
– Process data once
– Checkpoint states
– Low end-to-end latency (5 mins)
• Idempotent
– Jobs can fail, tasks can fail, allows repeatability
• Configuration driven join semantics
– Ease of on-boarding new data/computations
Overview: Data Processing Pipeline
[Pipeline diagram: Data Producers → Flume NG → HDFS → De-duplicate → Process → Store (using M/R and HBase) → Data Warehouse]
Stream log data into HDFS using Flume
[Diagram: adserv machines ship their logs through Flume into HDFS; the rolled files (File.1234 through File.1239) form the input stream, with a marker pointing at the current position in that stream]
• Flume rolls files every 2 minutes
• Files are lexicographically ordered
• Treat the files written by Flume as a stream
• Maintain a marker which points to the current location in the input stream (see sketch below)
• Enables us to always process new data
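The marker logic is only sketched on the slide; here is a minimal, hypothetical Java sketch (the in-memory marker and the processFile hook are assumptions, not BrightRoll's code) of consuming the Flume-rolled files in lexicographic order, starting just past the marker:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlumeStreamReader {

  // Marker: name of the last file we fully processed (e.g. "File.1235").
  private String marker;

  public void processNewFiles(Configuration conf, Path logDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);

    // Flume rolls a new file every ~2 minutes; names are lexicographically
    // ordered, so sorting by name reproduces arrival order.
    FileStatus[] files = fs.listStatus(logDir);
    Arrays.sort(files, Comparator.comparing(f -> f.getPath().getName()));

    for (FileStatus file : files) {
      String name = file.getPath().getName();
      // Skip everything at or before the marker; only new data is processed.
      if (marker != null && name.compareTo(marker) <= 0) {
        continue;
      }
      processFile(fs, file.getPath()); // hypothetical per-file processing hook
      marker = name;                   // advance the marker (persist it in practice)
    }
  }

  private void processFile(FileSystem fs, Path path) {
    // Application-specific parsing / loading into HBase would go here.
  }
}
```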
Anatomy of an event
[Diagram: each event consists of an event header (Event ID, Event timestamp, Event type, Machine ID) and an event payload]
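As a rough illustration only (field names and types are assumptions, not BrightRoll's actual schema), the event layout could be modeled like this:

```java
// Minimal sketch of the event layout from the slide: a header
// (event ID, timestamp, type, machine ID) followed by an opaque payload.
public class LogEvent {
  public final String eventId;   // globally unique, assigned when the event is logged
  public final long   timestamp; // event time
  public final String eventType; // e.g. auction, impression (hypothetical values)
  public final String machineId; // adserv machine that produced the event
  public final byte[] payload;   // event-specific body

  public LogEvent(String eventId, long timestamp, String eventType,
                  String machineId, byte[] payload) {
    this.eventId = eventId;
    this.timestamp = timestamp;
    this.eventType = eventType;
    this.machineId = machineId;
    this.payload = payload;
  }
}
```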
De-duplication
[Diagram: raw logs are loaded into an HBase table]
• We load raw logs into an HBase table
• We use the HBase table as a stream
• We keep track of a time-based marker per table which represents a point in time up to which we have processed data
HBase table
[Diagram: start time and end time delimit the window of data to be processed]
• The next run will read data which was inserted from start time to end time (the window of TO_BE_PROCESSED data) (see sketch below)
• Rowkey is <salt, event timestamp, event id>
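A minimal sketch, assuming the HBase client API of the time, of how the salted rowkey and the TO_BE_PROCESSED window scan could be constructed; the one-byte salt derived from the event ID hash follows the speaker notes, but the exact salting and key encoding are assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {

  // Rowkey = <salt, event timestamp, event id>.
  // The one-byte salt is derived from the event ID hash to spread load
  // across regions and avoid hotspotting on the timestamp prefix.
  public static byte[] rowKey(String eventId, long eventTimestamp) {
    byte[] salt = new byte[] { (byte) (eventId.hashCode() & 0xFF) };
    return Bytes.add(salt, Bytes.toBytes(eventTimestamp), Bytes.toBytes(eventId));
  }

  // Scan only cells inserted between the table's start-time and end-time
  // markers, i.e. the window of TO_BE_PROCESSED data.
  public static Scan windowScan(long startTime, long endTime) throws IOException {
    Scan scan = new Scan();
    scan.setTimeRange(startTime, endTime);
    return scan;
  }
}
```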
• Break up the data in WINDOW_TO_BE_PROCESSED into chunks (see sketch below)
• Each chunk has the same salt and contiguous event timestamps
• Each chunk is sorted – artifact of HBase storage

Salt  time  id
4     1234  foobar1
4     1234  foobar2
4     1235  foobar3
6     1234  foobar4
7     1235  foobar5
7     1236  foobar6

[Diagram: the rows above group into three chunks, one per run of rows sharing a salt]
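The chunking step can be illustrated with a small sketch (the Row type and the salt-change test are simplifications, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {

  // Simplified view of a row in the TO_BE_PROCESSED window,
  // already sorted by <salt, timestamp, id> as HBase returns it.
  public static class Row {
    public final int salt;
    public final long timestamp;
    public final String eventId;
    public Row(int salt, long timestamp, String eventId) {
      this.salt = salt;
      this.timestamp = timestamp;
      this.eventId = eventId;
    }
  }

  // Break the window into chunks: rows in a chunk share the same salt and
  // form a contiguous, already-sorted run of event timestamps.
  public static List<List<Row>> chunk(List<Row> sortedRows) {
    List<List<Row>> chunks = new ArrayList<>();
    List<Row> current = new ArrayList<>();
    for (Row row : sortedRows) {
      if (!current.isEmpty() && current.get(current.size() - 1).salt != row.salt) {
        chunks.add(current);      // salt changed: close the current chunk
        current = new ArrayList<>();
      }
      current.add(row);
    }
    if (!current.isEmpty()) {
      chunks.add(current);
    }
    return chunks;
  }
}
```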
Historical scan: no time range, multi-versions
[Diagram: the chunk's StartRow and EndRow bound the scan]
• A new Scan object gives the historical view (see sketch below)
• Perform de-duplication of the data in the chunk based on the historical view

Key              Event payload
4,1234,foobar1
4,1234,foobar2
4,1235,foobar3
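A minimal sketch of the historical scan, assuming the HBase client API of that era: the Scan is bounded by the chunk's start and end rowkeys but carries no time range and requests all versions, so earlier writes of the same rowkeys (for example, from replays) become visible for de-duplication.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class HistoricalDedup {

  // Build the historical view for one chunk: bounded by the chunk's first and
  // last rowkeys, but with no time range, so previously written versions of
  // the same rowkeys (i.e. duplicates from replays) are visible.
  public static ResultScanner historicalScan(HTableInterface table,
                                             byte[] chunkStartRow,
                                             byte[] chunkEndRow) throws IOException {
    Scan scan = new Scan(chunkStartRow, chunkEndRow);
    scan.setMaxVersions();          // include all historical versions
    return table.getScanner(scan);
  }
}
```

De-duplication then keeps an event only if the historical view contains no version of its rowkey written before the current window (the exact duplicate test is an assumption).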
De-duplication performance
• High dedup throughput – 1.2+ million events per second
• Dedup across 4 days of historical data
[Chart: scan cost over time for StartRow/EndRow scans vs. TimeRange scans, with a compaction coprocessor compacting files older than the table start time]
Processing - Joins
[Diagram: Impression and Auction events are joined to produce a Computation]
Arbitrary joins
• Uses a mechanism very similar to the de-duplication previously described (see sketch below)
• The historical scan now checks for the other events specified in the join
• Business-level de-duplication – duplicate impressions for the same auction are handled here as well
• “Session debugging”
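The speaker notes say the events to be joined, along with the fields to be used, are defined in a configuration file, and that the last event type in the join triggers the computation. A hypothetical Java representation of such a join definition (all names and fields here are assumptions, not the actual config format) could be:

```java
import java.util.List;
import java.util.Set;

public class JoinDefinition {

  public final String computationName;    // e.g. "impression_x_auction" (hypothetical)
  public final List<String> eventTypes;    // event types participating in the join
  public final List<String> joinFields;    // fields pulled from each event
  public final String triggeringEventType; // last event type in the join triggers the computation

  public JoinDefinition(String computationName, List<String> eventTypes,
                        List<String> joinFields, String triggeringEventType) {
    this.computationName = computationName;
    this.eventTypes = eventTypes;
    this.joinFields = joinFields;
    this.triggeringEventType = triggeringEventType;
  }

  // The historical scan is checked for the other event types in the join;
  // the computation fires only when the triggering event arrives and all
  // joined event types have been seen.
  public boolean shouldCompute(String arrivedEventType, Set<String> seenEventTypes) {
    return triggeringEventType.equals(arrivedEventType)
        && seenEventTypes.containsAll(eventTypes);
  }
}
```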
Auditing
[Diagram: adserv machines send metadata (machine id, # events, time interval) to an Auditor; the Auditor compares it against the deduped output (Deduped.1, Deduped.2, Deduped.3) and triggers a disk replay when they do not match]
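Following the speaker notes, a minimal sketch of the audit check (class and field names are assumptions): per machine and time interval, the event count observed in the deduped stream is compared against the metadata reported by the adserv machines, and a mismatch forces a disk replay.

```java
import java.util.Map;

public class Auditor {

  // Key identifying one metadata record: which machine, which time interval.
  public static class IntervalKey {
    public final String machineId;
    public final long intervalStart;
    public IntervalKey(String machineId, long intervalStart) {
      this.machineId = machineId;
      this.intervalStart = intervalStart;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof IntervalKey)) return false;
      IntervalKey k = (IntervalKey) o;
      return machineId.equals(k.machineId) && intervalStart == k.intervalStart;
    }
    @Override public int hashCode() {
      return machineId.hashCode() * 31 + Long.hashCode(intervalStart);
    }
  }

  // Compare counts from the deduped stream with counts reported by the
  // adserv machines; any mismatch triggers a replay of that machine's files,
  // which the de-duplication step then makes safe.
  public void audit(Map<IntervalKey, Long> reportedCounts,
                    Map<IntervalKey, Long> dedupedCounts) {
    for (Map.Entry<IntervalKey, Long> e : reportedCounts.entrySet()) {
      long processed = dedupedCounts.getOrDefault(e.getKey(), 0L);
      if (processed != e.getValue()) {
        requestReplay(e.getKey());   // hypothetical hook: replay files from disk
      }
    }
  }

  private void requestReplay(IntervalKey key) {
    // Trigger re-shipping of the raw files for this machine and interval.
  }
}
```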
What we have now
• Everything we have talked about, plus a system which:
– Scales linearly
– Is HA within our data center
– Is HA across data centers (by switching traffic)
– Allows us to on-board new computations easily
– Provides guarantees on consumption of data in the pipeline
Future
• Move to HBase 0.98/1.x
• Further improvements to De-duplication
algorithm
• Dynamic definition of join semantics
• HDFS Federation
Questions


Editor's Notes

  • #2: Good afternoon everyone. Thanks for joining us for this talk, Processing Complex Workflows in Advertising Using Hadoop.
  • #3: My name is Bernardo de Seabra, this is Rahul Ravindran, and we are part of the Data Team at BrightRoll. Our team is responsible for all Big Data-related things in the company, including the most recent project we undertook to rebuild the data processing pipeline that powers many of the critical components of the BrightRoll technology stack. That data processing pipeline will be the focus of this talk.
  • #4: In order to give the audience some more context we’ll take a minute to explain what BrightRoll does for those of you that are not familiar with the company. We will then cover the requirements of all the different consumers of data throughout the platform, the motivation to develop a new data processing pipeline and the design decisions made to respond to such requirements.
  • #9: Smaller chance of under-delivery or over-delivery, which costs us money.
  • #13: Bernardo covers up to this slide. All files up to File.1235 have been processed. The arrows between files represent time: an older file is followed by the next file.
  • #14: A globally unique event ID is generated for each event at the point when the event is logged.
  • #15: We have a requirement to consume all logs. We have a separate audit mechanism to verify if all logs were consumed by the pipeline. We automatically replay log lines if we find missing log lines. On replay, we may have duplicates which need to be de-duped.
  • #16: Historical perspective: we began with a naïve dedup algorithm where we would look up each event ID to check whether it exists; if so, it is a duplicate, else we would emit it. This was too slow, as a large number of such random lookups were slow and each lookup went over the entire keyspace. We needed a mechanism to constrain the keyspace and perform a range query, but with event IDs being random, this was hard. Hence, we needed the event timestamp at the beginning of the rowkey, but this would result in hotspotting, so we added a one-byte salt, generated from the hash of the event ID, as the prefix to distribute load across all the regions.
  • #18: The StartRow and EndRow of each chunk are used to construct a new Scan object with no constraints on time. This scan constrains the keyspace in the query using startRow and endRow.
  • #19: As the number of HFiles increases, timerange scans benefit, since many HFiles outside the time range are ignored. However, as the number of HFiles increases, the historical scan gets slower, as all the HFiles need to be scanned. In the other scenario, if we have one giant HFile (say, after a major compaction run), then the timerange scan has to scan the entire HFile, which is slow. So we use a coprocessor which enables us to use the number of HFiles as a coarse index on time for recent data (where we do a timerange scan) and older, large HFiles which provide an index on the rowkey.
  • #20: Allows arbitrary event joins. The events to be joined, along with the fields to be used, are defined in a configuration file to make adding new computations easy. All financial computations are expressed via config; currently, 24 different computations exist. On-boarding a new computation is a change to the config file. Each computation is an entry in a different HBase table.
  • #21: Also allows us to perform joins across events generated over arbitrary and possibly long time windows (currently 2 hours), since mobile clients frequently cache the auction results and show an ad later (as much as 2 hours from auction time); hence an impression may be generated 2 hours after its auction. This does not require us to compare with all the old data, whereas the old pipeline required us to load all the data for 2 hours to perform the join. The last event type which is part of the join triggers a computation. Since we have a view into the joined data, other engineering teams can query this data for better debugging at large scale. This allows for arbitrary joins across event types, which enables engineering to deal with new events.
  • #22: The Auditor processes the deduped stream and then uses it to compare with the metadata it has received from the adserving machines. If they do not match, we force a replay of the files from the adserv box, which then get deduped, thereby removing all the duplicates and ensuring that all the data makes it through to the processing pipeline.
  • #23: If we can provide something about how this has impacted business