REAL-TIME DATA PROCESSING AT RTB HOUSE
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
BIG DATA TECHNOLOGY WARSAW SUMMIT 2017
FEBRUARY 9, 2017
TABLE OF CONTENTS
Agenda:
- real-time bidding
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
REAL-TIME BIDDING
REAL-TIME BIDDING: RTB PLATFORM
Processing bid requests
(350K requests/s, ~30 SSP networks, under 50-100 ms each)
REAL-TIME BIDDING: DATA & MACHINE LEARNING
Impressions:
~ 150M events / day
~ 4TB data / day
Clicks:
~ 1M events / day
~ 35GB data / day
Conversions:
~ 450K events / day
~ 25GB data / day
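The daily volumes above imply an average size per event, which can be worked out with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10**12 bytes, 1 KB = 1000 bytes); the slide does not say which convention was used, and the names `DAILY_VOLUMES` and `average_event_size_kb` are illustrative only.

```python
# Daily volumes as quoted on the slide: (events per day, bytes per day).
# Decimal units are an assumption, not something the slide states.
DAILY_VOLUMES = {
    "impressions": (150_000_000, 4 * 10**12),
    "clicks": (1_000_000, 35 * 10**9),
    "conversions": (450_000, 25 * 10**9),
}


def average_event_size_kb(events_per_day: int, bytes_per_day: int) -> float:
    """Return the mean size of one event in kilobytes (1 KB = 1000 bytes)."""
    return bytes_per_day / events_per_day / 1000


if __name__ == "__main__":
    for name, (events, size) in DAILY_VOLUMES.items():
        print(f"{name}: ~{average_event_size_kb(events, size):.1f} KB per event")
```

Under these assumptions an impression averages roughly 27 KB, a click 35 KB and a conversion roughly 56 KB.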
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 1ST ITERATION: DRAWBACKS
Issues:
- long data migrations that overloaded the system (30 days back)
- complex servlet logic, inability to reprocess data
- inflexible, varied schemas
- single-DC
- inconsistencies
THE SECOND ITERATION:
DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topic partitioning
- partition replication
- log retention
- stateless
- efficient data consumption
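The properties listed above can be illustrated with a toy in-memory model. This is only a sketch of the idea of a partitioned, append-only log, not Kafka's implementation or API: the class `PartitionedLog` is hypothetical, and `crc32` stands in for Kafka's real key partitioner.

```python
from zlib import crc32


class PartitionedLog:
    """Toy in-memory model of a topic split into append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> tuple[int, int]:
        """Append a record and return (partition, offset).

        Records with the same key always land in the same partition,
        so their relative order is preserved.
        """
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int, max_records: int = 100) -> list[str]:
        """Read sequentially from an offset; the log keeps no per-consumer state."""
        return self.partitions[partition][offset:offset + max_records]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=4)
    partition, _ = log.append("user-1", "impression")
    log.append("user-1", "click")

    consumer_offset = 0  # the consumer, not the log, remembers its position
    batch = log.read(partition, consumer_offset)
    consumer_offset += len(batch)
    print(batch, consumer_offset)
```

Because each consumer tracks its own offset, resetting that offset replays the retained records, which is what makes reprocessing possible.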
THE 2ND ITERATION: BATCH LOADING
Why Camus:
- "Kafka to HDFS" pipeline
- map-reduce jobs, batches
- storing offsets in log files
- data partitioning
THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- data serialization framework
- rich data structures
- self-describing container files
- reader & writer schemas
- binary data format
- schema registry
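The "reader & writer schemas" point is what makes schema versioning workable: data written with one schema version can be read with another. The sketch below models the core resolution rule in plain Python and is not the Avro library: fields known to both schemas are copied, fields only the reader knows take the reader's default, and fields only the writer knows are dropped. The function `resolve_record`, the `REQUIRED` marker and the field names are hypothetical.

```python
REQUIRED = object()  # marker for reader fields that have no default value


def resolve_record(written: dict, reader_fields: dict) -> dict:
    """Project a record written with another schema version onto the reader's fields.

    `reader_fields` maps each field name to its default, or to REQUIRED.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in written:
            resolved[name] = written[name]  # field present in both schemas
        elif default is not REQUIRED:
            resolved[name] = default        # field added later: use the reader's default
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved                         # writer-only fields are ignored


if __name__ == "__main__":
    reader_v2 = {"url": REQUIRED, "time": REQUIRED, "creative": "unknown"}
    old_record = {"url": "https://example.com", "time": 1486630800, "legacy_flag": True}
    print(resolve_record(old_record, reader_v2))
```

Adding a field with a default, or removing a field, therefore does not break existing readers, while adding a field without a default does.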
THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault-tolerance
Why Trident:
- transactions, exactly-once processing
- microbatches (latency & throughput)
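Exactly-once counting over microbatches can be sketched by storing the id of the last applied batch next to each counter, so that a replayed batch is detected and skipped. The code below is a plain-Python illustration of that idea, not Trident's API; it assumes batches are committed in transaction-id order and that a replayed batch carries the same contents. The class `TransactionalCounter` and the key names are hypothetical.

```python
class TransactionalCounter:
    """Counter that stores the id of the last applied batch next to each value."""

    def __init__(self) -> None:
        # key -> (txid of the last batch applied to this key, running count)
        self._state: dict[str, tuple[int, int]] = {}

    def apply_batch(self, txid: int, increments: dict[str, int]) -> None:
        """Apply one microbatch of increments; replays of the same txid are no-ops."""
        for key, delta in increments.items():
            last_txid, count = self._state.get(key, (-1, 0))
            if txid == last_txid:
                continue  # already applied: a replay must not be counted twice
            self._state[key] = (txid, count + delta)

    def get(self, key: str) -> int:
        return self._state.get(key, (-1, 0))[1]


if __name__ == "__main__":
    counter = TransactionalCounter()
    counter.apply_batch(1, {"campaign-a": 3})
    counter.apply_batch(1, {"campaign-a": 3})  # replay after a simulated failure
    counter.apply_batch(2, {"campaign-a": 2})
    print(counter.get("campaign-a"))
```

Larger batches amortize the cost of each state update at the price of higher latency, which is the latency and throughput trade-off mentioned above.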
THE 2ND ITERATION: STATS-COUNTER TOPOLOGY
THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)
Other issues:
- Hive joins
- mutable events
- complex servlet logic
THE THIRD ITERATION:
NEW APPROACH
THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION":
  "URL",
  "TIME",
  "CREATIVE",
  ...
  "CLICKS",
  "CONVERSIONS"
}
{ "CLICK":
  "TIME",
  "IMPRESSION_ID",
  ...
  "IMPRESSION"
}
{ "CONVERSION":
  "TIME",
  "CLICK_ID",
  ...
  "CLICK"
}
New approach:
- real-time processing
- publishing light events
- immutable streams of events
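One way to picture light, immutable events is a merge step that never updates a stored record: it takes a light click that carries only an `IMPRESSION_ID` and emits a new event with the impression embedded. The sketch below is an assumption about how such a step could look, not the deck's actual implementation; the classes, the `impression_id` and `click_id` key fields and the in-memory lookup dictionary are all hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Impression:
    impression_id: str
    url: str
    time: int
    creative: str


@dataclass(frozen=True)
class Click:
    click_id: str
    time: int
    impression_id: str
    impression: Optional[Impression] = None  # filled in by the merge step


def merge_click(click: Click, impressions: dict[str, Impression]) -> Click:
    """Return a new click event with its impression embedded.

    Neither input is modified; an unknown impression id leaves the field as None.
    """
    return replace(click, impression=impressions.get(click.impression_id))


if __name__ == "__main__":
    impression = Impression("imp-1", "https://example.com", 1486630800, "banner-7")
    light_click = Click("clk-1", 1486630860, "imp-1")
    merged = merge_click(light_click, {impression.impression_id: impression})
    print(merged)

    try:
        merged.time = 0  # frozen dataclasses reject in-place changes
    except FrozenInstanceError:
        print("events are immutable")
```

Because every step produces a new event instead of rewriting an old one, the same input stream can be replayed to rebuild downstream data.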
THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE
THE 3RD ITERATION: DATA-FLOW TOPOLOGY
THE 3RD ITERATION: EVENTS MERGE
SUMMARY
What we have achieved:
- multi-DC architecture
- HDFS & BigQuery streaming
- platform monitoring
- much more stable platform
- higher quality of data processing
- better data-flow monitoring, deployment & maintenance
THANK YOU
FOR YOUR ATTENTION
