Real Time Data Processing at RTB House - Bartosz Łoś

ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY WARSAW SUMMIT 2017
FEBRUARY 9, 2017

TABLE OF CONTENTS
Agenda:
- real-time bidding
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events

REAL-TIME BIDDING: RTB PLATFORM
Processing bid requests
(350K requests/s, ~30 SSP networks, response within 50-100 ms)

REAL-TIME BIDDING: DATA & MACHINE LEARNING
Impressions:
~ 150M events / day
~ 4TB data / day
Clicks:
~ 1M events / day
~ 35GB data / day
Conversions:
~ 450K events / day
~ 25GB data / day
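
A quick back-of-the-envelope check of what these volumes imply per event. This is only a sketch that takes the rounded slide figures at face value; decimal units (1 TB = 10**12 bytes) are an assumption, and the names DAILY, avg_event_kb and AVG_KB are made up for the illustration.

```python
# Daily volumes as quoted on the slide (approximate figures).
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 GB = 10**9 bytes.
DAILY = {
    "impressions": (150_000_000, 4 * 10**12),
    "clicks": (1_000_000, 35 * 10**9),
    "conversions": (450_000, 25 * 10**9),
}


def avg_event_kb(events_per_day: int, bytes_per_day: int) -> float:
    """Average size of one event in kilobytes (10**3 bytes)."""
    return bytes_per_day / events_per_day / 1_000


# Average size per event type, derived only from the numbers above.
AVG_KB = {name: avg_event_kb(n, b) for name, (n, b) in DAILY.items()}
```

Under these assumptions an average impression comes out at roughly 27 KB, a click at 35 KB and a conversion at roughly 56 KB.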

THE 1ST ITERATION: MUTABLE IMPRESSIONS

THE 1ST ITERATION: DRAWBACKS
Issues:
- long data migrations that overloaded the system (30 days back)
- complex servlet logic, inability to reprocess
- inflexible, varied schemas
- single-DC
- inconsistencies

THE SECOND ITERATION:
DATA-FLOW

THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE

THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topics partitioning
- partition replication
- log retention
- stateless
- efficient data consuming
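
The properties listed above can be illustrated with a toy, in-memory model of a partitioned log. This is not Kafka's API, only a minimal sketch under stated assumptions: the class PartitionedLog and its methods are hypothetical, retention is counted in records rather than time or bytes, and replication is left out.

```python
import zlib


class PartitionedLog:
    """Toy in-memory model of a partitioned, append-only topic."""

    def __init__(self, num_partitions: int, retention: int):
        self.num_partitions = num_partitions
        self.retention = retention  # max records kept per partition
        self.partitions = [[] for _ in range(num_partitions)]
        # Offset of the first record still retained in each partition.
        self.base_offset = [0] * num_partitions

    def append(self, key: str, value: str) -> tuple[int, int]:
        """Append a record; the key selects the partition.

        Returns (partition, offset) of the new record.
        """
        p = zlib.crc32(key.encode()) % self.num_partitions
        log = self.partitions[p]
        log.append(value)
        offset = self.base_offset[p] + len(log) - 1
        if len(log) > self.retention:  # retention: drop the oldest record
            log.pop(0)
            self.base_offset[p] += 1
        return p, offset

    def read(self, partition: int, offset: int, max_records: int = 100) -> list[str]:
        """Sequential read starting at an offset.

        The log keeps no per-consumer state: each consumer remembers
        its own offset and passes it in.
        """
        start = max(offset - self.base_offset[partition], 0)
        return self.partitions[partition][start:start + max_records]
```

Because the consumer owns its offset, re-reading older data is just a read from a smaller offset, as long as the records are still within retention.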

THE 2ND ITERATION: BATCH LOADING
Why Camus (a LinkedIn project):
- "Kafka to HDFS" pipeline
- map-reduce jobs, batches
- storing offsets in log files
- data partitioning
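
A Camus-style batch load can be sketched roughly as follows. This is a simplified, single-process illustration, not Camus code: the functions hourly_path and run_batch, the /data/... directory layout and the in-memory "partitions" are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_path(topic: str, ts_seconds: int) -> str:
    """Hypothetical output layout: one directory per topic and UTC hour."""
    t = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return f"/data/{topic}/{t:%Y/%m/%d/%H}"


def run_batch(topic, partitions, offsets):
    """Read each partition from its saved offset and bucket records by hour.

    partitions: {partition_id: [(ts_seconds, payload), ...]}
    offsets:    {partition_id: next offset to read}, saved by the previous run
    Returns (files, new_offsets). Persisting new_offsets alongside the
    output lets the next batch resume where this one stopped.
    """
    files = defaultdict(list)
    new_offsets = dict(offsets)
    for pid, records in partitions.items():
        start = offsets.get(pid, 0)
        for ts, payload in records[start:]:
            files[hourly_path(topic, ts)].append(payload)
        new_offsets[pid] = len(records)
    return dict(files), new_offsets
```

Running the function again with the offsets returned by the previous run only picks up records appended since then.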

THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- data serialization framework
- rich data structures
- self-describing container files
- reader & writer schemas
- binary data format
- schema registry
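
The "reader & writer schemas" point is what makes schema versioning workable. The sketch below mimics the gist of Avro's resolution rules for record fields in plain Python; it does not use the Avro library, and the names REQUIRED and resolve, as well as the dict-based schema representation, are assumptions made for the illustration.

```python
REQUIRED = object()  # sentinel: a reader field that has no default


def resolve(record: dict, writer_fields: list, reader_schema: dict) -> dict:
    """Project a record written with one schema onto a reader schema.

    reader_schema maps field name -> default value (or REQUIRED).
    Rules mirrored from Avro's record resolution:
      - field known to the writer but not the reader: ignored
      - field known to the reader but not the writer: reader default is used
      - reader field without default and missing from the writer: error
    """
    out = {}
    for name, default in reader_schema.items():
        if name in writer_fields:
            out[name] = record[name]
        elif default is not REQUIRED:
            out[name] = default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out
```

For example, a reader schema that adds a "campaign" field with a default can still read older records that were written without it.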

THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault-tolerance
Why Trident:
- transactions, exactly-once processing
- microbatches (latency & throughput)
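
The exactly-once behaviour of microbatches can be sketched with the transactional-state idea: store the id of the last applied batch next to each value and skip a batch that is replayed. This is a plain-Python illustration, not Trident's Java API; the class TransactionalCounter is hypothetical, and it assumes that batches are applied in order and that a replayed batch has the same id and the same contents.

```python
class TransactionalCounter:
    """Keeps each count together with the id of the batch that last updated it."""

    def __init__(self):
        self.state = {}  # key -> (last_txid, count)

    def apply_batch(self, txid: int, increments: dict) -> None:
        """Apply one microbatch of increments, ignoring replays per key."""
        for key, delta in increments.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # batch replayed after a failure: already applied
            self.state[key] = (txid, count + delta)

    def count(self, key: str) -> int:
        """Current count for a key (0 if the key was never updated)."""
        return self.state.get(key, (None, 0))[1]
```

Replaying batch 1 after a failure leaves the counts unchanged, which is what turns at-least-once delivery into exactly-once updates of the stored statistics.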

THE 2ND ITERATION: STATS-COUNTER TOPOLOGY

THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)
Other issues:
- Hive joins
- mutable events
- complex servlet logic

THE THIRD ITERATION:
NEW APPROACH

THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION":
"URL",
"TIME",
"CREATIVE",
...
"CLICKS",
"CONVERSIONS"
}
{ "CLICK":
"TIME",
"IMPRESSION_ID",
...
"IMPRESSION"
}
{ "CONVERSION":
"TIME",
"CLICK_ID",
...
"CLICK"
}
New approach:
- real-time processing
- publishing light events
- immutable streams of events
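
One way to read the event shapes above: a light click carries only IMPRESSION_ID, and a merge step emits a new, enriched click with the impression embedded, without changing any event already published. The sketch below illustrates that idea only; the function merge, the in-memory lookup dict and the sample field values are assumptions, not the platform's actual implementation.

```python
from types import MappingProxyType


def merge(event: dict, id_field: str, embed_field: str, parents: dict):
    """Return a new read-only event with its parent event embedded.

    event:    a light event, e.g. a click carrying only IMPRESSION_ID
    parents:  hypothetical lookup of already-seen parent events by id
    Returns None when the parent is not known yet; a real pipeline
    would have to retry or park such events.
    """
    parent = parents.get(event[id_field])
    if parent is None:
        return None
    # Build a new mapping instead of mutating the incoming event.
    return MappingProxyType({**event, embed_field: parent})


# Sample data (values are made up for the illustration).
impressions = {"i1": MappingProxyType({"URL": "http://example.com", "TIME": 1})}
light_click = {"TIME": 2, "IMPRESSION_ID": "i1"}
merged_click = merge(light_click, "IMPRESSION_ID", "IMPRESSION", impressions)

light_conversion = {"TIME": 3, "CLICK_ID": "c1"}
merged_conversion = merge(light_conversion, "CLICK_ID", "CLICK", {"c1": merged_click})
```

The light events stay untouched, and every enrichment step produces a separate event that can be published to its own stream.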

THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE

THE 3RD ITERATION: DATA-FLOW TOPOLOGY

THE 3RD ITERATION: EVENTS MERGE

SUMMARY
What we have achieved:
- multi-DC architecture
- HDFS & BigQuery streaming
- platform monitoring
- much more stable platform
- higher quality of data processing
- better data-flow monitoring, deployment & maintenance
