Real-Time Data Processing at RTB House
How we have grown 10x within 2 years
Bartosz Łoś, 2019
AGENDA
● our RTB platform
● the previous iterations: three different architectures
● the fourth iteration: multi-dc architecture
● our use cases: requirements and processing patterns
● kafka workers
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT
THE PREVIOUS ITERATIONS
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 2ND ITERATION: LAMBDA ARCHITECTURE
THE 3RD ITERATION: IMMUTABLE STREAMS OF EVENTS
THE FOURTH ITERATION: MULTI-DC
THE 4TH ITERATION: MAIN CHANGES
● 10x larger scale:
    ● from 350K to 3.5M bid requests/s within 2 years
● full multi-dc architecture:
    ● synchronization of user profiles
    ● merging streams of events
● fixed partitioning in all DCs:
    ● parallelism, merging, end-to-end lag
● end-to-end exactly-once processing:
    ● at-least-once output semantics & deduplication
● a few better components:
    ● new stats-counter, new data-flow
    ● logstash
    ● merger, dispatcher & loader
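The "at-least-once output semantics & deduplication" bullet can be illustrated with a small sketch. The slides do not say how deduplication is implemented, so everything here is an assumption: each event is taken to carry a unique string id, and a bounded window of recently seen ids is taken to be enough to catch redeliveries. `RecentIdDeduplicator` and `DedupDemo` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: remembers recently seen event ids and reports repeats. */
class RecentIdDeduplicator {
    private final Map<String, Boolean> seen;

    RecentIdDeduplicator(final int capacity) {
        // Access-ordered map that evicts the least recently used id once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an id is offered, false for a redelivery still in the window. */
    boolean firstTime(String eventId) {
        return seen.put(eventId, Boolean.TRUE) == null;
    }
}

class DedupDemo {
    public static void main(String[] args) {
        RecentIdDeduplicator dedup = new RecentIdDeduplicator(1000);
        // An at-least-once source may deliver the same event more than once.
        String[] delivered = {"e1", "e2", "e1", "e3", "e2"};
        int applied = 0;
        for (String id : delivered) {
            if (dedup.firstTime(id)) {
                applied++; // the side effect (e.g. a counter update) would happen here
            }
        }
        System.out.println("applied " + applied + " of " + delivered.length); // applied 3 of 5
    }
}
```

A bounded window only catches duplicates that arrive while the original id is still remembered, so the capacity would have to be sized against the longest expected redelivery delay.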
THE 4TH ITERATION: MULTI-DC ARCHITECTURE
OUR USE CASES
STATS-COUNTER: STORM TOPOLOGY (THE 2ND ITERATION)
APACHE STORM: TRIDENT + EXACTLY-ONCE STATE
APACHE STORM: PARALLELISM MODEL
MERGER (THE 4TH ITERATION)
MERGER: KAFKA CONSUMER API
DATA-FLOW: KAFKA STREAMS (THE 4TH ITERATION)
KAFKA STREAMS: PARALLELISM MODEL
KAFKA STREAMS: EXACTLY-ONCE DELIVERY
Kafka Streams:
● processing.guarantee = exactly_once
Producer:
● transactions
● enable.idempotence = true
Consumer:
● isolation.level = read_committed
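The settings on this slide could be written down as plain `java.util.Properties`, as sketched below. In the Kafka Streams configuration the guarantee value is spelled `exactly_once`, and setting it makes Streams configure its internal producer and consumer accordingly; the explicit producer and consumer properties matter when the plain clients are used directly. The broker address, application id, group id and transactional id are placeholder assumptions, and `ExactlyOnceSettings` is a hypothetical class name.

```java
import java.util.Properties;

class ExactlyOnceSettings {
    /** Kafka Streams application settings. */
    static Properties streams() {
        Properties p = new Properties();
        p.setProperty("application.id", "data-flow-example");   // placeholder
        p.setProperty("bootstrap.servers", "localhost:9092");   // placeholder
        p.setProperty("processing.guarantee", "exactly_once");
        return p;
    }

    /** Settings for a plain producer that writes inside transactions. */
    static Properties transactionalProducer() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");   // placeholder
        p.setProperty("enable.idempotence", "true");
        // A transactional.id is required before the producer can begin transactions.
        p.setProperty("transactional.id", "example-producer-1"); // placeholder
        return p;
    }

    /** Settings for a plain consumer that only sees committed transactional writes. */
    static Properties readCommittedConsumer() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");   // placeholder
        p.setProperty("group.id", "example-group");             // placeholder
        p.setProperty("isolation.level", "read_committed");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(streams());
        System.out.println(transactionalProducer());
        System.out.println(readCommittedConsumer());
    }
}
```

These objects would be passed to the `KafkaStreams`, `KafkaProducer` and `KafkaConsumer` constructors respectively; the sketch stops short of that so it runs without the Kafka libraries.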
KAFKA WORKERS
KAFKA WORKERS: MAIN FEATURES
● higher level of distribution
public interface WorkerPartitioner<K, V> {
    int subpartition(ConsumerRecord<K, V> consumerRecord);
}
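One reading of "higher level of distribution" is that records from a single Kafka partition are split further into subpartitions that can be processed independently. The sketch below shows a partitioner of that kind. It is an illustration only: `SimpleRecord` stands in for Kafka's `ConsumerRecord` so the code compiles without the Kafka client, `Subpartitioner` mirrors the shape of the `WorkerPartitioner` interface on the slide, and `KeyHashSubpartitioner` is a hypothetical implementation, not one shipped with the library.

```java
import java.util.Objects;

// Stand-in for Kafka's ConsumerRecord so this sketch compiles without the Kafka client library.
final class SimpleRecord<K, V> {
    final int partition;
    final K key;
    final V value;

    SimpleRecord(int partition, K key, V value) {
        this.partition = partition;
        this.key = key;
        this.value = value;
    }
}

// Same shape as the WorkerPartitioner interface on the slide, over the stand-in record type.
interface Subpartitioner<K, V> {
    int subpartition(SimpleRecord<K, V> record);
}

// Hypothetical implementation: hash the record key into a fixed number of subpartitions.
final class KeyHashSubpartitioner<K, V> implements Subpartitioner<K, V> {
    private final int subpartitions;

    KeyHashSubpartitioner(int subpartitions) {
        if (subpartitions <= 0) {
            throw new IllegalArgumentException("subpartitions must be positive");
        }
        this.subpartitions = subpartitions;
    }

    @Override
    public int subpartition(SimpleRecord<K, V> record) {
        // floorMod keeps the result in [0, subpartitions) even for negative hash codes.
        return Math.floorMod(Objects.hashCode(record.key), subpartitions);
    }
}

class SubpartitionDemo {
    public static void main(String[] args) {
        Subpartitioner<String, String> partitioner = new KeyHashSubpartitioner<>(4);
        SimpleRecord<String, String> a = new SimpleRecord<>(0, "user-1", "click");
        SimpleRecord<String, String> b = new SimpleRecord<>(0, "user-1", "view");
        // Records with the same key always land in the same subpartition, preserving per-key order.
        System.out.println(partitioner.subpartition(a) == partitioner.subpartition(b)); // true
    }
}
```

Hashing by key keeps all records for one key in one subpartition, which is the usual reason to prefer it over round-robin when per-key ordering matters.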
KAFKA WORKERS: MAIN FEATURES
● higher level of distribution
● possibility to pause and resume processing for a given partition
public interface WorkerTask<K, V> {
    boolean accept(WorkerRecord<K, V> record);
    void process(WorkerRecord<K, V> record, RecordStatusObserver observer);
}
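Reading the interface next to the pause/resume bullet, `accept` looks like the hook through which a task can decline a record for now and so pause its partition, while `process` receives the record once it is accepted. The sketch below illustrates that idea under stated assumptions: `TaskRecord`, `StatusObserver` and `Task` are stand-ins that mirror the slide's types so the code compiles alone, `BufferingTask` is hypothetical, and it is single-threaded for brevity.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-ins mirroring WorkerRecord, RecordStatusObserver and WorkerTask from the slides.
final class TaskRecord<K, V> {
    final K key;
    final V value;

    TaskRecord(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

interface StatusObserver {
    void onSuccess();
    void onFailure(Exception exception);
}

interface Task<K, V> {
    boolean accept(TaskRecord<K, V> record);
    void process(TaskRecord<K, V> record, StatusObserver observer);
}

// Hypothetical task: declines new records while its buffer to a slow sink is full.
final class BufferingTask<K, V> implements Task<K, V> {
    private final int maxBuffered;
    private final Queue<StatusObserver> pending = new ArrayDeque<>();

    BufferingTask(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    @Override
    public boolean accept(TaskRecord<K, V> record) {
        // Returning false asks the caller to hold the record back, i.e. to pause.
        return pending.size() < maxBuffered;
    }

    @Override
    public void process(TaskRecord<K, V> record, StatusObserver observer) {
        // The record counts as in flight until the sink acknowledges it.
        pending.add(observer);
    }

    /** Simulates the downstream sink acknowledging everything buffered so far. */
    void flush() {
        StatusObserver observer;
        while ((observer = pending.poll()) != null) {
            observer.onSuccess();
        }
    }
}

class PauseResumeDemo {
    public static void main(String[] args) {
        BufferingTask<String, String> task = new BufferingTask<>(2);
        StatusObserver noop = new StatusObserver() {
            @Override public void onSuccess() { }
            @Override public void onFailure(Exception exception) { }
        };
        TaskRecord<String, String> record = new TaskRecord<>("k", "v");
        task.process(record, noop);
        task.process(record, noop);
        System.out.println(task.accept(record)); // false: buffer is full, processing pauses
        task.flush();
        System.out.println(task.accept(record)); // true: buffer drained, processing resumes
    }
}
```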
KAFKA WORKERS: MAIN FEATURES
● higher level of distribution
● possibility to pause and resume processing for a given partition
● asynchronous processing
● tighter control of offset commits
● backpressure
● processing timeouts
public interface RecordStatusObserver {
    void onSuccess();
    void onFailure(Exception exception);
}
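The observer makes asynchronous processing possible because the outcome of a record can be reported from another thread, after the call that handed the record over has returned. A minimal sketch of that pattern follows. The `RecordStatusObserver` interface is copied from the slide (without `public`, so the file compiles under any name); `AsyncHandler` and its `handle` method are hypothetical and not part of the library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Copied from the slide.
interface RecordStatusObserver {
    void onSuccess();
    void onFailure(Exception exception);
}

// Hypothetical helper: runs work on a thread pool and reports the outcome through the observer.
final class AsyncHandler {
    private final ExecutorService pool;

    AsyncHandler(ExecutorService pool) {
        this.pool = pool;
    }

    CompletableFuture<Void> handle(String value,
                                   Function<String, String> work,
                                   RecordStatusObserver observer) {
        return CompletableFuture
                .supplyAsync(() -> work.apply(value), pool)
                .<Void>handle((result, error) -> {
                    if (error == null) {
                        observer.onSuccess();
                    } else {
                        // Unwrap the CompletionException that the future may add around the cause.
                        Throwable cause = (error instanceof CompletionException && error.getCause() != null)
                                ? error.getCause() : error;
                        observer.onFailure(cause instanceof Exception
                                ? (Exception) cause : new RuntimeException(cause));
                    }
                    return null;
                });
    }
}

class AsyncDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AsyncHandler handler = new AsyncHandler(pool);
        RecordStatusObserver printing = new RecordStatusObserver() {
            @Override public void onSuccess() { System.out.println("success"); }
            @Override public void onFailure(Exception e) { System.out.println("failure: " + e.getMessage()); }
        };
        // join() is only here so the demo waits before shutting the pool down.
        handler.handle("ok", String::toUpperCase, printing).join();
        handler.handle("bad", v -> { throw new IllegalStateException("boom"); }, printing).join();
        pool.shutdown();
    }
}
```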
KAFKA WORKERS: MAIN FEATURES
● higher level of distribution
● possibility to pause and resume processing for a given partition
● asynchronous processing
● tighter control of offset commits
● backpressure
● processing timeouts
● at-least-once semantics
● handling failures
● kafka-to-kafka, hdfs, bigquery, elasticsearch connectors
● github.com/RTBHOUSE/kafka-workers
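Asynchronous processing and at-least-once semantics together imply that an offset should only be committed once every earlier record in the partition has finished, even if later ones finish first. The sketch below shows one way to track that. It is not taken from the Kafka Workers source: `PartitionOffsetTracker` is a hypothetical class, and it assumes records of one partition are registered in offset order. It follows the Kafka convention that the committed offset is the offset of the next record to read.

```java
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical tracker for one partition: offset -> "has processing finished?".
final class PartitionOffsetTracker {
    private final TreeMap<Long, Boolean> inFlight = new TreeMap<>();

    /** Called when a record is handed to a task. */
    void consumed(long offset) {
        inFlight.put(offset, false);
    }

    /** Called when a task reports success for a record, possibly out of order. */
    void processed(long offset) {
        inFlight.replace(offset, true);
    }

    /**
     * Removes the finished prefix and returns the offset that is safe to commit,
     * or empty if the oldest in-flight record is still being processed.
     */
    OptionalLong committable() {
        long lastDone = -1;
        while (!inFlight.isEmpty() && inFlight.firstEntry().getValue()) {
            lastDone = inFlight.pollFirstEntry().getKey();
        }
        return lastDone < 0 ? OptionalLong.empty() : OptionalLong.of(lastDone + 1);
    }
}

class OffsetTrackerDemo {
    public static void main(String[] args) {
        PartitionOffsetTracker tracker = new PartitionOffsetTracker();
        for (long offset = 0; offset < 4; offset++) {
            tracker.consumed(offset);
        }
        tracker.processed(1);
        tracker.processed(0);
        tracker.processed(3);
        // Offset 2 is still in flight, so only 0 and 1 are covered by the commit.
        System.out.println(tracker.committable()); // OptionalLong[2]
        tracker.processed(2);
        System.out.println(tracker.committable()); // OptionalLong[4]
    }
}
```

If the process crashes before offset 2 finishes, a restart resumes from the last committed offset and record 3 is processed again, which is the at-least-once behavior named in the list.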
KAFKA WORKERS: PARALLELISM MODEL
THE 5TH ITERATION: KAFKA WORKERS
techblog.rtbhouse.com/jobs
