Routing Trillions of Events Per Day @Twitter
[Figure: Clients with Client Daemons, plus an HTTP Endpoint, feed events that are aggregated by category into Storage (HDFS)]
Across millions of clients
Incoming uncompressed
Collocated with HDFS datanodes
Event groups by category
[Figure: pipeline inside a datacenter. Components: Clients with a local log collection daemon; Remote Clients over HTTP; Aggregate log events grouped by Category; Storage (HDFS); Storage (Streaming); Log Processor; Log Replicator]
[Figure: events land in RT Storage (HDFS) inside each datacenter, DC1 through DCN; further clusters shown are DW Storage (HDFS), Prod Storage (HDFS), and Cold Storage (HDFS)]
[Figure: Scribe Client Daemons and Scribe Aggregator Daemons shown alongside Flume Client Daemons and Flume Aggregator Daemons]
[Figure: Flume Agent. Client → Source → Channel → Sink → HDFS]
[Figure: Category Groups. Category 1, Category 2, and Category 3 form a Category Group; Aggregator Group 1 is shown with Agents 1 to 3 and Aggregator Group 2 with Agents 1 to 8]
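The category-group idea in the figure above can be sketched as a two-level lookup: a category maps to a category group, and the group maps to a set of aggregator agents. This is only an illustrative sketch. The agent names, the group membership, and the `pick_agent` helper are assumptions made for the example, not details from the talk; the category and group names are borrowed from the demux slides.

```python
import zlib

# Hypothetical configuration for illustration only.
CATEGORY_TO_GROUP = {
    "ads_click": "ads_group",
    "ads_view": "ads_group",
    "login_event": "login_group",
}
GROUP_TO_AGENTS = {
    "ads_group": ["agent-1", "agent-2", "agent-3"],
    "login_group": ["agent-4", "agent-5"],
}


def pick_agent(category: str, client_id: str) -> str:
    """Pick an aggregator agent from the group that serves this category."""
    group = CATEGORY_TO_GROUP[category]
    agents = GROUP_TO_AGENTS[group]
    # crc32 is stable across processes, unlike the built-in hash() for str.
    index = zlib.crc32(client_id.encode("utf-8")) % len(agents)
    return agents[index]
```

With this layout, a large group can be given more agents without touching the clients of other groups.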
To process one day of data
Output of cleaned, compressed, consolidated, and converted
Saved by processing Flume sequence files
[Figure: Demux Jobs in Datacenter 1. Category Groups ads_group/yyyy/mm/dd/hh and login_group/yyyy/mm/dd/hh are processed by the demux jobs ads_group_demuxer and login_group_demuxer into Categories ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh, and login_event/yyyy/mm/dd/hh]
Decode
Demux
Clean
[Convert]
● Scribe’s contract amounts to sending a binary blob to a port
● Scribe used newline characters to delimit records in a binary blob batch of records
● Valid records may include newline characters
● Scribe base64-encoded received binary blobs to avoid confusion with the record delimiter
● Base64 encoding is no longer necessary because we have moved to one serialized Thrift object per binary blob
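The delimiter problem described in the bullets above can be illustrated with a small sketch. It is not Scribe's code. The helper names `frame_batch` and `unframe_batch` are hypothetical; they only show why a newline-delimited batch needs an encoding, such as base64, whose output never contains a newline.

```python
import base64
from typing import List


def frame_batch(records: List[bytes]) -> bytes:
    """Base64-encode each record, then join the batch with newlines."""
    return b"\n".join(base64.b64encode(r) for r in records)


def unframe_batch(blob: bytes) -> List[bytes]:
    """Split on newlines and decode each record."""
    if not blob:
        # An empty blob is treated as an empty batch in this sketch.
        return []
    return [base64.b64decode(line) for line in blob.split(b"\n")]


# Two records, one of which legitimately contains a newline.
records = [b"login\nuser=1", b"\x00\x01binary"]

# Naive framing splits the first record in two, yielding three pieces.
naive = b"\n".join(records).split(b"\n")

# Base64 framing round-trips the original two records.
restored = unframe_batch(frame_batch(records))
```

Sending one serialized object per blob removes the need for a delimiter at all, which is why the encoding step could be dropped.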
[Figure: DEMUX reads /raw/ads_group/yyyy/mm/dd/hh/ads_group_1.seq and writes /logs/ads_click/yyyy/mm/dd/hh/1.lzo, /logs/ads_view/yyyy/mm/dd/hh/1.lzo, and /logs/ads_view/yyyy/mm/dd/hh/2.lzo]
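The demux step shown above, where one category-group input is split into per-category hourly directories, can be sketched as follows. This is a minimal in-memory sketch and not the production job. The `output_dir` and `demux` helpers are hypothetical, events are assumed to arrive as `(category, payload)` pairs, and the decode, clean, and LZO compression steps are left out.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def output_dir(category: str, hour: datetime) -> str:
    """Build the hourly output directory /logs/<category>/yyyy/mm/dd/hh."""
    return f"/logs/{category}/{hour:%Y/%m/%d/%H}"


def demux(events: Iterable[Tuple[str, bytes]], hour: datetime) -> Dict[str, List[bytes]]:
    """Group payloads from a mixed category-group input by output directory."""
    by_dir: Dict[str, List[bytes]] = defaultdict(list)
    for category, payload in events:
        by_dir[output_dir(category, hour)].append(payload)
    return dict(by_dir)


# Example: three events from one ads_group input file for a single hour.
hour = datetime(2016, 6, 28, 13)
result = demux([("ads_click", b"a"), ("ads_view", b"b"), ("ads_click", b"c")], hour)
```

Payload order within a category follows input order here; a distributed job would not necessarily preserve it.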
● Some categories are significantly larger than others (KBs vs. TBs)
● MapReduce demux? Each reducer handles a single category
● Streaming demux? Each spout or channel handles a single category
● Massive skew in partitioning by category causes long-running tasks, which slow down job completion time
● Relatively well understood fault tolerance semantics similar to MapReduce, Spark, etc.
● Tez’s dynamic hash partitioner adjusts partitions at runtime if necessary, allowing large partitions to be further partitioned so that multiple tasks, rather than one task, process events for a single category
○ More info at TEZ-3209.
○ Thanks to team member Ming Ma for the contribution!
● Easier horizontal scaling while simultaneously providing more predictable processing times
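The effect of splitting oversized partitions can be sketched with a simple planner. This is not Tez's dynamic hash partitioner or its API. The `plan_tasks` function and the `max_task_bytes` threshold are assumptions for illustration, and the plan is computed up front from known sizes rather than adjusted at runtime.

```python
import math
from typing import Dict, List, Tuple


def plan_tasks(category_bytes: Dict[str, int], max_task_bytes: int) -> List[Tuple[str, int]]:
    """Return (category, split_index) pairs, one per task.

    A small category gets one task; a category larger than
    max_task_bytes is split across several tasks.
    """
    tasks: List[Tuple[str, int]] = []
    for category, size in sorted(category_bytes.items()):
        splits = max(1, math.ceil(size / max_task_bytes))
        tasks.extend((category, i) for i in range(splits))
    return tasks


# Example with a skewed input: one large and one small category.
plan = plan_tasks({"ads_view": 10, "login_event": 1}, max_task_bytes=4)
```

With one task per category, the job finishes only when the largest category does; capping the bytes per task bounds the longest task instead.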
[Figure: Input File 1 processed by Tasks 1 to 3, then by Tasks 1 to 5]
Across all analytics clusters
Replicated to analytics clusters
[Figure: Replication Jobs. ads_click_repl, ads_view_repl, and login_event_repl operate on the hourly category directories ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh, and login_event/yyyy/mm/dd/hh across Datacenter 1 through Datacenter N]
Copy
Merge
Present
Publish