May 15, 2018
The Four R’s of Metrics Delivery
Jim Hagan
A Little Background
• Wayfair currently uses both Graphite and InfluxDB as time series
platforms.
• We send a very diverse set of event trackers, timers, and other system
metrics from over 2000 VMs running hundreds of applications.
• Our systems are spread across three major data centers. The data is
used by our developers, business stakeholders, and our internal alerting
engine.
• Most importantly, our 24x7 Ops Monitoring Center uses this data to
constantly analyze the vital signs of Wayfair’s IT infrastructure and
storefront operations.
Our legacy data pipeline was an elaborate series of services relaying data
from data center to data center over UDP (used to avoid blocking calls from
the client). This configuration was very fast (and even supported
replication) but lacked a number of the elements we are looking for in our
metrics pipeline.
Our next-generation pipeline takes advantage of Kafka and the Telegraf
streaming service to create a more robust data topology. Essentially, this
allows us to explicitly implement...
The Four R’s:
• Resiliency
• Redundancy
• Routability
• Retention
The Four R’s
How we want to think of our data pipeline...
Application(s) → Time Series Database → Sweet Insights!
How it really is (current generation)...
(previous generation)...
What requirements are driving that architecture…
• Receive data from anywhere in simple text format (dynamic schemas) with a low
overhead network protocol.
• Handle in the range of 100 million data points per second (in the aggregate).
• Blacklist, whitelist, and redirect metrics into more than one stream to achieve
scaling and other business objectives.
• Handle spikes of 3x to 5x average peak traffic.
• Buffer that raw data for a reasonable amount of time (say 24 hours).
• Ingest data when and where we please.
• Make 99% of data available for time series queries within 30 seconds.
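The last requirement is easy to check mechanically. Here is a minimal sketch, assuming we can measure for each data point the delay between when it was sent and when it became queryable. The function names and the sample numbers are hypothetical; only the 99% and 30 second figures come from the slide.

```python
def fraction_within(latencies_s, limit_s=30.0):
    """Return the fraction of data points queryable within limit_s seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    on_time = sum(1 for latency in latencies_s if latency <= limit_s)
    return on_time / len(latencies_s)


def meets_target(latencies_s, limit_s=30.0, target=0.99):
    """True when at least `target` of the points arrived within limit_s."""
    return fraction_within(latencies_s, limit_s) >= target


if __name__ == "__main__":
    # Hypothetical sample: 995 fast points and 5 slow ones.
    sample = [2.0] * 995 + [45.0] * 5
    print(fraction_within(sample))
    print(meets_target(sample))
```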
Breaking those requirements down conceptually, we arrive at the “four R’s”:
• Routability (keep, drop, redirect)
• Retention (keep data in pre-digested or final format for some time)
• Resilience (survive network or DC failure, recover data after a DB failure,
survive a massive flood of data)
• Redundancy (replication of raw data for failover and purpose-built DBs)
Conceptual architecture to support the four R’s
LOAD BALANCE → RECEIVE/HANDLE → BUFFER → INGEST → PERSIST
[Diagram: each stage is labeled with the R’s it supports. Resilience appears five times, Routability four times, and Redundancy and Retention twice each.]
Conceptual architecture to support the four R’s
• LOAD BALANCE: NGINX (UDP LB) >> UDP
• RECEIVE/HANDLE: Telegraf Receiver >> UDP
• BUFFER: Kafka Local >> TCP, Kafka Aggregate >> TCP
• INGEST: Telegraf Ingest >> TCP
• PERSIST: InfluxDB >> TCP
We use the UDP load balancing plugin for NGINX. This allows us to take very high data rates, including the
3x to 5x spikes we referred to earlier, and efficiently route them to an array of Telegraf hosts. Three UDP
load balancers can feed into dozens of Telegraf hosts.
In addition, we can use different port designations to route traffic. For example, we could take all UDP
traffic arriving on port 8094 and send it to one array of receivers, and send all traffic arriving on port
8095 to another array of receivers.
This gives us RESILIENCY and a means of top-level ROUTABILITY.
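From the client's point of view, port-based routing can be sketched roughly as follows. This assumes the "simple text format" is InfluxDB line protocol, which the slides do not state. Ports 8094 and 8095 come from the slide, while the stream names, the measurement, and the host address are hypothetical. The sketch handles numeric field values only and does no escaping.

```python
import socket

# Hypothetical mapping of traffic class to load balancer port. The ports
# come from the slide; the class names are made up for this sketch.
PORT_FOR_STREAM = {"application": 8094, "system": 8095}


def format_point(measurement, tags, fields):
    """Build one text record: measurement,tag=value field=value."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part}"


def send_metric(line, host, stream):
    """Fire-and-forget UDP send; the client never waits for a reply."""
    port = PORT_FOR_STREAM[stream]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
    return port


if __name__ == "__main__":
    line = format_point("checkout_timer", {"dc": "boston"}, {"ms": 12.5})
    # 127.0.0.1 stands in for the load balancer address.
    send_metric(line, "127.0.0.1", "application")
```

Because the send is UDP, a slow or missing receiver never blocks the application. That is the same trade-off the legacy pipeline made.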
We use several features of Telegraf to perform basic traffic shaping…
We use tag and measurement filters to
1. drop certain metrics
2. keep certain metrics
3. route certain metrics to a specific Kafka topic
4. route to specific Kafka brokers
We are getting both RESILIENCE and ROUTABILITY in this layer.
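The keep, drop, and route decisions above can be modeled in a few lines. This is a toy model of the decision logic only, not Telegraf's implementation or configuration syntax. All patterns, tag names, and topic names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, for illustration only. Patterns are shell-style globs.
DROP_MEASUREMENTS = ["debug_*"]
DROP_TAGS = {"env": ["dev*"]}
TOPIC_RULES = [
    ("checkout_*", "metrics_storefront"),
    ("cpu*", "metrics_system"),
]


def route(measurement, tags):
    """Return the Kafka topic for a metric, or None to discard it."""
    if any(fnmatchcase(measurement, p) for p in DROP_MEASUREMENTS):
        return None  # measurement is blacklisted
    for key, patterns in DROP_TAGS.items():
        value = tags.get(key)
        if value is not None and any(fnmatchcase(value, p) for p in patterns):
            return None  # tag value is blacklisted
    for pattern, topic in TOPIC_RULES:
        if fnmatchcase(measurement, pattern):
            return topic  # first matching rule wins
    return None  # not whitelisted, so it is dropped as well


if __name__ == "__main__":
    print(route("checkout_timer", {"env": "prod"}))
    print(route("debug_trace", {"env": "prod"}))
```

Routing to specific brokers (item 4) would be one more lookup from topic to broker list, omitted here for brevity.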
The local Kafka layer gives us an immediate place to store the metrics coming into the system. No
expensive processing needs to be applied to the data yet. We configure several different “mirroring”
services to copy different topics from local Kafka instances to what we call “Aggregate” Kafka instances. All
of this communication happens with minimal transformation.
We are getting RESILIENCE, ROUTABILITY, and RETENTION in this layer.
The Four R’s: Kafka Local/Aggregate Mirroring Topology
LOCAL: C1 Kafka (write-only) in Seattle, Boston, and Europe
AGGREGATE: C3 Kafka (read-only) in Seattle, Boston, and Europe
The C1 clusters are used for write-only in the local data centers. In addition, each data center
has an “aggregate” cluster (C3) for consumption of the integrated data stream. So if I’m in Boston,
our Telegraf ingest layer can consume data from all three data centers.
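The local/aggregate relationship can be illustrated with a toy in-memory model. This is not Kafka's mirroring tooling. The function name and record shapes are hypothetical; only the three data center names and the C1/C3 roles come from the slide.

```python
DATA_CENTERS = ["Seattle", "Boston", "Europe"]


def mirror(local_clusters):
    """Copy every local (C1) record into each data center's aggregate (C3)."""
    merged = []
    for dc in DATA_CENTERS:
        # Tag each record with its origin so consumers can tell them apart.
        merged.extend((dc, record) for record in local_clusters.get(dc, []))
    # Every data center gets its own full copy of the integrated stream.
    return {dc: list(merged) for dc in DATA_CENTERS}


if __name__ == "__main__":
    local = {"Seattle": ["cpu 1"], "Boston": ["cpu 2"], "Europe": ["cpu 3"]}
    aggregate = mirror(local)
    # A consumer in Boston sees records from all three data centers.
    print(aggregate["Boston"])
```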
We have a second layer of Telegraf acting as a Kafka consumer. This allows us to subscribe to the topics
we care about and route them to the DB of our choice. We can also deploy multiple layers of ingest and
populate multiple databases.
We are getting RESILIENCE, ROUTABILITY, and REDUNDANCY in this layer.
Some architectural recipes (Dynamic Topic Routing)
NGINX (UDP LB) → Telegraf Receiver (keep Metric A and Metric B, discard Metric C)
• Kafka Topic A (receives metric A) → Telegraf Ingest (consume Topic A) → InfluxDB for Topic A
• Kafka Topic B (receives metric B) → Telegraf Ingest (consume Topic B) → InfluxDB for Topic B
Some architectural recipes (Redundancy)
Kafka (Topic A)
• Telegraf Ingest (Topic A) → InfluxDB for Data Science (Longer Retention)
• Telegraf Ingest (Topic A) → InfluxDB for Alerts (Short Retention)
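A toy model of this recipe is sketched below. It assumes each ingest layer reads the topic under its own Kafka consumer group, so each receives every record; the slides do not say how the two ingest layers are configured. The class names, group names, and retention windows are hypothetical.

```python
class Topic:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}

    def publish(self, record):
        self.records.append(record)

    def poll(self, group):
        """Return records this group has not seen yet and advance its offset."""
        start = self.offsets.get(group, 0)
        self.offsets[group] = len(self.records)
        return self.records[start:]


class Database:
    """Toy store that discards points older than its retention window."""

    def __init__(self, retention_s):
        self.retention_s = retention_s
        self.points = []

    def write(self, points, now):
        # Each point is a (timestamp_seconds, value) tuple.
        self.points.extend(points)
        self.points = [p for p in self.points if now - p[0] <= self.retention_s]


if __name__ == "__main__":
    topic = Topic()
    for ts in (0, 50, 100):
        topic.publish((ts, 1.0))
    alerts = Database(retention_s=60)      # short retention
    science = Database(retention_s=3600)   # longer retention
    alerts.write(topic.poll("ingest-alerts"), now=100)
    science.write(topic.poll("ingest-science"), now=100)
    print(len(alerts.points), len(science.points))
```

Because the two groups track their offsets independently, a slow or failed database does not hold back the other one, and it can catch up later from whatever Kafka still retains.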
Monitoring It All (Load Balancer Layer)
Monitoring It All (Telegraf Layers)
Monitoring It All (Telegraf Layers)
Monitoring It All (Kafka)
References
https://tech.wayfair.com/2018/04/time-series-data-at-wayfair/
https://docs.influxdata.com/telegraf/v1.6/
https://kafka.apache.org/