Monitoring at Scale
Migrating to Prometheus at Fastly
PROMCON 2018 | Marcus Barczak
@ickymettle
How were we monitoring Fastly?
เน Operational overhead.
เน Limited graphing functions.
เน No alerting support,
เน No real API for consuming metric data.
Growing pains with Ganglia
+ SaaS
เน Now supporting two systems.
เน Where do I put my metrics?
เน Still writing external plugins and agents.
เน Monitoring treated as a "post-release" phase.
Growing pains doubled
Scaling our infrastructure horizontally required scaling our monitoring vertically.
Third time lucky
เน Scale with our infrastructure growth,
เน Be easy to deploy and operate.
เน Engineer friendly instrumentation libraries.
เน First class API support for data access.
เน To reinvigorate our monitoring culture.โ€จ
See: https://peter.bourgon.org/observability-the-hard-parts/
Project goals
เน Build a proof of concept.
เน Pair with pilot team to instrument their services.
เน Iterate through the rest.
เน Run both systems in parallel.
เน Decommission SaaS system and Ganglia.
Getting started
Infrastructure build
[Architecture diagram, built up over several slides]
• Each datacenter shown (SJC, JFK, ATL) runs a pair of servers, prometheus A and prometheus B; both scrape the targets in that datacenter.
• GCP hosts federator A, federator B and the frontend stack.
• Query traffic (TLS) links GCP to the datacenters.
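As a rough illustration of how a federator could pull from each datacenter pair, the sketch below builds URLs for Prometheus's standard `/federate` endpoint, which takes one or more `match[]` selectors. The slides do not say which selectors, hostnames or ports were used, so the hostnames, the port and the selector here are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Hypothetical hostnames for the A/B pair in each datacenter (not from the talk).
DATACENTER_SERVERS = {
    "SJC": ["prometheus-a.sjc.example.net", "prometheus-b.sjc.example.net"],
    "JFK": ["prometheus-a.jfk.example.net", "prometheus-b.jfk.example.net"],
    "ATL": ["prometheus-a.atl.example.net", "prometheus-b.atl.example.net"],
}


def federate_url(host, selectors, port=443):
    """Build a /federate URL carrying one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"https://{host}:{port}/federate?{query}"


def all_federate_urls(servers_by_datacenter, selectors):
    """Return {datacenter: [url, ...]} covering both servers of every pair."""
    return {
        datacenter: [federate_url(host, selectors) for host in hosts]
        for datacenter, hosts in servers_by_datacenter.items()
    }


if __name__ == "__main__":
    # Assumed selector: federate only the node_exporter job.
    urls = all_federate_urls(DATACENTER_SERVERS, ['{job="node_exporter"}'])
    for datacenter, datacenter_urls in urls.items():
        for url in datacenter_urls:
            print(datacenter, url)
```

Pulling from both servers of a pair keeps the global view available when one of them is down, at the cost of duplicate series that the federator then has to tell apart, for example by a per-server label.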
Prometheus Server Software Stack
• Ghost Tunnel: TLS termination and auth.
• Service Discovery Sidecar: target configuration.
• Rules Loader: recording and alert rules.
• Prometheus
Typical Server Software Stack
• Service Discovery Proxy: service discovery and TLS exporter proxy.
• Exporters: built into services or run as a sidecar.
Build your own service discovery?
Fastly's infrastructure is bare metal hardware: no cloud conveniences.
เน Automatic discovery of targets.
เน Self-service registration of exporter endpoints,
เน TLS encryption for all exporter tra๏ฌƒc.
เน Minimal exposure of exporter TCP ports.
Service discovery requirements
Prometheus Server Software Stack
• Ghost Tunnel: TLS termination and auth.
• PromSD Sidecar: target configuration.
• Prometheus
Typical Server Software Stack
• PromSD Proxy: service discovery and TLS exporter proxy.
• Exporters: built into services or run as a sidecar.
How the pieces interact:
• The PromSD sidecar queries the PromSD proxy for available targets.
• The PromSD sidecar generates config for Prometheus.
• Prometheus scrapes proxied targets over TLS.
PromSD sidecar
1. The promsd sidecar fetches the list of hosts in a datacenter from configly:
    "exporter_hosts": [
      "10.0.0.1",
      "10.0.0.2",
      "10.0.0.3",
      "10.0.0.4"
    ]
2. It requests the /targets endpoint of the promsd proxy on each host to get the list of available scrape targets.
3. It outputs all targets as a file service discovery JSON file:
    [
      {
        "targets": [
          "10.0.0.1:9702",
          "10.0.0.2:9702"
        ],
        "labels": {
          "__metrics_path__": "/node_exporter_9100/metrics",
          "job": "node_exporter"
        }
      },
      {
        "targets": [
          "10.0.0.1:9702",
          "10.0.0.2:9702"
        ],
        "labels": {
          "__metrics_path__": "/varnishstat_exporter_19102/metrics",
          "job": "varnishstat_exporter"
        }
      }
    ]
4. Prometheus reads the file and scrapes the configured targets.
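The sidecar steps can be sketched roughly as follows. This is not Fastly's code: the shape of the `/targets` response (a list of objects with `job` and `metrics_path` keys) is an assumption, and the HTTP call is left as an injected `fetch_targets` function so the sketch stays runnable without a network. Only the proxy port 9702 and the output layout come from the slide.

```python
import json
import os
import tempfile
from collections import defaultdict

PROXY_PORT = 9702  # proxy port as shown in the slide's example output


def build_file_sd(hosts, fetch_targets, proxy_port=PROXY_PORT):
    """Group hosts by (job, metrics path) into file SD target groups.

    fetch_targets(host) is a hypothetical callable returning a list like
    [{"job": "node_exporter", "metrics_path": "/node_exporter_9100/metrics"}].
    """
    groups = defaultdict(list)  # (job, metrics_path) -> ["host:port", ...]
    for host in hosts:
        for target in fetch_targets(host):
            key = (target["job"], target["metrics_path"])
            groups[key].append(f"{host}:{proxy_port}")
    return [
        {
            "targets": sorted(addresses),
            "labels": {"__metrics_path__": path, "job": job},
        }
        for (job, path), addresses in sorted(groups.items())
    ]


def write_file_sd(path, entries):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        json.dump(entries, tmp, indent=2)
    os.replace(tmp.name, path)


if __name__ == "__main__":
    def fake_fetch(host):
        # Stand-in for an HTTPS request to the proxy's /targets endpoint.
        return [{"job": "node_exporter",
                 "metrics_path": "/node_exporter_9100/metrics"}]

    print(json.dumps(build_file_sd(["10.0.0.1", "10.0.0.2"], fake_fetch), indent=2))
```

Setting `__metrics_path__` per target group is what lets every exporter on a host share the single proxy port while still being scraped as a separate job.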
PromSD proxy
1. The promsd proxy fetches the list of installed systemd services (for example node_exporter, process_exporter, varnishstat_exporter) from systemd.
2. For each corresponding systemd service, it fetches the local exporter target address from configly:
    "node_exporter": {
      "prometheus_properties": {
        "target": "127.0.0.1:9100"
      }
    },
    …
    "varnishstat_exporter": {
      "prometheus_properties": {
        "target": "127.0.0.1:19102"
      }
    }
3. It exposes an API used by Prometheus and the promsd sidecar:
    /node_exporter_9100/metrics
    /varnishstat_exporter_19102/metrics
    /targets
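A minimal sketch of the proxy's routing table, assuming the proxied path is always `/<service>_<port>/metrics` (the pattern visible in the slide) and that each local exporter serves on `/metrics`. The function names and the shape of the `/targets` listing are hypothetical; the real proxy also terminates TLS, which is omitted here.

```python
def proxied_path(service, target):
    """Derive the externally visible path from a service name and host:port."""
    port = target.rsplit(":", 1)[1]
    return f"/{service}_{port}/metrics"


def build_routes(services):
    """Map each proxied path to the local exporter URL it forwards to.

    services mirrors the slide's structure:
    {"node_exporter": {"prometheus_properties": {"target": "127.0.0.1:9100"}}}
    """
    routes = {}
    for name, properties in services.items():
        target = properties["prometheus_properties"]["target"]
        # Assumption: the exporter exposes its metrics on /metrics locally.
        routes[proxied_path(name, target)] = f"http://{target}/metrics"
    return routes


def targets_listing(services):
    """Assumed /targets payload: one entry per exporter, sorted by job name."""
    return [
        {
            "job": name,
            "metrics_path": proxied_path(
                name, properties["prometheus_properties"]["target"]
            ),
        }
        for name, properties in sorted(services.items())
    ]


if __name__ == "__main__":
    example = {
        "node_exporter": {"prometheus_properties": {"target": "127.0.0.1:9100"}},
        "varnishstat_exporter": {"prometheus_properties": {"target": "127.0.0.1:19102"}},
    }
    for path, upstream in build_routes(example).items():
        print(path, "->", upstream)
```

Because exporters bind to 127.0.0.1 and are reachable only through these paths, the proxy's port is the only exporter port that has to be opened on the host.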
เน Really easy to leverage the ๏ฌle SD mechanism.
เน New targets can be added with one line of con๏ฌg.
เน TLS and authentication everywhere.
เน Single exporter port open per host.
It worked!
Prometheus Adoption
Prometheus at Scale at Fastly
114 Prometheus servers globally
28.4M time series
2.2 million samples/second
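For a sense of scale, the three figures can be turned into per-server averages. This is back-of-the-envelope arithmetic only: it reads the ingest rate as 2.2 million samples per second and assumes load is spread evenly across servers, which the slides do not claim.

```python
# Figures as stated on the slide.
servers = 114
time_series = 28_400_000
samples_per_second = 2_200_000

# Averages under the (unverified) assumption of an even spread across servers.
series_per_server = time_series / servers            # roughly 249,000 series
samples_per_server = samples_per_second / servers    # roughly 19,300 samples/s

# If every series is sampled at the same rate, this is the average interval
# between two samples of one series.
implied_interval_seconds = time_series / samples_per_second  # roughly 12.9 s

if __name__ == "__main__":
    print(f"series per server:  {series_per_server:,.0f}")
    print(f"samples/s per server: {samples_per_server:,.0f}")
    print(f"implied interval:   {implied_interval_seconds:.1f} s")
```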
... a few hours later
เน Engineers love it.
เน Dashboard and alert quality have increased.
เน PromQL enables some deep insights.
เน Scaling linearly with our infrastructure growth.
Prometheus wins
เน Metrics exploration without prior knowledge.
เน Alertmanager's ๏ฌ‚exibility.
เน Federation and global views.
เน Long term storage still an open question.
Still some rough edges.
Thanks!
@ickymettle fastly.com

