How to Manage 600 Prometheus Instances?
Geoffrey Beausire
SRE @ Criteo
g.beausire@criteo.com
Who are we?
● We do personalized ad recommendations (aka “retargeting”)
On the technical side:
● More than 40 000 servers across 10 datacenters
● 2 large Hadoop clusters (100K vcpus / 200 PB each)
● 6M requests per second (<100 ms)
Observability at Criteo
● Dedicated team of 4 people
● Maintain observability stack:
○ Metrics
○ Logs
○ Tracing
● Prometheus:
○ 642 instances
○ 3M samples per second
○ Most common resolution is one minute
○ More than 300 committers on the configuration files
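As a rough back-of-the-envelope reading of these figures, the short calculation below estimates the number of active series and the average load per instance. It assumes, for simplicity, that every series is scraped at the one-minute resolution and that load is spread evenly across instances. The slides state neither, so treat the results as orders of magnitude only.

```python
# Back-of-the-envelope sizing from the figures on this slide.
# Assumptions (not from the talk): every series is scraped once per minute
# and the load is evenly spread across instances.
SAMPLES_PER_SECOND = 3_000_000
INSTANCES = 642
SCRAPE_INTERVAL_SECONDS = 60

# Each active series yields one sample per scrape interval.
approx_active_series = SAMPLES_PER_SECOND * SCRAPE_INTERVAL_SECONDS
avg_samples_per_instance = SAMPLES_PER_SECOND / INSTANCES

print(f"~{approx_active_series:,} active series overall")
print(f"~{avg_samples_per_instance:,.0f} samples/s per instance on average")
```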
How do we manage 600 instances of Prometheus?
We don’t!
Each team is responsible for their own instances
Using perimeters
[Diagram: two example perimeters, Observability and NoSQL. Each perimeter has global instances (EU-1, US-1) plus local instances per datacenter: EU-1, EU-2, US-1, US-2 and AS-1 for one perimeter; EU-1, US-1 and AS-1 for the other.]
● One team has one perimeter
● Scrape services owned by the team
● Global/Local topology
● Running on Mesos
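The perimeter layout can be modelled in a few lines. The sketch below is only an illustration: the `Perimeter` class, the hostname pattern, and the datacenter lists are hypothetical, and the slides do not say how global instances obtain data from local ones. Prometheus federation (the `/federate` endpoint with `match[]` selectors) is assumed here as one plausible mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perimeter:
    """One team's set of Prometheus instances (hypothetical model)."""
    name: str
    global_dcs: tuple[str, ...]
    local_dcs: tuple[str, ...]


def instance_names(p: Perimeter) -> list[str]:
    """List every instance of a perimeter, globals first."""
    names = [f"{p.name}-{dc}-global" for dc in p.global_dcs]
    names += [f"{p.name}-{dc}-local" for dc in p.local_dcs]
    return names


def federate_scrape_config(p: Perimeter) -> dict:
    """Scrape config a global instance could use to federate from locals.

    Assumption: globals pull aggregated series (recording rules, whose
    names contain a colon) from every local instance of the perimeter.
    Hostnames are made up for the example.
    """
    return {
        "job_name": "federate-locals",
        "honor_labels": True,
        "metrics_path": "/federate",
        "params": {"match[]": ['{__name__=~".+:.+"}']},
        "static_configs": [{
            "targets": [
                f"prometheus-{p.name}-{dc}-local.example:9090"
                for dc in p.local_dcs
            ]
        }],
    }


example = Perimeter("example", ("eu-1", "us-1"), ("eu-1", "us-1", "as-1"))
print(instance_names(example))
print(federate_scrape_config(example)["static_configs"][0]["targets"])
```

Keeping the perimeter as data makes it straightforward to generate one configuration per instance rather than maintaining each file by hand.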
Why?
Advantages:
● Low maintenance cost
● Isolation between teams
● Freedom of usage
● Clear ownership separation
Disadvantages:
● Added workload to the client teams
● High entry cost:
○ learning how Prometheus works
○ learning how it is used at Criteo
How to make it easier for teams to use Prometheus?
Reducing the workload by providing shared services
● Alertmanager
○ Generic routes are provided and preconfigured (send pages or Slack messages based on specific labels)
● Common exporters (e.g. Blackbox exporter)
● Graphite Remote Adapter
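The idea of generic, label-driven routes can be sketched as follows. This models the routing decision only, not Alertmanager's configuration syntax, and the label names (`severity`, `slack_channel`) and receiver names are hypothetical; the slides do not say which labels Criteo uses.

```python
def receivers_for(labels: dict[str, str]) -> list[str]:
    """Pick receivers for an alert from its labels (illustrative only).

    Hypothetical convention: `severity: page` triggers a page, and a
    `slack_channel` label sends a message to that channel.
    """
    receivers = []
    if labels.get("severity") == "page":
        receivers.append("pager")
    channel = labels.get("slack_channel")
    if channel:
        receivers.append(f"slack:{channel}")
    # Fall back to a catch-all receiver so no alert is silently dropped.
    return receivers or ["default"]


print(receivers_for({"severity": "page", "slack_channel": "#nosql-alerts"}))
print(receivers_for({"severity": "info"}))
```

With routes like these owned centrally, a team only has to set the right labels on its alerts instead of maintaining its own notification configuration.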
Reducing the workload with efficient tooling
● Reliable meta-monitoring:
○ Users can choose how to be alerted
○ Alerts are actionable and well-documented
● Tooling is available to debug more complex issues (e.g. out-of-order errors)
● Grafana dashboard for Prometheus stats
Reducing the workload by acting as a consulting team
The Observability team:
● Gives users advice on how best to monitor their applications
● Digs deeper into complicated issues
● Takes care of upgrades
● Provides workshops to onboard new users
Reducing the entry cost through automation and self-service
● Creating a new perimeter is easy thanks to Jenkins (it creates all the configurations and staging pipelines)
● Prometheus is simple to test locally (one script to execute)
● Configuration is tested when creating a review:
○ Rules syntax is checked (thanks to promtool)
○ Documentation on alerts is enforced
○ Slack channels and OpsGenie receivers are validated
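A review-time check along these lines might look like the sketch below. It is not Criteo's actual tooling: the `documentation` annotation and `slack_channel` label are hypothetical names, and the input is assumed to be an already-parsed Prometheus rule file (a list of groups, each with a list of rules). Syntax checking itself would be left to `promtool check rules`.

```python
def check_rule_groups(groups: list[dict], known_slack_channels: set[str]) -> list[str]:
    """Return a list of policy violations found in parsed rule groups.

    Hypothetical policy: every alerting rule needs a non-empty
    `documentation` annotation, and any `slack_channel` label must
    refer to a channel known to exist.
    """
    errors = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are not subject to this policy
            name = rule["alert"]
            if not rule.get("annotations", {}).get("documentation"):
                errors.append(f"{name}: missing 'documentation' annotation")
            channel = rule.get("labels", {}).get("slack_channel")
            if channel is not None and channel not in known_slack_channels:
                errors.append(f"{name}: unknown Slack channel {channel!r}")
    return errors


groups = [{
    "name": "example",
    "rules": [
        {"record": "job:up:sum", "expr": "sum by (job) (up)"},
        {"alert": "InstanceDown", "expr": "up == 0",
         "labels": {"slack_channel": "#typo"}},
    ],
}]
for error in check_rule_groups(groups, {"#nosql-alerts"}):
    print(error)
```

Running such a check in the review pipeline gives committers feedback before a broken or undocumented alert reaches production.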
Some general advice
● Start small
● Automate progressively
● Listen to the users’ feedback
Criteo is hiring!
g.beausire@criteo.com
Thank you
