An Introduction to Prometheus
Time Series Denver - May 30, 2018
Introduction
● CTO & Co-Founder - FreshTracks.io - A CA Accelerator Incubation
○ “Simplifying Kubernetes Visibility”
● bob@freshtracks.io
● @bob_cotton
● Father, Fly Fisher & Avid Homebrewer
Agenda
● What is a Cloud Native Application?
● Cloud Native Application Challenges
● The 5 Pillars of Monitoring
● An Introduction to Prometheus
● What FreshTracks Provides
What is a Cloud Native Application?
Cloud Native Application
● Follows 12 Factor Application Practices
● Packaged into containers
● Follows a micro-service architecture
● Managed by a container orchestrator
○ Kubernetes, Docker Swarm, Mesos
● Usually deployed on dynamic infrastructure
○ VMware
○ Cloud providers
● Application lifecycle allows for
○ Auto-provisioning
○ Auto-scaling
○ Auto-redundancy
Cloud Native Applications Challenges
Cloud Native Challenges
● Containers are ephemeral
○ Scheduled on any node in the cluster
○ Move frequently on restarts and deployments
● Kubernetes needs to be monitored
● Kubernetes brings additional complexities
○ Resource Quotas
○ Pod and Cluster Scaling
● Challenges traditional tools
5 Pillars of Monitoring
The 5 Pillars of Monitoring
● Metrics and Alerting
● Log Analytics
● Distributed Tracing
● Application Performance Monitoring
● Real User Monitoring
Enter Prometheus
Prometheus
● Started in 2012 at SoundCloud by ex-Google Engineers
○ Open Sourced in 2015
● Patterned after “Borgmon”, Google’s container monitoring system
● Second project accepted into the CNCF after Kubernetes
● Adoption surge is tracking Kubernetes
○ 63% of teams using Kubernetes use Prometheus
Prometheus Major Features
● Label/value based time series data model
● “Pull based” metrics collection
● Service discovery mechanism
● Simple metrics format with a rich set of “exporters”
● Extremely high-performance TSDB
● Extensive query language - PromQL
● Alert Manager
● Easily installable from Helm
○ Single, statically linked binary
● Open Source Grafana used for visualization
Time Series Data Model
<identifier> → [(t0, v0), (t1, v1), (t2, v2) …]
Identifier is a collection of label/value pairs
Time stored as int64 - Millis since the epoch
Values stored as float64
Efficient storage on disk -- 1.3 bytes/sample
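The data model above can be sketched in a few lines of Python. This is only an illustration of "identifier → list of (timestamp, value) pairs", not how the Prometheus TSDB is implemented; `TinyStore` and `series_id` are hypothetical names made up for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series identifier is the full set of label/value pairs. The metric name
# is kept as the reserved label "__name__".
SeriesId = FrozenSet[Tuple[str, str]]
Sample = Tuple[int, float]  # (milliseconds since the epoch, float64 value)


def series_id(name: str, **labels: str) -> SeriesId:
    """Build a hashable identifier from a metric name and its labels."""
    return frozenset({"__name__": name, **labels}.items())


class TinyStore:
    """Hypothetical in-memory store: identifier -> time-ordered samples."""

    def __init__(self) -> None:
        self.series: Dict[SeriesId, List[Sample]] = {}

    def append(self, sid: SeriesId, t_ms: int, value: float) -> None:
        # Python floats are 64-bit, matching the float64 sample values.
        self.series.setdefault(sid, []).append((t_ms, float(value)))


store = TinyStore()
sid = series_id("http_requests_total", job="apache", status="200")
store.append(sid, 1527692400000, 5439)
store.append(sid, 1527692415000, 5447)
```

Because the identifier is a set of pairs, the order in which labels are written does not matter: the same labels always address the same series.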
Label/Value Based Data Model
● Graphite/StatsD
○ apache.192-168-5-1.home.200.http_requests_total
○ apache.192-168-5-1.home.500.http_requests_total
○ apache.192-168-5-1.about.200.http_requests_total
● Prometheus
○ http_requests_total{job="apache", instance="192.168.5.1", path="/home", status="200"}
○ http_requests_total{job="apache", instance="192.168.5.1", path="/home", status="500"}
○ http_requests_total{job="apache", instance="192.168.5.1", path="/about", status="200"}
● Selecting Series
○ *.*.home.200.http_requests_total
○ http_requests_total{status="200", path="/home"}
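To make the selection step concrete, here is a minimal Python sketch of equality matching on labels. It covers only the `label="value"` form shown on the slide (no regex or negative matchers), and the `select` helper is a hypothetical name for this example.

```python
from typing import Dict, List

Labels = Dict[str, str]

series: List[Labels] = [
    {"__name__": "http_requests_total", "job": "apache",
     "instance": "192.168.5.1", "path": "/home", "status": "200"},
    {"__name__": "http_requests_total", "job": "apache",
     "instance": "192.168.5.1", "path": "/home", "status": "500"},
    {"__name__": "http_requests_total", "job": "apache",
     "instance": "192.168.5.1", "path": "/about", "status": "200"},
]


def select(all_series: List[Labels], name: str, **matchers: str) -> List[Labels]:
    """Return the series with this metric name whose labels equal every matcher."""
    return [
        s for s in all_series
        if s.get("__name__") == name
        and all(s.get(label) == value for label, value in matchers.items())
    ]


# Equivalent in spirit to http_requests_total{status="200", path="/home"}
hits = select(series, "http_requests_total", status="200", path="/home")
```

Unlike the dotted Graphite path, the matcher does not depend on the position of a label, so labels that are not mentioned (here `job` and `instance`) are simply ignored.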
Client Data Model
● Counters
○ Always go up or get reset to 0
● Gauge
○ Tracks a real value e.g. temperature
● Histogram and Summary
○ Used for percentiles
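The difference between the first two types can be sketched as two tiny Python classes. These are hypothetical stand-ins written for this example, not the API of any Prometheus client library: a counter only accepts non-negative increments, while a gauge can be set or moved in either direction.

```python
class Counter:
    """Monotonic value: only goes up (a process restart resets it to 0)."""

    def __init__(self) -> None:
        self._value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._value += amount

    @property
    def value(self) -> float:
        return self._value


class Gauge:
    """Point-in-time value that may rise or fall, e.g. a temperature."""

    def __init__(self) -> None:
        self._value = 0.0

    def set(self, value: float) -> None:
        self._value = float(value)

    def inc(self, amount: float = 1.0) -> None:
        self._value += amount

    def dec(self, amount: float = 1.0) -> None:
        self._value -= amount

    @property
    def value(self) -> float:
        return self._value


requests = Counter()
requests.inc()
requests.inc(2)

temperature = Gauge()
temperature.set(21.5)
temperature.dec(1.5)
```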
Prometheus Service Discovery and Target Scrape
[Diagram: Prometheus uses the K8s API Server for service discovery and scrapes, on each node, the Kubelet (cAdvisor), node_exporter, and app containers, plus kube-state-metrics and other exporters; samples are stored in its TSDB.]
Prometheus Exposition Format and Exporters
● The Prometheus exposition format: text over HTTP, simple and human readable
● Supported by Sysdig and the TICK collector
○ Efforts to make it a standard
● Close to 100 exporters for various technologies
● The jmx_exporter can cover any Java/JMX application
● https://prometheus.io/docs/instrumenting/exporters/
Official Exporters:
● node_exporter
● jmx_exporter
● snmp_exporter
● haproxy_exporter
● cloudwatch_exporter
● collectd_exporter
● mysqld_exporter
● memcached_exporter
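As a rough illustration of how simple the text format is, the Python sketch below renders one metric family the way an exporter would serve it over HTTP. The `render_metric` helper is hypothetical, and it assumes label values need no escaping (the real format escapes backslashes, quotes, and newlines).

```python
from typing import Dict, List, Tuple


def render_metric(
    name: str,
    help_text: str,
    metric_type: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render one metric family as HELP and TYPE lines plus one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"path": "/home", "status": "200"}, 5439)],
)
```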
Querying Series with PromQL
● PromQL is a functional query language. Nothing like SQL
PromQL:
rate(http_requests_total[5m])
Roughly the same query in SQL-like pseudocode:
SELECT job, instance, path, status, rate(value, 5m)
FROM http_requests_total;
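What `rate(...[5m])` computes for a single series can be approximated in Python as below. This is a deliberately simplified sketch: it handles counter resets but divides by the time between the first and last sample, whereas Prometheus also extrapolates towards the window boundaries. `simple_rate` is a hypothetical name.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window: float) -> float:
    """Per-second increase of a counter over the last `window` seconds."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples in the window")
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset to 0, so the whole current
        # value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


# One sample per minute, 60 requests per minute.
samples = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 400)]
per_second = simple_rate(samples, now=300, window=300)
```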
Querying Series with PromQL
Calculate the ratio of failed requests to all requests, per path:
sum(rate(http_requests_total{status="500"}[5m])) by (path) /
sum(rate(http_requests_total[5m])) by (path)
{path="/home"} 0.014
{path="/about"} 0.027
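The aggregation and division steps of this query can be sketched in Python. The per-series rates below are made-up inputs chosen so the result matches the slide; `sum_by` is a hypothetical helper standing in for `sum(...) by (path)`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rate = Tuple[Dict[str, str], float]  # (labels, requests per second)

# Assumed per-series output of rate(http_requests_total[5m]).
rates: List[Rate] = [
    ({"path": "/home", "status": "200"}, 98.6),
    ({"path": "/home", "status": "500"}, 1.4),
    ({"path": "/about", "status": "200"}, 97.3),
    ({"path": "/about", "status": "500"}, 2.7),
]


def sum_by(series: List[Rate], label: str) -> Dict[str, float]:
    """Sum values across series, keeping only the given label."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


errors = sum_by([r for r in rates if r[0]["status"] == "500"], "path")
total = sum_by(rates, "path")

# Division only happens where both sides have the same label value, so a
# path with no 500s is absent from the result rather than reported as 0.
ratio = {path: errors[path] / total[path] for path in errors if path in total}
```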
Graphing
Dashboards with Grafana
Labels, Re-Label and Recording Rules
Oh My...
Kubernetes Labels
● Kubernetes gives us labels on all the things
● Our scrape targets live in the context of the K8s labels
○ This comes from service discovery
● We want to enhance the scraped metric labels with K8s labels
● This is why we need relabel rules in Prometheus
[Diagram: Prometheus learns about scrape targets from the K8s API Server through service discovery, scrapes each target, and writes to its TSDB.]
Labels handed to Prometheus by service discovery:
0="{__address__ 100.96.17.41}"
1="{__meta_kubernetes_namespace default}"
2="{__meta_kubernetes_pod_annotation_freshtracks_io_data_sidecar true}"
3="{__meta_kubernetes_pod_annotation_freshtracks_io_path /metrics2}"
4="{__meta_kubernetes_pod_annotation_kubernetes_io_created_by "kind":"SerializedReference"?}"
5="{__meta_kubernetes_pod_annotation_kubernetes_io_limit_ranger LimitRanger plugin set: cpu request for container prometheus-configmap-reload; cpu request for container data-sidecar}"
6="{__meta_kubernetes_pod_annotation_prometheus_io_port 8077}"
7="{__meta_kubernetes_pod_annotation_prometheus_io_scrape false}"
8="{__meta_kubernetes_pod_container_name prometheus-configmap-reload}"
9="{__meta_kubernetes_pod_host_ip 172.20.42.119}"
10="{__meta_kubernetes_pod_ip 100.96.17.41}"
11="{__meta_kubernetes_pod_label_freshtracks_io_cluster bowl.freshtracks.io}"
12="{__meta_kubernetes_pod_label_pod_template_hash 1636686694}"
13="{__meta_kubernetes_pod_label_run data-sidecar}"
14="{__meta_kubernetes_pod_name data-sidecar-1636686694-83crm}"
15="{__meta_kubernetes_pod_node_name ip-xx-xxx-xx-xxx.us-west-2.compute.internal}"
16="{__meta_kubernetes_pod_ready false}"
17="{__metrics_path__ /metrics}"
18="{__scheme__ http}"
19="{job ftio-data-sidecar-calc}"
Target labels after <relabel_config>:
{__address__ 100.96.17.41:8077}
{__scheme__ http}
{__metrics_path__ /metrics}
{job ftio-data-sidecar-calc}
{kubernetes_namespace default}
{container_name prometheus-configmap-reload}
Sample as exposed by the scrape target:
http_requests_total{region="us-east",
az="us-east-1", instance_type="m2.xlarge",
instance="i-3582k8", hostname="host1"} = 5439
Sample with the target labels attached, as seen by <metric_relabel_config>. The scraped instance label collides with the target's instance label, so with the default honor_labels: false it is renamed exported_instance:
http_requests_total{region="us-east",
az="us-east-1",
instance_type="m2.xlarge",
exported_instance="i-3582k8",
hostname="host1",
instance="100.96.17.41:8077",
job="ftio-data-sidecar-calc",
kubernetes_namespace="default",
container_name="prometheus-configmap-reload"
} = 5439
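The target relabeling step can be approximated in Python as follows. This is a much reduced sketch, assuming only plain label copies and a port taken from the `prometheus_io_port` annotation; real relabel rules are regex based and support more actions. The `relabel` function and the `copies` argument are hypothetical, and the address uses the pod IP from the discovered labels.

```python
from typing import Dict, List, Tuple

PORT_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_port"


def relabel(discovered: Dict[str, str], copies: List[Tuple[str, str]]) -> Dict[str, str]:
    """Turn discovered labels into the labels attached to every scraped sample."""
    labels = dict(discovered)
    # Append the annotated port to the scrape address, if one is present.
    if PORT_LABEL in labels:
        labels["__address__"] = f'{labels["__address__"]}:{labels[PORT_LABEL]}'
    # Copy selected service discovery labels to permanent target labels.
    for source, target in copies:
        if source in labels:
            labels[target] = labels[source]
    # If nothing set "instance", it defaults to the scrape address.
    labels.setdefault("instance", labels["__address__"])
    # Labels starting with "__" are dropped once relabeling is finished.
    return {k: v for k, v in labels.items() if not k.startswith("__")}


discovered = {
    "__address__": "100.96.17.41",
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_pod_container_name": "prometheus-configmap-reload",
    PORT_LABEL: "8077",
    "__metrics_path__": "/metrics",
    "__scheme__": "http",
    "job": "ftio-data-sidecar-calc",
}
target_labels = relabel(
    discovered,
    copies=[
        ("__meta_kubernetes_namespace", "kubernetes_namespace"),
        ("__meta_kubernetes_pod_container_name", "container_name"),
    ],
)
```

The sketch returns only the labels that end up on stored samples; the `__address__`, `__scheme__`, and `__metrics_path__` values are still what Prometheus uses to build the scrape URL.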
Recording Rules - Derivative Series
● New series can be generated by querying existing series and storing them
path:request_failures_per_requests:ratio_rate5m =
sum(rate(http_requests_total{status="500"}[5m])) by (path) /
sum(rate(http_requests_total[5m])) by (path)
High Availability
[Diagram: two Prometheus servers running side by side]
Federation
[Diagram: a Prometheus server pulling a subset of metrics from several other Prometheus servers]
Long Term Storage and External Integrations
Prometheus
remote_write
● AppOptics: write
● Chronix: write
● Cortex: read and write
● CrateDB: read and write
● Elasticsearch: write
● Gnocchi: write
● Graphite: write
● InfluxDB: read and write
● OpenTSDB: write
● PostgreSQL/TimescaleDB: read and write
● SignalFx: write
remote_read
Alerting
Alert Definition
ALERT <alert name>
EXPR <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <labelset> ]
ALERT: IngesterCrowding
EXPR: count by(ft_cluster, node) (cortex_ingester_ingested_samples_total) > 1
FOR: 30m
LABELS: severity: critical
ANNOTATIONS:
description: https://github.com/Fresh-Tracks/gke-configs/blob/master/docs/alerts.md#ingestercrowding
summary: Node {{ $labels.node }} is hosting {{ $value }} ingester pods
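The effect of the FOR clause can be sketched as a small state function in Python: the alert is pending from the first evaluation where the expression is true, and fires only once it has stayed true for the full duration. `alert_state` and the 10-minute evaluation interval are assumptions made for this example.

```python
from typing import List, Optional


def alert_state(condition_history: List[bool], interval_s: int, for_s: int) -> str:
    """Return 'inactive', 'pending' or 'firing' after a run of rule evaluations."""
    active_for: Optional[int] = None  # seconds since the condition first held
    for is_true in condition_history:
        if not is_true:
            active_for = None          # any false evaluation resets the clock
        elif active_for is None:
            active_for = 0             # first true evaluation: alert is pending
        else:
            active_for += interval_s
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_s else "pending"


# FOR: 30m, evaluated every 10 minutes (the interval is an assumption).
state = alert_state([True, True, True, True], interval_s=600, for_s=1800)
```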
Alert Manager
● Deduplication
● Grouping
● Routing
● Suppression
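Deduplication and grouping can be illustrated with a short Python sketch. It treats an alert as a plain dict of labels and assumes that two alerts with identical labels are the same alert; `group_alerts` and the label values are hypothetical, and routing and suppression are left out.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Alert = Dict[str, str]


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Collapse identical alerts, then bucket the rest by the group_by labels."""
    groups: Dict[Tuple[str, ...], Dict[FrozenSet, Alert]] = defaultdict(dict)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        fingerprint = frozenset(alert.items())
        groups[key][fingerprint] = alert  # same labels: kept only once
    return {key: list(members.values()) for key, members in groups.items()}


node1 = {"alertname": "IngesterCrowding", "ft_cluster": "bowl", "node": "node-1"}
node2 = {"alertname": "IngesterCrowding", "ft_cluster": "bowl", "node": "node-2"}

# The same node-1 alert arrives twice, e.g. from two Prometheus servers.
groups = group_alerts([node1, dict(node1), node2], ["alertname", "ft_cluster"])
```

With this input, one notification group results, containing the node-1 and node-2 alerts once each.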
Alert Manager
[Diagram: Prometheus servers send alerts to Alert Manager instances, which notify PagerDuty, VictorOps, and Slack]
FreshTracks.io
Simplifying Kubernetes Visibility
Filling the Gaps
● A small Kubernetes cluster generates > 500K unique samples
○ Which metrics are important?
● Seeing the performance of any one container is easy
○ How is the whole microservice behaving? Node? Cluster?
● Prometheus has no anomaly detection
● Dashboard creation is tedious, even if you know what to watch
● How is my service behaving in the context of the cluster?
○ How do node/container/application metrics correlate to each other?
Kubernetes Hierarchy Visibility
Namespace → Workload → Pod → Container
(Workload can be a Deployment, ReplicaSet, StatefulSet, DaemonSet, or similar)
Demo
Thanks!
We’re Hiring!
