Monitoring That Actually Works
Back when I worked as an SRE, PagerDuty alerts gave me far more nightmares than any Slack notification ever did. My phone would blow up with the same pages, usually at odd hours:
"Database connection pool exhausted" "API response time above threshold" "Memory usage critical" "Disk space low"
Alerts like these will turn your nights into a disaster if you don't keep them in check. That's how I learned the difference between monitoring and noise. Most of us are drowning in the latter.
The Alert That Cried Wolf
Here's the thing about monitoring - if everything is critical, nothing is critical. I've seen teams get so desensitized to alerts that they miss actual outages because they're buried under false positives.
The golden rule: Only alert on things that require immediate human action.
This means:
- If it can wait until morning, it's a ticket, not a page
- If no human action is required, it belongs on a dashboard, not in PagerDuty
- If an alert fires and nobody ever acts on it, delete it
Metrics That Actually Matter
After dealing with monitoring for years, I've narrowed it down to four categories that cover 95% of what you need:
1. The RED Method (for services)
Rate - How many requests per second
Errors - How many of those requests failed
Duration - How long those requests took
# prometheus
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Duration (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
2. The USE Method (for resources)
Utilization - How busy the resource is
Saturation - How much work is queued
Errors - Count of error events
# PromQL
# CPU utilization
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
node_disk_io_time_seconds_total
3. Business Metrics
This is where most teams fail. You need metrics that business people care about:
- Revenue or orders per hour
- Active users
- Conversion on the flows that make money (signup, checkout)
- Error budget remaining
A sketch of how to query these is below.
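These depend on your own instrumentation; the counter names below (orders_total, revenue_cents_total) are hypothetical, but once your app exposes them the queries are no harder than the RED ones.
# PromQL (sketch; orders_total and revenue_cents_total are hypothetical app counters)
# Orders per minute
sum(rate(orders_total[5m])) * 60
# Revenue per hour (counter kept in cents)
sum(rate(revenue_cents_total[1h])) * 3600 / 100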
4. Infrastructure Health
Basic stuff, but critical:
- Disk space and disk I/O
- CPU and memory
- Network errors
- Whether the host (and its exporter) is up at all
A sample disk-space alert is sketched below.
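For example, a low-disk-space alert built on node_exporter metrics might look like this (the 10% threshold and the severity label are assumptions, tune them for your fleet):
# YAML (sketch)
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Less than 10% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})"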
Alerting Rules That Don't Suck
Here's my framework for writing alerts that people actually respond to:
1. Make it actionable
Bad: "High CPU usage"
Good: "API response time above 500ms for 5 minutes - check application logs and consider scaling"
2. Include context
#YAML
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
    runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.company.com/d/service-overview"
3. Use tiered alerting
Not every problem needs to wake someone up:
- Critical: user-facing impact, needs action now - page the on-call
- Warning: needs attention soon - send it to Slack or a ticket queue
- Info: nice to know - keep it on a dashboard
A minimal Alertmanager routing sketch for this is below.
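Here's roughly what that routing looks like in Alertmanager, assuming your alert rules set a severity label (receiver names, keys, and channels are placeholders):
# alertmanager.yml (sketch)
route:
  receiver: slack-warnings          # default: non-urgent alerts go to Slack
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # only critical alerts page a human
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<your-webhook>"
        channel: "#alerts"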
The Monitoring Stack That Works
After trying everything from Nagios to DataDog, here's what I recommend:
For Metrics: Prometheus + Grafana
Why Prometheus:
- Pull-based model that's simple to reason about and debug
- PromQL is genuinely powerful once it clicks
- Huge ecosystem of exporters for almost anything
- Open source, and it pairs naturally with Grafana and Alertmanager
Basic Prometheus setup:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
Essential exporters:
- node_exporter for host metrics (CPU, memory, disk)
- cAdvisor for container metrics
- blackbox_exporter for probing endpoints from the outside
- An exporter for your database (postgres_exporter, mysqld_exporter, etc.)
For Logs: ELK Stack or Loki
ELK (Elasticsearch, Logstash, Kibana) if you need complex log analysis.
Loki + Grafana if you want something simpler and cheaper.
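If you go the Loki route, shipping logs is usually just Promtail with a config along these lines (the Loki URL and log paths are placeholders):
# promtail-config.yaml (sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log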
For Traces: Jaeger or Zipkin
Only if you're doing microservices and need to trace requests across services.
Setting Up Meaningful Dashboards
I see too many dashboards that look impressive but tell you nothing useful. Here's how to build dashboards people actually use:
1. Start with the user journey - can users actually do the thing they came to do right now?
2. Add infrastructure context - the services and resources those journeys depend on.
3. Use the inverted pyramid - most important information at the top, details further down.
Sample Grafana dashboard structure:
Row 1: Business KPIs
- Revenue/hour
- Active users
- Error budget remaining
Row 2: Application Health
- Request rate
- Error rate
- Response time (p50, p95, p99)
Row 3: Infrastructure
- CPU/Memory usage
- Database connections
- Queue depths
SLIs and SLOs That Make Sense
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) sound fancy, but they're just a way to define "good enough."
Good SLIs: request latency (p95/p99), error rate, availability as users actually experience it.
Bad SLIs: CPU usage, memory usage, anything internal that users never feel directly.
Error budgets are your friend: If your SLO is 99.9% uptime, you have 43 minutes of downtime per month. Use it to take risks and deploy new features.
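In Prometheus terms, you can track that budget with the same request metrics used earlier. This is a sketch assuming a 99.9% SLO; 30-day range queries are expensive, so in practice you'd back them with recording rules:
# PromQL (sketch)
# Availability SLI over the last 30 days
sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
# Fraction of the 0.1% error budget already burned
(sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)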
Monitoring Microservices
Microservices create unique monitoring challenges. Here's what works:
Distributed Tracing
When a request fails, you need to know which service caused it:
# Jaeger configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-config
data:
  jaeger.yaml: |
    sampling:
      default_strategy:
        type: probabilistic
        param: 0.1  # Sample 10% of traces
Service Mesh Monitoring
If you're using Istio or Linkerd, you get monitoring for free:
# Istio metrics
istioctl proxy-config cluster productpage-v1-123456
# Service-to-service success rate
sum(rate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m]))
Golden Signals for Each Service
Every service should expose the four golden signals:
- Latency - how long requests take
- Traffic - how many requests it's serving
- Errors - how many requests fail
- Saturation - how close it is to capacity
Common Monitoring Mistakes
1. Monitoring everything
More metrics ≠ better monitoring. Focus on what matters.
2. Alerting on causes instead of symptoms
Alert on "users can't log in," not "login service CPU is high."
3. No runbooks
Every alert should have a runbook explaining what to do.
4. Testing in prod only
Monitor your staging environment the same way as production.
5. Forgetting about dependencies
Your app might be healthy, but what about the database? The load balancer? External APIs?
Real-World Scenarios
"The site is slow"
Step 1: Check the business metrics
Step 2: Look at application metrics
Step 3: Check infrastructure
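For the three steps above, here are starting-point queries built entirely from the metrics introduced earlier:
# PromQL (sketch)
# Step 1: did request volume drop?
sum(rate(http_requests_total[5m]))
# Step 2: latency and error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Step 3: is the host itself struggling?
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)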
"Customers can't checkout"
Step 1: Payment service health
Step 2: Database health
Step 3: Upstream dependencies
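A sketch of where I'd look for each step, assuming the payment service is scraped with job="payment", the database runs postgres_exporter, and the payment provider is probed with blackbox_exporter (all three are assumptions about your setup):
# PromQL (sketch; label values and exporters are assumptions)
# Step 1: payment service error rate
sum(rate(http_requests_total{job="payment", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="payment"}[5m]))
# Step 2: database connection count (postgres_exporter)
sum(pg_stat_database_numbackends)
# Step 3: can we even reach the payment provider? (blackbox_exporter)
probe_success{instance="https://api.payment-provider.example.com/health"}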
Monitoring in Different Environments
Kubernetes Monitoring
Essential metrics for K8s:
# Pod resource usage
container_memory_usage_bytes / container_spec_memory_limit_bytes
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
# Node resource usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Cloud Monitoring
Each cloud provider has their own monitoring service:
AWS CloudWatch:
# Create custom metric
aws cloudwatch put-metric-data --namespace "MyApp" --metric-data MetricName=ErrorRate,Value=0.05,Unit=Percent
Azure Monitor:
# Query logs
az monitor log-analytics query --workspace myworkspace --analytics-query "requests | where resultCode >= 400"
GCP Monitoring:
# Create alert policy
gcloud alpha monitoring policies create --policy-from-file=policy.yaml
Cost-Effective Monitoring
Monitoring can get expensive fast. Here's how to keep costs down:
1. Smart retention policies
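For Prometheus this usually means short local retention plus remote_write to cheaper long-term storage for the series you actually need later (the retention value and URL below are examples):
# bash
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=15d

# prometheus.yml
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"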
2. Sampling for high-volume services
# Sample 1% of traces
sampling:
  type: probabilistic
  param: 0.01
3. Use recording rules
Pre-compute expensive queries:
groups:
  - name: my-app-rules
    rules:
      - record: job:request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (job)
4. Alert on trends, not spikes
# Alert on sustained high error rate, not brief spikes
expr: rate(http_errors_total[5m]) > 0.05
for: 5m
The Monitoring Workflow
Here's my process for implementing monitoring on a new service:
1. Define what "working" means
2. Implement basic metrics
3. Set up dashboards
4. Create alerts
5. Write runbooks
6. Test and iterate
Tools and Libraries
Application metrics (choose one):
- prometheus_client (Python)
- prom-client (Node.js)
- client_golang (Go)
- Micrometer (Java/Spring)
System monitoring:
# Essential exporters
docker run -d --name=node-exporter -p 9100:9100 prom/node-exporter
docker run -d --name=cadvisor -p 8080:8080 gcr.io/cadvisor/cadvisor
Log shipping:
# Fluent Bit for lightweight log forwarding
docker run -d --name=fluent-bit -v /var/log:/var/log fluent/fluent-bit
When Things Go Wrong
Incident response with good monitoring: the alert tells you what broke and links to a runbook, the runbook points at the right dashboard, the dashboard narrows the blast radius, and logs or traces pinpoint the cause. Without that chain, you're grepping logs at 3 a.m. and guessing.
Advanced Monitoring Concepts
Synthetic Monitoring
Monitor your service from the user's perspective:
# Blackbox exporter config
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
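The module only defines how to probe; Prometheus still needs a scrape job that points targets at the exporter (the exporter address and target URL below are placeholders):
# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://www.example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115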
Anomaly Detection
Use machine learning to detect unusual patterns:
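You don't need actual ML to get value here; a statistical baseline in PromQL already catches a lot of "this doesn't look normal" cases. A sketch, reusing the job:request_rate:5m recording rule from the cost section:
# PromQL (sketch): request rate more than 3 standard deviations from its daily baseline
abs(job:request_rate:5m - avg_over_time(job:request_rate:5m[1d])) > 3 * stddev_over_time(job:request_rate:5m[1d])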
Chaos Engineering
Monitor how your system behaves under failure:
# Chaos Monkey for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaoskube
data:
  chaoskube.yaml: |
    interval: 10m
    dryRun: false
    metrics-addr: 0.0.0.0:8080
The Reality Check
Here's what I've learned after years of implementing monitoring:
Start simple. You don't need a perfect monitoring setup on day one. Get the basics right first.
Monitor user impact, not infrastructure. Your users don't care if CPU is at 90% if the site is still fast.
Alert fatigue is real. Better to have fewer, more meaningful alerts than hundreds of noisy ones.
Documentation matters. Every alert needs a runbook. Every dashboard needs context.
Test your monitoring. If you can't simulate the problem, you can't verify the alert works.
Monitoring is a team sport. Get input from developers, ops, and business stakeholders.
What Actually Works
After implementing monitoring at various places, here's my proven recipe:
- Prometheus + Grafana for metrics, Loki or ELK for logs, tracing only if you run microservices
- RED and USE dashboards for every service
- A small set of tiered alerts, each with a runbook and a dashboard link
- SLOs and an error budget so you know when "good enough" is good enough
The goal isn't perfect monitoring - it's monitoring that helps you sleep better at night and respond faster when things go wrong.
What's the worst monitoring false alarm you've dealt with? Drop a comment - we've all been there.
Next week: Docker security scanning and container image best practices. Because vulnerabilities don't announce themselves.