Monitoring That Actually Works

Back when I worked as an SRE, PagerDuty alerts gave me far more nightmares than Slack notifications ever did. My phone would explode with the same alerts, usually at odd hours.

"Database connection pool exhausted"
"API response time above threshold"
"Memory usage critical"
"Disk space low"

These can turn your nights into a disaster if you don't keep them in check. That's how I learned the difference between monitoring and noise. Most of us are drowning in the latter.

The Alert That Cried Wolf

Here's the thing about monitoring - if everything is critical, nothing is critical. I've seen teams get so desensitized to alerts that they miss actual outages because they're buried under false positives.

The golden rule: Only alert on things that require immediate human action.

This means:

  • ✅ User-facing service is down
  • ✅ Payment processing failed
  • ✅ Database replication lag is increasing
  • ❌ CPU usage hit 80% for 30 seconds
  • ❌ Disk space is at 70%
  • ❌ Memory usage is high but stable
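
For example, a page-worthy alert tied directly to user impact might look like the sketch below. It's written as a Prometheus alerting rule; the blackbox exporter job and the shop.example.com target are assumptions for illustration, not a real setup.

# Hypothetical alert rule: page only when users are actually affected
- alert: UserFacingServiceDown
  # probe_success comes from the blackbox exporter; job and target names are assumed
  expr: probe_success{job="blackbox", instance="https://shop.example.com"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "shop.example.com is failing external health probes"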


Metrics That Actually Matter

After dealing with monitoring for years, I've narrowed it down to four categories that cover 95% of what you need:

1. The RED Method (for services)

Rate - How many requests per second
Errors - How many of those requests failed
Duration - How long those requests took

# prometheus
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Duration (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
        

2. The USE Method (for resources)

Utilization - How busy the resource is
Saturation - How much work is queued
Errors - Count of error events

# prometheus
# CPU utilization
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk saturation
node_disk_io_time_seconds_total
        

3. Business Metrics

This is where most teams fail. You need metrics that business people care about:

  • Revenue per minute
  • Active users
  • Conversion rates
  • Cart abandonment
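
If your application already exports counters for these, turning them into alerts is mostly Prometheus plumbing. Here's a minimal sketch as a rule file; the orders_total counter is a hypothetical metric your app would need to expose:

# Hypothetical rules on a business counter (orders_total is an assumed app metric)
groups:
  - name: business-metrics
    rules:
      - record: business:orders_per_minute:rate5m
        expr: sum(rate(orders_total[5m])) * 60
      - alert: OrderRateDropped
        # Compare against the same time yesterday to allow for daily traffic patterns
        expr: business:orders_per_minute:rate5m < 0.5 * (business:orders_per_minute:rate5m offset 1d)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Order rate is less than half of yesterday's at this time"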

4. Infrastructure Health

Basic stuff, but critical:

  • Service uptime
  • Database connections
  • Queue depths
  • Certificate expiration
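
That last one bites more teams than it should. A hedged sketch of a certificate-expiry alert, assuming you already probe your endpoints with the blackbox exporter:

# Warn two weeks before a TLS certificate expires (assumes blackbox exporter probes)
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"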


Alerting Rules That Don't Suck

Here's my framework for writing alerts that people actually respond to:

1. Make it actionable
Bad: "High CPU usage"
Good: "API response time above 500ms for 5 minutes - check application logs and consider scaling"

2. Include context

# YAML
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
    runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.company.com/d/service-overview"
        

3. Use tiered alerting
Not every problem needs to wake someone up:

  • Critical (page immediately): User-facing outage
  • Warning (Slack during business hours): Performance degradation
  • Info (email digest): Resource usage trends
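
In Alertmanager terms, those tiers map onto routes keyed by a severity label. A minimal routing sketch; the receiver names and the integrations behind them are assumptions:

# Alertmanager routing sketch: severity decides who gets woken up
route:
  receiver: email-digest          # default catch-all (info tier)
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # pages immediately
    - match:
        severity: warning
      receiver: slack-alerts      # Slack channel during business hours
receivers:
  - name: pagerduty-oncall
  - name: slack-alerts
  - name: email-digest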


The Monitoring Stack That Works

After trying everything from Nagios to DataDog, here's what I recommend:

For Metrics: Prometheus + Grafana

Why Prometheus:

  • Pull-based model is more reliable
  • Great query language (PromQL)
  • Built-in alerting
  • Huge ecosystem

Basic Prometheus setup:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
        

Essential exporters:

  • node_exporter for system metrics
  • cadvisor for container metrics
  • postgres_exporter for database metrics
  • nginx-prometheus-exporter for web server metrics

For Logs: ELK Stack or Loki

ELK (Elasticsearch, Logstash, Kibana) if you need complex log analysis
Loki + Grafana if you want something simpler and cheaper
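
If you go the Loki route, log shipping can be as small as a single Promtail config. A minimal sketch, assuming Loki is reachable at loki:3100 and you only want host logs from /var/log:

# Minimal Promtail config (hostnames, ports, and paths are assumptions)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml   # where Promtail remembers how far it has read
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log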

For Traces: Jaeger or Zipkin

Only if you're doing microservices and need to trace requests across services.


Setting Up Meaningful Dashboards

I see too many dashboards that look impressive but tell you nothing useful. Here's how to build dashboards people actually use:

1. Start with the user journey

  • Login success rate
  • Page load times
  • Transaction completion rate

2. Add infrastructure context

  • Response time vs CPU usage
  • Error rate vs deployment events
  • Traffic patterns vs resource usage

3. Use the inverted pyramid

  • Top level: Business metrics and SLIs
  • Middle: Service-level metrics (RED method)
  • Bottom: Infrastructure metrics (USE method)

Sample Grafana dashboard structure:

Row 1: Business KPIs
- Revenue/hour
- Active users
- Error budget remaining

Row 2: Application Health
- Request rate
- Error rate  
- Response time (p50, p95, p99)

Row 3: Infrastructure
- CPU/Memory usage
- Database connections
- Queue depths
        

SLIs and SLOs That Make Sense

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) sound fancy, but they're just a way to define "good enough."

Good SLIs:

  • 95% of requests complete in under 200ms
  • 99.9% of requests return without error
  • 99% uptime over a 30-day period

Bad SLIs:

  • CPU usage stays under 80%
  • Zero errors ever
  • 100% uptime

Error budgets are your friend: If your SLO is 99.9% uptime, you have 43 minutes of downtime per month. Use it to take risks and deploy new features.
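
Computing an availability SLI and the remaining error budget is mostly recording rules. A minimal sketch, assuming the same http_requests_total metric used earlier and a 99.9% target:

# Availability SLI plus rough 30-day error-budget remaining (99.9% target assumed)
groups:
  - name: slo-rules
    rules:
      - record: sli:availability:ratio_30d
        expr: sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
      - record: slo:error_budget_remaining:ratio
        # 1 means the full budget is left, 0 means it is spent
        expr: (sli:availability:ratio_30d - 0.999) / (1 - 0.999)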


Monitoring Microservices

Microservices create unique monitoring challenges. Here's what works:

Distributed Tracing

When a request fails, you need to know which service caused it:

# Jaeger configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-config
data:
  jaeger.yaml: |
    sampling:
      default_strategy:
        type: probabilistic
        param: 0.1  # Sample 10% of traces
        

Service Mesh Monitoring

If you're using Istio or Linkerd, you get monitoring for free:

# Istio metrics
istioctl proxy-config cluster productpage-v1-123456

# Service-to-service success rate
sum(rate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m])) 
/ sum(rate(istio_requests_total{reporter="destination"}[5m]))
        

Golden Signals for Each Service

Every service should expose:

  • /health - Is the service running?
  • /ready - Is the service ready to take traffic?
  • /metrics - Prometheus metrics endpoint
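
In Kubernetes, the first two wire directly into liveness and readiness probes, and the third into whatever scrape discovery you use. A hedged sketch of the relevant Deployment pieces; the port, image, and annotation-based discovery are assumptions:

# Probe and scrape sketch (port 8080 and annotation-based discovery assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"   # only meaningful if your scrape config honors it
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: my-app
          image: my-app:latest         # hypothetical image
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080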

Common Monitoring Mistakes

1. Monitoring everything
More metrics ≠ better monitoring. Focus on what matters.

2. Alerting on causes, not symptoms
Alert on "users can't log in" not "login service CPU is high."

3. No runbooks
Every alert should have a runbook explaining what to do.

4. Testing in prod only
Monitor your staging environment the same way as production.

5. Forgetting about dependencies
Your app might be healthy, but what about the database? The load balancer? External APIs?


Real-World Scenarios

"The site is slow"

Step 1: Check the business metrics

  • Page load times from user perspective
  • Conversion rates dropping?

Step 2: Look at application metrics

  • Response times by endpoint
  • Error rates increasing?

Step 3: Check infrastructure

  • Database query times
  • CPU/memory on app servers

"Customers can't checkout"

Step 1: Payment service health

  • Payment API response times
  • Payment gateway availability

Step 2: Database health

  • Connection pool status
  • Transaction rollback rates

Step 3: Upstream dependencies

  • Third-party payment processor status
  • CDN performance


Monitoring in Different Environments

Kubernetes Monitoring

Essential metrics for K8s:

# Pod resource usage
container_memory_usage_bytes / container_spec_memory_limit_bytes

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

# Node resource usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
        

Cloud Monitoring

Each cloud provider has its own monitoring service:

AWS CloudWatch:

# Create custom metric
aws cloudwatch put-metric-data --namespace "MyApp" --metric-data MetricName=ErrorRate,Value=0.05,Unit=Percent
        

Azure Monitor:

# Query logs
az monitor log-analytics query --workspace myworkspace --analytics-query "requests | where resultCode >= 400"
        

GCP Monitoring:

# Create alert policy
gcloud alpha monitoring policies create --policy-from-file=policy.yaml
        

Cost-Effective Monitoring

Monitoring can get expensive fast. Here's how to keep costs down:

1. Smart retention policies

  • High-resolution metrics: 7 days
  • Medium resolution: 30 days
  • Low resolution: 1 year
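
Vanilla Prometheus only has a single local retention knob; tiering usually means pairing short local retention with recording rules or remote storage for the long tail. The local knob is a startup flag rather than prometheus.yml, so in a container it lands in the args (a fragment sketch; the 7d value is just the assumption from above):

# Container args fragment: keep high-resolution local data for 7 days
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=7d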

2. Sampling for high-volume services

# Sample 1% of traces
sampling:
  type: probabilistic
  param: 0.01
        

3. Use recording rules
Pre-compute expensive queries:

groups:
- name: my-app-rules
  rules:
  - record: job:request_rate:5m
    expr: sum(rate(http_requests_total[5m])) by (job)
        

4. Alert on trends, not spikes

# Alert on sustained high error rate, not brief spikes
expr: rate(http_errors_total[5m]) > 0.05
for: 5m
        

The Monitoring Workflow

Here's my process for implementing monitoring on a new service:

1. Define what "working" means

  • What's the user experience?
  • What are the critical user journeys?

2. Implement basic metrics

  • Health check endpoint
  • Basic RED/USE metrics
  • Business metrics

3. Set up dashboards

  • Start simple with key metrics
  • Add detail as needed

4. Create alerts

  • Start with critical user-facing issues
  • Add more as you understand the system

5. Write runbooks

  • Document what each alert means
  • Include investigation steps

6. Test and iterate

  • Trigger alerts intentionally
  • Improve based on real incidents


Tools and Libraries

Application metrics (choose one):

  • Prometheus client libraries (Go, Python, Java, etc.)
  • StatsD for simple counting
  • OpenTelemetry for modern apps

System monitoring:

# Essential exporters
docker run -d --name=node-exporter -p 9100:9100 prom/node-exporter
docker run -d --name=cadvisor -p 8080:8080 gcr.io/cadvisor/cadvisor
        

Log shipping:

# Fluent Bit for lightweight log forwarding  
docker run -d --name=fluent-bit -v /var/log:/var/log fluent/fluent-bit
        

When Things Go Wrong

Incident response with good monitoring:

  1. Get alerted to the right thing - User impact, not infrastructure noise
  2. Quick diagnosis - Dashboards show you what's broken
  3. Root cause analysis - Logs and traces show you why
  4. Fix verification - Metrics confirm the fix worked
  5. Post-mortem - Use monitoring data to understand what happened


Advanced Monitoring Concepts

Synthetic Monitoring

Monitor your service from the user's perspective:

# Blackbox exporter config
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
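
The module above does nothing until Prometheus is pointed at it. A companion scrape config sketch; the exporter address and probe target are assumptions:

# Scrape config that drives the http_2xx module (addresses are assumptions)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://shop.example.com    # the URL you want probed from the outside
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # where the exporter itself runs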
        

Anomaly Detection

Use machine learning to detect unusual patterns:

  • Prometheus AlertManager with webhook integration
  • Cloud provider AI/ML services
  • Third-party tools like DataDog or New Relic
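
The simplest integration point is Alertmanager's generic webhook receiver, which can forward firing alerts to whatever detection service you run. A minimal sketch; the detector URL is hypothetical:

# Alertmanager receiver sketch: forward alerts to an external anomaly service
receivers:
  - name: anomaly-detector
    webhook_configs:
      - url: http://anomaly-detector.internal:8080/alerts   # hypothetical endpoint
        send_resolved: true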

Chaos Engineering

Monitor how your system behaves under failure:

# Chaos Monkey for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaoskube
data:
  chaoskube.yaml: |
    interval: 10m
    dryRun: false
    metrics-addr: 0.0.0.0:8080
        

The Reality Check

Here's what I've learned after years of implementing monitoring:

Start simple. You don't need a perfect monitoring setup on day one. Get the basics right first.

Monitor user impact, not infrastructure. Your users don't care if CPU is at 90% if the site is still fast.

Alert fatigue is real. Better to have fewer, more meaningful alerts than hundreds of noisy ones.

Documentation matters. Every alert needs a runbook. Every dashboard needs context.

Test your monitoring. If you can't simulate the problem, you can't verify the alert works.

Monitoring is a team sport. Get input from developers, ops, and business stakeholders.

What Actually Works

After implementing monitoring at various places, here's my proven recipe:

  1. Start with business metrics - What do users and stakeholders care about?
  2. Add basic service health - RED method for services, USE method for resources
  3. Create actionable alerts - Only alert on things that need immediate action
  4. Build focused dashboards - Show what matters, hide what doesn't
  5. Write runbooks - Every alert should have clear next steps
  6. Test everything - Simulate failures to verify your monitoring works

The goal isn't perfect monitoring - it's monitoring that helps you sleep better at night and respond faster when things go wrong.

What's the worst monitoring false alarm you've dealt with? Drop a comment - we've all been there.


Next week: Docker security scanning and container image best practices. Because vulnerabilities don't announce themselves.
