Monitoring That Actually Works
Back when I worked as an SRE, PagerDuty alerts gave me far more nightmares than any Slack notification ever did. My phone would blow up with the same pages, usually at odd hours:
"Database connection pool exhausted" "API response time above threshold" "Memory usage critical" "Disk space low"
Alerts like these will turn your nights into a disaster if you don't keep them in check. That's how I learned the difference between monitoring and noise. Most of us are drowning in the latter.
The Alert That Cried Wolf
Here's the thing about monitoring - if everything is critical, nothing is critical. I've seen teams get so desensitized to alerts that they miss actual outages because they're buried under false positives.
The golden rule: Only alert on things that require immediate human action.
This means:
- If it can wait until morning, it's a ticket, not a page
- If no human action is required, it belongs on a dashboard, not in PagerDuty
- If an alert fires and nobody ever acts on it, delete it
Metrics That Actually Matter
After dealing with monitoring for years, I've narrowed it down to four categories that cover 95% of what you need:
1. The RED Method (for services)
Rate - How many requests per second
Errors - How many of those requests failed
Duration - How long those requests took
# prometheus
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Duration (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
2. The USE Method (for resources)
Utilization - How busy the resource is
Saturation - How much work is queued
Errors - Count of error events
# PromQL
# CPU utilization
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
node_disk_io_time_seconds_total
3. Business Metrics
This is where most teams fail. You need metrics that business people care about:
- Revenue or orders per hour
- Active users
- Conversion on the flows that make money (signup, checkout)
- Error budget remaining
A sketch of how to query these is below.
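These depend on your own instrumentation; the counter names below (orders_total, revenue_cents_total) are hypothetical, but once your app exposes them the queries are no harder than the RED ones.
# PromQL (sketch; orders_total and revenue_cents_total are hypothetical app counters)
# Orders per minute
sum(rate(orders_total[5m])) * 60
# Revenue per hour (counter kept in cents)
sum(rate(revenue_cents_total[1h])) * 3600 / 100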
4. Infrastructure Health
Basic stuff, but critical:
- Disk space and disk I/O
- CPU and memory
- Network errors
- Whether the host (and its exporter) is up at all
A sample disk-space alert is sketched below.
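For example, a low-disk-space alert built on node_exporter metrics might look like this (the 10% threshold and the severity label are assumptions, tune them for your fleet):
# YAML (sketch)
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Less than 10% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})"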
Alerting Rules That Don't Suck
Here's my framework for writing alerts that people actually respond to:
1. Make it actionable
Bad: "High CPU usage"
Good: "API response time above 500ms for 5 minutes - check application logs and consider scaling"
2. Include context
#YAML
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
    runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.company.com/d/service-overview"
3. Use tiered alerting
Not every problem needs to wake someone up:
- Critical: user-facing impact, needs action now - page the on-call
- Warning: needs attention soon - send it to Slack or a ticket queue
- Info: nice to know - keep it on a dashboard
A minimal Alertmanager routing sketch for this is below.
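Here's roughly what that routing looks like in Alertmanager, assuming your alert rules set a severity label (receiver names, keys, and channels are placeholders):
# alertmanager.yml (sketch)
route:
  receiver: slack-warnings          # default: non-urgent alerts go to Slack
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # only critical alerts page a human
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<your-webhook>"
        channel: "#alerts"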
The Monitoring Stack That Works
After trying everything from Nagios to DataDog, here's what I recommend:
For Metrics: Prometheus + Grafana
Why Prometheus:
- Pull-based model that's simple to reason about and debug
- PromQL is genuinely powerful once it clicks
- Huge ecosystem of exporters for almost anything
- Open source, and it pairs naturally with Grafana and Alertmanager
Basic Prometheus setup:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
Essential exporters:
- node_exporter for host metrics (CPU, memory, disk)
- cAdvisor for container metrics
- blackbox_exporter for probing endpoints from the outside
- An exporter for your database (postgres_exporter, mysqld_exporter, etc.)
For Logs: ELK Stack or Loki
ELK (Elasticsearch, Logstash, Kibana) if you need complex log analysis.
Loki + Grafana if you want something simpler and cheaper.
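If you go the Loki route, shipping logs is usually just Promtail with a config along these lines (the Loki URL and log paths are placeholders):
# promtail-config.yaml (sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log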
For Traces: Jaeger or Zipkin
Only if you're doing microservices and need to trace requests across services.
Setting Up Meaningful Dashboards
I see too many dashboards that look impressive but tell you nothing useful. Here's how to build dashboards people actually use:
1. Start with the user journey - can users actually do the thing they came to do right now?
2. Add infrastructure context - the services and resources those journeys depend on.
3. Use the inverted pyramid - most important information at the top, details further down.
Sample Grafana dashboard structure:
Row 1: Business KPIs
- Revenue/hour
- Active users
- Error budget remaining
Row 2: Application Health
- Request rate
- Error rate
- Response time (p50, p95, p99)
Row 3: Infrastructure
- CPU/Memory usage
- Database connections
- Queue depths
SLIs and SLOs That Make Sense
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) sound fancy, but they're just a way to define "good enough."
Good SLIs: request latency (p95/p99), error rate, availability as users actually experience it.
Bad SLIs: CPU usage, memory usage, anything internal that users never feel directly.
Error budgets are your friend: If your SLO is 99.9% uptime, you have 43 minutes of downtime per month. Use it to take risks and deploy new features.
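In Prometheus terms, you can track that budget with the same request metrics used earlier. This is a sketch assuming a 99.9% SLO; 30-day range queries are expensive, so in practice you'd back them with recording rules:
# PromQL (sketch)
# Availability SLI over the last 30 days
sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
# Fraction of the 0.1% error budget already burned
(sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)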
Monitoring Microservices
Microservices create unique monitoring challenges. Here's what works:
Distributed Tracing
When a request fails, you need to know which service caused it:
# Jaeger configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-config
data:
  jaeger.yaml: |
    sampling:
      default_strategy:
        type: probabilistic
        param: 0.1  # Sample 10% of traces
Service Mesh Monitoring
If you're using Istio or Linkerd, you get monitoring for free:
# Istio metrics
istioctl proxy-config cluster productpage-v1-123456
# Service-to-service success rate
sum(rate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m]))
Golden Signals for Each Service
Every service should expose the four golden signals:
- Latency - how long requests take
- Traffic - how many requests it's serving
- Errors - how many requests fail
- Saturation - how close it is to capacity
Common Monitoring Mistakes
1. Monitoring everything
More metrics ≠ better monitoring. Focus on what matters.
2. Alerting on causes instead of symptoms
Alert on "users can't log in," not "login service CPU is high."
3. No runbooks
Every alert should have a runbook explaining what to do.
4. Testing in prod only
Monitor your staging environment the same way as production.
5. Forgetting about dependencies
Your app might be healthy, but what about the database? The load balancer? External APIs?
Real-World Scenarios
"The site is slow"
Step 1: Check the business metrics
Step 2: Look at application metrics
Step 3: Check infrastructure
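For the three steps above, here are starting-point queries built entirely from the metrics introduced earlier:
# PromQL (sketch)
# Step 1: did request volume drop?
sum(rate(http_requests_total[5m]))
# Step 2: latency and error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Step 3: is the host itself struggling?
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)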
"Customers can't checkout"
Step 1: Payment service health
Step 2: Database health
Step 3: Upstream dependencies
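A sketch of where I'd look for each step, assuming the payment service is scraped with job="payment", the database runs postgres_exporter, and the payment provider is probed with blackbox_exporter (all three are assumptions about your setup):
# PromQL (sketch; label values and exporters are assumptions)
# Step 1: payment service error rate
sum(rate(http_requests_total{job="payment", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="payment"}[5m]))
# Step 2: database connection count (postgres_exporter)
sum(pg_stat_database_numbackends)
# Step 3: can we even reach the payment provider? (blackbox_exporter)
probe_success{instance="https://api.payment-provider.example.com/health"}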
Monitoring in Different Environments
Kubernetes Monitoring
Essential metrics for K8s:
# Pod resource usage
container_memory_usage_bytes / container_spec_memory_limit_bytes
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
# Node resource usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Cloud Monitoring
Each cloud provider has their own monitoring service:
AWS CloudWatch:
# Create custom metric
aws cloudwatch put-metric-data --namespace "MyApp" --metric-data MetricName=ErrorRate,Value=0.05,Unit=Percent
Azure Monitor:
# Query logs
az monitor log-analytics query --workspace myworkspace --analytics-query "requests | where resultCode >= 400"
GCP Monitoring:
# Create alert policy
gcloud alpha monitoring policies create --policy-from-file=policy.yaml
Cost-Effective Monitoring
Monitoring can get expensive fast. Here's how to keep costs down:
1. Smart retention policies
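For Prometheus this usually means short local retention plus remote_write to cheaper long-term storage for the series you actually need later (the retention value and URL below are examples):
# bash
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=15d

# prometheus.yml
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"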
2. Sampling for high-volume services
# Sample 1% of traces
sampling:
  type: probabilistic
  param: 0.01
3. Use recording rules
Pre-compute expensive queries:
groups:
  - name: my-app-rules
    rules:
      - record: job:request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (job)
4. Alert on trends, not spikes
# Alert on sustained high error rate, not brief spikes
expr: rate(http_errors_total[5m]) > 0.05
for: 5m
The Monitoring Workflow
Here's my process for implementing monitoring on a new service:
1. Define what "working" means
2. Implement basic metrics
3. Set up dashboards
4. Create alerts
5. Write runbooks
6. Test and iterate
Tools and Libraries
Application metrics (choose one):
- prometheus_client (Python)
- prom-client (Node.js)
- client_golang (Go)
- Micrometer (Java/Spring)
System monitoring:
# Essential exporters
docker run -d --name=node-exporter -p 9100:9100 prom/node-exporter
docker run -d --name=cadvisor -p 8080:8080 gcr.io/cadvisor/cadvisor
Log shipping:
# Fluent Bit for lightweight log forwarding
docker run -d --name=fluent-bit -v /var/log:/var/log fluent/fluent-bit
When Things Go Wrong
Incident response with good monitoring: the alert tells you what broke and links to a runbook, the runbook points at the right dashboard, the dashboard narrows the blast radius, and logs or traces pinpoint the cause. Without that chain, you're grepping logs at 3 a.m. and guessing.
Advanced Monitoring Concepts
Synthetic Monitoring
Monitor your service from the user's perspective:
# Blackbox exporter config
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
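The module only defines how to probe; Prometheus still needs a scrape job that points targets at the exporter (the exporter address and target URL below are placeholders):
# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://www.example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115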
Anomaly Detection
Use machine learning to detect unusual patterns:
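You don't need actual ML to get value here; a statistical baseline in PromQL already catches a lot of "this doesn't look normal" cases. A sketch, reusing the job:request_rate:5m recording rule from the cost section:
# PromQL (sketch): request rate more than 3 standard deviations from its daily baseline
abs(job:request_rate:5m - avg_over_time(job:request_rate:5m[1d])) > 3 * stddev_over_time(job:request_rate:5m[1d])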
Chaos Engineering
Monitor how your system behaves under failure:
# Chaos Monkey for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaoskube
data:
  chaoskube.yaml: |
    interval: 10m
    dryRun: false
    metrics-addr: 0.0.0.0:8080
The Reality Check
Here's what I've learned after years of implementing monitoring:
Start simple. You don't need a perfect monitoring setup on day one. Get the basics right first.
Monitor user impact, not infrastructure. Your users don't care if CPU is at 90% if the site is still fast.
Alert fatigue is real. Better to have fewer, more meaningful alerts than hundreds of noisy ones.
Documentation matters. Every alert needs a runbook. Every dashboard needs context.
Test your monitoring. If you can't simulate the problem, you can't verify the alert works.
Monitoring is a team sport. Get input from developers, ops, and business stakeholders.
What Actually Works
After implementing monitoring at various places, here's my proven recipe:
- Prometheus + Grafana for metrics, Loki or ELK for logs, tracing only if you run microservices
- RED and USE dashboards for every service
- A small set of tiered alerts, each with a runbook and a dashboard link
- SLOs and an error budget so you know when "good enough" is good enough
The goal isn't perfect monitoring - it's monitoring that helps you sleep better at night and respond faster when things go wrong.
What's the worst monitoring false alarm you've dealt with? Drop a comment - we've all been there.
Next week: Docker security scanning and container image best practices. Because vulnerabilities don't announce themselves.