🔍 Day 19 of #100DaysOfCloud – Deep Dive into Prometheus Architecture for Cloud Native Monitoring

Monitoring modern cloud-native applications—especially on dynamic platforms like Kubernetes—requires scalable, flexible, and real-time observability tools. One of the most widely adopted tools in this space is Prometheus, known for its powerful time-series data handling, native Kubernetes support, and seamless integration with Grafana.

Today, I’m diving into Prometheus architecture, explaining how it collects, stores, and queries metrics, along with its core components like TSDB, exporters, service discovery, Push Gateway, and Alertmanager.


📊 Prometheus Architecture Overview

Here’s a high-level breakdown of how Prometheus functions as a robust monitoring system:

[Diagram: Prometheus Architecture]

🧠 1. Prometheus Server (Scraper)

At the heart of the architecture is the Prometheus server. It scrapes metrics from configured targets over HTTP, usually from the /metrics endpoint. This is a pull-based model, meaning Prometheus initiates the data collection at fixed intervals.

You define these scrape intervals and targets in the Prometheus configuration file (prometheus.yml). The targets could be anything from Kubernetes pods to Linux servers to external APIs.
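As a minimal sketch, a prometheus.yml with one scrape job might look like this (the job name and target address are placeholders):

global:
  scrape_interval: 15s              # how often Prometheus scrapes each target

scrape_configs:
  - job_name: "node"
    metrics_path: /metrics          # the default path, shown here for clarity
    static_configs:
      - targets: ["10.0.0.5:9100"]  # hypothetical Linux host exposing metrics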


🗂️ 2. Time Series Database (TSDB)

The scraped data is stored in Prometheus’s internal Time Series Database (TSDB). This database organizes the data in a time series format:

<metric_name>{<labels>} value [timestamp]
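A concrete sample in this format might look like the following (hypothetical values; the final field is a millisecond Unix timestamp):

http_requests_total{method="GET",code="200"} 1027 1700000000000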

Prometheus supports retention policies for TSDB to prevent storage overload:

  • Time-based retention: e.g., keep data for 15 days (default).
  • Size-based retention: e.g., set a max volume (like 50GB); older data is deleted once this limit is reached.
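Both policies are set with startup flags on the Prometheus server, for example:

prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB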


🔍 3. Service Discovery

Prometheus supports both static and dynamic service discovery:

  • Static config: For targets with fixed IPs or DNS (e.g., legacy VMs or APIs).
  • Kubernetes SD: Prometheus uses the Kubernetes API to auto-discover pods, services, or endpoints, adapting to changes caused by autoscaling and rolling updates.
  • File-based SD: You can also define targets in separate .json or .yaml files (e.g., using file_sd_configs).

This dynamic discovery is critical for cloud environments like AKS, where infrastructure is constantly changing.
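As a sketch, a pod-discovery job might look like this (the annotation-based filter is a common convention, not a requirement):

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                   # discover all pods via the Kubernetes API
    relabel_configs:
      # keep only pods that opt in with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"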


📦 4. Exporters

Prometheus doesn’t always scrape metrics directly from services. Instead, it relies on exporters—lightweight agents that expose system or application metrics in Prometheus format.

Examples:

  • Node Exporter for Linux system metrics like CPU, memory, disk I/O.
  • Kube-State-Metrics for Kubernetes object states.
  • Custom exporters for proprietary applications.

Configuration is done in prometheus.yml by specifying the exporter endpoint.
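A sketch of two exporter jobs (hostnames are placeholders; 9100 and 8080 are the usual default ports for Node Exporter and kube-state-metrics):

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node1:9100", "node2:9100"]
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]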


📤 5. Push Gateway

Prometheus is fundamentally a pull-based system, which isn’t ideal for short-lived jobs like Kubernetes CronJobs or ephemeral batch jobs.

That’s where Push Gateway comes in.

  • These short-lived jobs push their metrics to the gateway via HTTP.
  • Prometheus then pulls metrics from the Push Gateway’s /metrics endpoint.

Use case: pushing backup success/failure metrics or batch file processing stats.
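On the Prometheus side, the gateway is scraped like any other target; honor_labels: true preserves the job and instance labels that the pushing job set (9091 is the gateway's default port, and the hostname is a placeholder):

scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true              # keep labels attached by the pushing jobs
    static_configs:
      - targets: ["pushgateway:9091"]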


🚨 6. Alertmanager

Alertmanager handles alerts sent by Prometheus based on defined conditions.

  • Prometheus evaluates rules like:

groups:
  - name: cpu-alerts                # hypothetical group name
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage > 80        # cpu_usage stands in for a real CPU metric
        for: 5m                     # condition must hold 5 minutes before firing

  • If the condition is true, it triggers an alert and sends it to Alertmanager.
  • Alertmanager then routes notifications via email, Slack, PagerDuty, etc., based on receiver configurations.

This decoupling ensures that Prometheus focuses on monitoring, while Alertmanager handles communication and deduplication.
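A minimal alertmanager.yml sketch, assuming a Slack receiver (the webhook URL and channel are placeholders):

route:
  receiver: "team-slack"
  group_by: ["alertname"]           # group duplicate alerts that share a name

receivers:
  - name: "team-slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/PLACEHOLDER"
        channel: "#alerts"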


🔎 7. PromQL – Prometheus Query Language

PromQL is a flexible, powerful query language used to retrieve and visualize time-series data. You can use it:

  • In Prometheus web UI (localhost:9090).
  • In Grafana dashboards (Prometheus as a data source).
  • From the command line (e.g., promtool query instant, or the HTTP API) for scripting or troubleshooting.

Sample query:

rate(http_requests_total[5m])        

This returns the per-second request rate, averaged over the last 5 minutes.
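Frequently used queries can also be precomputed with a recording rule; a minimal sketch, with hypothetical group and rule names:

groups:
  - name: http-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)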


🧩 Summary

Prometheus is purpose-built for cloud-native monitoring and works exceptionally well with Kubernetes environments like AKS. Here’s a quick recap of how it works:

Prometheus pulls metrics from configured targets at defined intervals. The data is stored in a local time-series database, governed by retention policies. Targets are discovered either statically or dynamically via Kubernetes APIs. Exporters expose the metrics in Prometheus format, and batch jobs use Push Gateway to push their metrics. Alertmanager handles notifications, and PromQL helps query and visualize metrics.

🔚 Final Thoughts

In my work with AKS and Node.js microservices, Prometheus has helped us proactively monitor resource usage, identify performance bottlenecks, and set up real-time alerts for key metrics. It’s a foundational tool for observability in DevOps and SRE practices.

✅ If you’re deploying services in a distributed environment, integrating Prometheus + Grafana is a must-have step for visibility and peace of mind.


📌 Next up on Day 20: I’ll dive into Grafana integration with Prometheus and how to build real-time dashboards for cloud-native apps.

Let me know if you’ve used Prometheus in your projects—or if you want to see a hands-on setup guide!

#DevOps #Azure #100DaysOfCloud #Prometheus #CloudMonitoring #AKS #Observability #Grafana


