JAN MUSSLER
jan.mussler@zalando.de
Twitter: @JanMussler
zmon.io
#NETWAYS #OSMC 30-11-2016
ZMON
Open Source Monitoring in the Cloud
15 countries
19+ million active customers
160+ million visits per month
200k+ articles
3.0+ billion € revenue
~1,600 employees in tech
Europe's Leading Online Fashion Platform
Visit us: tech.zalando.com
Zalando’s Technology History
RADICAL AGILITY
AUTONOMY
➊ One AWS account per Team
➋ Deployment with Docker
➌ Managed SSH Access
➍ REST/OAuth 2.0 mandatory
➎ Traceability of changes
IN A NUTSHELL
STUPS
Internet
*.abc.example.org *.xyz.example.org
Team ABC Team XYZ
ISOLATED AWS ACCOUNTS
EC2 EC2
ELB ELB
EC2
RESPONSIBILITY
OWNERSHIP
Host Host
Service 4 Service 4
Host
Service 3 Service 3
Service 1 Service 1
Monitoring Team?
Service 2 Service 2
Monitoring the old way?
Team
Team
Team
Team
Build with teams and services in mind ...
Host Host Host
Service 4 Service 4 Service 4
Host
Team 3
Service 3 Service 3 Service 3
Team 2
Service 1 Service 1 Service 1
Team 1
Service 2 Service 2
ZMON.io
Flexible and extendable: Checks & Alerts in Python
Integrate: REST APIs, OAuth 2.0, Auto Discovery
Configurable via UI / API: no restarts required!
Great for teams: autonomy and responsibility
Fast/Scaling metrics: Redis, KairosDB + Grafana 3
ZMON - Highlights ;-)
OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by Jan Mussler
Good old green and red boxes?
Full authentication for all endpoints
OAuth 2.0 login flow (e.g. GitHub login)
“TV Tokens” for “read-only” dashboard login
Grafana 3 bundled and API implemented
Proxy for KairosDB (timeseries db)
ZMON Controller - User Interface and REST API
Display historic data using Grafana 3
Various options for notifications ...
E-Mail
Twilio (phone call)
PUSH
ENTITIES
● hosts, databases, applications, instances ...
● generic key value object
● 20000+ entities in our deployment
Entities
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
Entity "node01:8080"
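Because an entity is a generic key/value object, selecting a subset of entities can be thought of as matching a filter dictionary against each entity. The following is a minimal sketch of that idea, not ZMON's actual matching code; the helper name `matches` and the in-memory list are assumptions made for illustration.

```python
# Hypothetical helper: ZMON's real entity matching may behave differently.
def matches(entity, entity_filter):
    """Return True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


# Two entities modelled on the examples in this deck (fields shortened).
entities = [
    {"id": "node01:8080", "type": "instance", "host": "node01",
     "application_id": "zmon", "dc": "dc1"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost", "port": 5432},
]

# Select the ids of all entities of type "instance".
selected = [e["id"] for e in entities if matches(e, {"type": "instance"})]
print(selected)
```

A filter with several keys only matches entities that carry all of them, which is what makes free-form attributes such as "dc" usable for selection.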
Entity Service (part of controller)
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
Integrated easy-to-use entity store with REST API
Build your own discovery agent (K8S, …)
>zmon entities push local-postgres.yaml
CHECKS
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch,
Redis, SNMP/NRPE, TCP, Scalyr, Elasticsearch, …
○ Data filtering/formatting/pivoting
● returns "value" object -> dicts everywhere
Checks
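The "eval with custom context" idea can be illustrated with plain Python: the check command is an expression, and the worker supplies the names it may use. This is only a sketch under assumptions; `run_check` and `fake_http` are hypothetical stand-ins, and the real worker and its builtins work differently in detail.

```python
def run_check(command, entity, check_builtins):
    """Evaluate a check command (a Python expression) for one entity."""
    context = {"entity": entity}
    context.update(check_builtins)
    # An empty __builtins__ narrows what the expression can reach.
    # It is not a real sandbox.
    return eval(command, {"__builtins__": {}}, context)


def fake_http(url):
    """Stand-in for an HTTP builtin: returns canned data, performs no I/O."""
    return {"url": url, "status": 200, "time_ms": 12}


entity = {"id": "node01:8080", "type": "instance", "host": "node01"}

# The command returns a dict, in line with "dicts everywhere".
value = run_check(
    "{'code': http('http://' + entity['host'] + ':8080/health')['status']}",
    entity,
    {"http": fake_http},
)
print(value)
```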
SNMP and Nagios NRPE support
REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
Trial Run - Quick feedback and easier development
ALERTS
● Attached to a single check, inspect check result
● Defines team and responsible team
● Allows inheritance from other alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities (color) and tags
Alerts
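Since an alert is a Python expression over a check result that yields True or False, its evaluation can be sketched in a few lines. The field names `condition`, `priority` and `tags`, and the helper `evaluate_alert`, are assumptions chosen for this illustration rather than a statement of ZMON's schema.

```python
def evaluate_alert(condition, check_value):
    """Return True when the alert is active for the given check result."""
    # The check result is exposed to the expression under the name "value".
    return bool(eval(condition, {"__builtins__": {}}, {"value": check_value}))


check_value = {"code": 200, "time_ms": 1282}

# Hypothetical alert definition attached to a single check.
alert = {
    "name": "Slow response",
    "condition": "value['time_ms'] > 1000",
    "priority": 1,
    "tags": ["latency"],
}

active = evaluate_alert(alert["condition"], check_value)
print(alert["name"], "active" if active else "inactive")
```

There are only two outcomes here, active or not; severity is expressed through the priority rather than through a separate WARNING state.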
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Reuse existing checks for core infrastructure
Anyone can add alerts to checks
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
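One way to picture alert inheritance is as a child alert that overrides selected fields of a parent and takes the rest from it. The sketch below assumes a `parent_id` field and treats `None` as "not overridden"; both are illustrative assumptions, and ZMON's own inheritance rules may differ.

```python
def resolve_alert(alert, alerts_by_id):
    """Merge an alert with its ancestors; fields set on the child win."""
    parent_id = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    resolved = resolve_alert(alerts_by_id[parent_id], alerts_by_id)
    # Only fields the child actually sets (non-None) override the parent.
    resolved.update({k: v for k, v in alert.items() if v is not None})
    return resolved


# Hypothetical base alert owned by a platform team.
alerts_by_id = {
    1: {"id": 1, "parent_id": None, "name": "High latency",
        "condition": "value['time_ms'] > 1000", "team": "Platform", "priority": 2},
}

# A team reuses the condition but changes the owner and the priority.
child = {"id": 2, "parent_id": 1, "name": None, "condition": None,
         "team": "Team ABC", "priority": 1}

resolved = resolve_alert(child, alerts_by_id)
print(resolved)
```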
EXAMPLE
Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram, components shown twice: ELB, Provider (Java), ELB, Tokeninfo (Go), C* Nodes]
Will create “entities” to describe deployment
ELBs, ASGs, Application, instances,...
Crawls AWS API every 60 sec to update
ZMON AWS Agent - Auto Discovery
➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44
Example Instance Entity
➜ ~ zmon entities get "elb-data-service-cd79c9[aws:...:eu-central-1]"
id: elb-data-service-cd79c9[aws:...:eu-central-1]
type: elb
name: data-service-cd79c9
active_members: 5
cloudwatch_name: app/data-service-cd79c9/18b164bfa427486d
dns_name: data-service-cd79c9-961635181.eu-central-1.elb.amazonaws.com
dns_traffic: 'true'
dns_weight: 200
elb_type: application
members: 5
region: eu-central-1
scheme: internet-facing
Example ELB Entity
Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics
Monitoring Plan-B EC2 instances on AWS
Scalyr Agent
Log shipping
Prometheus
Node Agent
:9100/metrics
Taupage AMI (Ubuntu base)
Application Container
Go / Spring Boot / Cassandra
Docker runtime
:8080 -> app
:7979 -> metrics
Jolokia Request Example
Check Results
Check result - Grafana 3 link
AWS UI deep link
Monitor your deployments … data tagged with version
Annotated Metric Data in Grafana
HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
DEPLOYMENT
ZMON Core + UI + KairosDB
Scheduler (JVM)
Workers (Python)
Redis
Queue/State
Controller (Java)
PostgreSQL
Check/Alert definition
Entity data
KairosDB (Java)
Cassandra
Frontend (AngularJS)
CLI (Python)
Metric Cache
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
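The base filter and queue filters described above can be sketched as a small routing function: the base filter decides whether a scheduler handles an entity at all, and the queue filters decide which worker queue receives it. The queue names and the `route` helper are hypothetical; this is not the scheduler's actual implementation.

```python
def matches(entity, entity_filter):
    """Return True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def route(entities, base_filter, queue_filters, default_queue="default"):
    """Map queue name -> entity ids, honouring base and queue filters."""
    routed = {}
    for entity in entities:
        if not matches(entity, base_filter):
            continue  # this scheduler does not handle the entity at all
        # The first queue whose filter matches wins; otherwise the default queue.
        queue = next(
            (name for name, f in queue_filters.items() if matches(entity, f)),
            default_queue,
        )
        routed.setdefault(queue, []).append(entity["id"])
    return routed


entities = [
    {"id": "node01:8080", "dc": "dc1"},
    {"id": "node02:8080", "dc": "dc2"},
    {"id": "node03:8080"},
]
queue_filters = {"queue-dc1": {"dc": "dc1"}, "queue-dc2": {"dc": "dc2"}}

all_dcs = route(entities, {}, queue_filters)
only_dc1 = route(entities, {"dc": "dc1"}, queue_filters)
print(all_dcs)
print(only_dc1)
```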
ZMON in AWS / Multi DC Setup
*.foo.example.org *.bar.example.org
Team "Foo" Team "Bar"
[Diagram labels: EC2 Instance (x6), ZMON Appliance (x2), ZMON Data Service, ELB (x2)]
MICROSERVICES
Application metrics
Continued ...
Spring Boot (extending metrics)
https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask)
https://github.com/zalando/connexion
Clojure (Swagger first)
https://github.com/zalando-stups/friboo/
Scala Play
https://github.com/zalando-incubator/markscheider
Example libraries and framework support ...
Demo:
https://demo.zmon.io
ZMON and Slack:
https://zmon.io && https://slack.zmon.io
Documentation:
https://docs.zmon.io
Zalando Tech:
https://tech.zalando.com
Expose your data / Convention on key names/structure
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
"zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
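A consistent key convention makes such metrics easy to regroup inside a check. The sketch below assumes the keys follow the pattern prefix, status code, HTTP method, path segments, statistic name, as the example above suggests; joining the path segments with "/" and the helper name `group_metrics` are assumptions made for illustration.

```python
from collections import defaultdict


def group_metrics(metrics, prefix="zmon.response."):
    """Group flat metric keys by (status, method, path) -> {statistic: value}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        if not key.startswith(prefix):
            continue
        # Remaining parts: status, method, one or more path segments, statistic.
        status, method, *path, stat = key[len(prefix):].split(".")
        grouped[(int(status), method, "/" + "/".join(path))][stat] = value
    return dict(grouped)


metrics = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
}

grouped = group_metrics(metrics)
print(grouped)
```

With the data in this shape, one check can serve alerts on request counts, error codes and latency percentiles for every endpoint of an application.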
