Monitoring with no limits
Nikolay Tsvetkov
Nikolay Tsvetkov
Senior Software Engineer
Service Manager
At CERN since 2013
n.tsvetkov@cern.ch
1
Physics Lab
Since 1954
23 Member states
2500 Employees
CMS ATLAS
LHCb ALICE
Large Hadron Collider
Leading Physics Laboratory
2
Large Hadron Collider
~ 27 km long
~ 100 m underground
1.9 K (−271.3 °C) operating temperature
11,245 revolutions per second!
> 1 billion collisions per second
3
LHC Detectors
CMS / ATLAS / ALICE / LHCb
Heavier than the Eiffel Tower
CMS solenoid is the most powerful ever built:
o 4 tesla magnetic field, > 100 000 times the Earth’s
o Size: 6 m x 13 m
4
Detectors Data Taking
> 1 billion collisions per second
Filtered down to ~ 200 “interesting” events/s
Data flow from all 4 detectors ~ 25 GB/s
5
CERN Data Centre (DC)
o 15 000 servers
o 260 000 processor cores
o 130 000 disks and 30 000 magnetic tapes
o 340 petabytes of data permanently archived
o 115 petabytes of data written to magnetic tape in 2018 alone
6
WLCG
A community of 12,000 physicists:
• ~300,000 jobs running concurrently
• 170 sites
• 900,000 processing cores
• 700 PB storage available worldwide
• 15% of the resources are at CERN
• 20-40 Gbit/s links connect CERN to the Tier-1 sites
7
CERN IT Monitoring
Monitoring as a Service for
CERN Data Centre (DC), IT Services
and the WLCG collaboration
Collect, transport, store and process
metrics and logs for applications and
infrastructure
8
In 2016, MONIT was born to provide a better
monitoring infrastructure for CERN IT:
effective
scalable
sustainable
9
Challenges: Data rate & volume
o from ~ 40k machines
o > 3 TB/day (compressed)
o Input rate ~ 100 kHz
10
Challenges: Variety
Heterogeneous clients:
o IT Data Centre
o WLCG transfers
o Experiments
11
Challenges: Reliability
o spikes in rate and volume
o external service dependencies
12
Challenges: Non-technical
Migrate from legacy dashboards and tools
Stay up to date with upstream tools & trends
Build community, internal and external
13
Goals: Easy Data Integration
Flexible on schema requirements
JSON/HTTP gateways
o Integrate custom metrics, logs and alarms
Specific gateways
o Collectd, Prometheus, ActiveMQ, JDBC …
14
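A minimal sketch of what pushing a custom metric through a JSON/HTTP gateway could look like. The endpoint URL is invented; the "producer"/"type" fields follow the document format shown on the Connectors slide.

import requests

# Hypothetical MONIT-style HTTP gateway endpoint (not the real URL)
GATEWAY_URL = "https://monit-gateway.example.ch/metrics"

document = {
    "producer": "myproducer",   # who sends the data
    "type": "mytype",           # what kind of data it is
    "timestamp": 1567000000,    # epoch seconds
    "mymetricfield": "value",   # free-form payload: the schema is up to the producer
}

# The gateway is assumed to accept a list of JSON documents over HTTP
response = requests.post(GATEWAY_URL, json=[document], timeout=10)
response.raise_for_status()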
Goals: Data Pipeline
Schema independent
Data aggregation / enrichment functionality
Steering to the required storage backend
Fully based on open-source technologies
15
Architecture
[Architecture diagram: HTTP, JMS, JDBC and AVRO inputs flow through Kafka Connect (KC) and the processing layer into HDFS, InfluxDB and Elasticsearch (ES).]
16
Connectors
Source / sink for the data pipeline
Validation and simple data filtering
Metadata enrichment
HTTP / JMS / JDBC / AVRO
{
  "producer": "myproducer",
  "type": "mytype",
  ...
  "mymetricfield": "value"
}
17
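The validation and enrichment step can be pictured roughly like this. A minimal Python sketch, not the actual interceptor code; the required-field set and the added metadata are assumptions.

import socket
import time

REQUIRED_FIELDS = {"producer", "type"}  # assumed minimal contract, per the document format above

def validate_and_enrich(doc):
    """Drop documents missing required fields, add infrastructure metadata."""
    if not REQUIRED_FIELDS.issubset(doc):
        return None                                           # simple filtering: reject invalid records
    enriched = dict(doc)
    enriched.setdefault("timestamp", int(time.time() * 1000)) # ms since epoch, if the producer omitted it
    enriched["host"] = socket.gethostname()                   # metadata enrichment
    return enriched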
Connectors: Apache Flume
Protocol-based agents (sources and sinks):
o JDBC, JMS, HDFS, Elastic, HTTP, Kafka
Interceptor / Morphlines for event transformation
14 agent “types” in MONIT
> 200 instances
Scale horizontally
18
DC metrics producer
Running on > 40k machines in the Data Centre
Collectd daemon collects metrics / alarms locally
o Plugin-based, out-of-the-box OS monitoring
o Framework for implementing custom plugins
Local Flume agents for data buffering
19
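For the custom-plugin framework mentioned above, a minimal collectd Python plugin could look roughly like this. The metric itself (a queue-depth count) and its file path are invented for illustration; the plugin would still need to be enabled in collectd.conf.

# Minimal collectd Python plugin sketch, loaded via collectd's "python" plugin.
import collectd

def read_callback():
    # Invented example metric: number of lines in a hypothetical queue file
    try:
        with open("/var/spool/myapp/queue") as f:
            depth = sum(1 for _ in f)
    except OSError:
        depth = 0
    vals = collectd.Values(type="gauge", plugin="myapp", type_instance="queue_depth")
    vals.dispatch(values=[depth])   # hand the value to collectd's write plugins

collectd.register_read(read_callback)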
Transport layer
Backbone of our pipeline
o decouples producers / consumers
o enables stream processing
o resilient (72 hours data retention)
o reliable (3 replicas)
20
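A hedged sketch of how a producer could hand documents to the transport layer, using the confluent-kafka client; the broker addresses and topic name are made up.

import json
from confluent_kafka import Producer

# Hypothetical broker list
producer = Producer({"bootstrap.servers": "broker1:9092,broker2:9092"})

def send(doc, topic="monit.metrics.myproducer"):
    # Kafka decouples producers from consumers; delivery is asynchronous
    producer.produce(topic, value=json.dumps(doc).encode("utf-8"))

send({"producer": "myproducer", "type": "mytype", "mymetricfield": "value"})
producer.flush()  # block until outstanding messages are delivered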
Transport layer: Kafka cluster
On-premises (v1.0.2), based on OpenStack VMs
o 20 brokers
o ~ 15k partitions in total
o CEPH volume (2 TB each) as spool (be careful with storage latencies!)
o Rack-awareness: 1 replica per “availability zone”
21
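The retention and replication figures above translate into topic settings along these lines; a sketch using the confluent-kafka AdminClient, with an invented topic name and an illustrative partition count.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})   # hypothetical broker

# 3 replicas and 72 h retention, matching the transport-layer figures above
topic = NewTopic(
    "monit.metrics.myproducer",                # invented topic name
    num_partitions=12,                         # illustrative partition count
    replication_factor=3,
    config={"retention.ms": str(72 * 3600 * 1000)},
)
futures = admin.create_topics([topic])
futures["monit.metrics.myproducer"].result()   # raises if creation failed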
Processing: Processing platform
Transformation
o parsing, field extraction and filtering
Enrichment
o combine data from different sources
Correlation / Aggregation
o over time or other dimensions
o anomaly detection
22
Processing platform
[Diagram: users, a Mesos cluster (Marathon & Chronos) orchestrating the processing jobs, CERN IT Hadoop/HDFS and Gitlab CI.]
23
Processing platform
Logstash integrated for on-the-fly log transformation
Spark Structured Streaming in the lead role
o joining data streams easily
o handles late events
Running ~ 20 Spark production jobs (24/7)
24
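A hedged sketch (not the production code) of a Spark Structured Streaming job reading from Kafka, tolerating late events through a watermark and aggregating per producer over 5-minute windows. The schema, topic, broker and sink are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("monit-aggregation-sketch").getOrCreate()

# Assumed document schema; real MONIT documents are schema-free JSON
schema = StructType([
    StructField("producer", StringType()),
    StructField("type", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "monit.metrics.myproducer")     # invented topic
       .load())

docs = raw.select(from_json(col("value").cast("string"), schema).alias("doc")).select("doc.*")

# The watermark lets the engine accept events up to 10 minutes late before finalising a window
agg = (docs.withWatermark("timestamp", "10 minutes")
       .groupBy(window(col("timestamp"), "5 minutes"), col("producer"))
       .avg("value"))

query = (agg.writeStream.outputMode("append")
         .format("console")                                  # sink kept trivial for the sketch
         .option("truncate", "false")
         .start())
query.awaitTermination()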
Storage
Providing the right storage for each use-case
Integrating as data sources for visualization
Direct query access through APIs
Long-term data archive
25
Storage: InfluxDB
Time-series DB for storing metrics / alarms
o > 30 instances (due to the lack of cluster mode in the free version)
o Performance is tied to data cardinality
o Up to 15 years of retention policy (thanks to automatic down-sampling)
26
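The long retention works because old points are down-sampled; in InfluxQL terms that combination looks roughly like the statements below, issued here through the influxdb-python client. Host, database, measurement and policy names are invented.

from influxdb import InfluxDBClient   # influxdb-python client, assumed available

client = InfluxDBClient(host="influx.example.ch", port=8086, database="monit")  # hypothetical instance

# Keep raw points for 30 days, down-sampled points for ~15 years (780 weeks)
client.query('CREATE RETENTION POLICY "raw_30d" ON "monit" DURATION 30d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "downsampled_15y" ON "monit" DURATION 780w REPLICATION 1')

# Continuous query rolling raw points up into 1 h averages stored under the long retention policy
client.query(
    'CREATE CONTINUOUS QUERY "cq_1h_mean" ON "monit" BEGIN '
    'SELECT mean("value") AS "value" INTO "monit"."downsampled_15y"."mymeasurement" '
    'FROM "monit"."raw_30d"."mymeasurement" GROUP BY time(1h), * END'
)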
Storage: Elasticsearch
Distributed search and indexing engine
o 3 clusters (syslog, service logs and metrics)
o Stores TS data with high-cardinality fields
o ~ 100 TB (total storage at 1-month retention policy)
27
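A sketch of indexing a log document into a time-based index, assuming a recent elasticsearch-py (8.x) client; the cluster address, index naming and fields are invented, and the real clusters and mappings differ.

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://es-logs.example.ch:9200"])   # hypothetical cluster endpoint

doc = {
    "producer": "myproducer",
    "type": "logs",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "message": "something happened",     # high-cardinality fields are fine here
}

# Monthly indices make the ~1-month retention easy to enforce by dropping old indices
index_name = "monit-logs-{:%Y-%m}".format(datetime.now(timezone.utc))
es.index(index=index_name, document=doc)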
Storage: HDFS
Long-term data archive platform for Big Data analysis
o Kept forever (or per GDPR agreement)
o Compressed JSON / Parquet
o Partitioned by “date / producer / type”
28
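The date / producer / type partitioning could be produced with Spark roughly like this; the paths and the timestamp column are illustrative, not the actual archive layout.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("monit-archive-sketch").getOrCreate()

# Read a day's worth of raw compressed JSON (path is invented)
docs = spark.read.json("hdfs:///monit/raw/2019-06-01/*.json.gz")

# Derive the partition column and write compressed Parquet, partitioned as on the slide
(docs.withColumn("date", to_date(col("timestamp")))
     .write.mode("append")
     .partitionBy("date", "producer", "type")
     .option("compression", "snappy")
     .parquet("hdfs:///monit/archive"))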
Kafka Connect
Kafka framework for exchanging data with other systems
o Supports a variety of connector types
o HDFS, S3, Elasticsearch, Influx, …
o A single KC cluster handles different connectors
o Resilient & scalable
29
Kafka Connect cluster:
o 10 VMs, 44 topics, 880 tasks
o Writing to HDFS directly in Parquet (converting records from JSON)
o One connector per topic distributes well
o Compaction required afterwards (the files written are too small, since buffering a full block in memory is not practical)
30
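Connectors are registered through the Kafka Connect REST API; a hedged sketch of submitting one HDFS sink per topic. The option names follow the Confluent HDFS sink connector, but the exact keys depend on the connector version, and every name and URL here is invented.

import requests

CONNECT_URL = "http://kafka-connect.example.ch:8083"   # hypothetical KC REST endpoint

def register_hdfs_sink(topic):
    connector = {
        "name": f"hdfs-sink-{topic}",                  # one connector per topic, as on the slide
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "topics": topic,
            "tasks.max": "20",
            "hdfs.url": "hdfs://namenode.example.ch:9000",
            "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
            "flush.size": "10000",                     # small flushes -> many small files -> compaction later
        },
    }
    resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
    resp.raise_for_status()

register_hdfs_sink("monit.metrics.myproducer")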
Visualization
Grafana is a “first-class citizen”
• ~ 1000 dashboards over > 20 organizations
• Users are in charge of creating their own dashboards
Kibana data exploration
• Secured private endpoints for sensitive logs
SWAN for data-analysis (notebooks)
31
Monitoring of the Monitoring (MOM)
Second data pipeline for monitoring our own infrastructure
All MONIT metrics and logs are sent to both flows
Data de-duplicated and merged at the storage level
Using more external services for MOM to avoid replicating configuration problems (Kafka)
32
MOM Data Flow
[Diagram: metrics & logs are sent to both MONIT and MOM; both flows use the same HTTP / JMS / JDBC / AVRO inputs, Kafka Connect (KC) and HDFS, with MOM relying more on external services.]
33
Lessons Learned
The pipeline approach pays back
o reliable / resilient service
o decouple / buffers / stream processing
Kafka is a solid system backbone
Connectors & storage backends are the most
operationally expensive parts
34
What are the next steps?
Extend the Kafka Connect usage
Run the connectors on Kubernetes (K8s)
Spark on K8s for processing platform
Why not also look into KSQL?
Looking into alternative time-series DBs
35
MONIT keeps growing
In the last 12 months alone:
o + 30% new data producers (180 total)
o + 20% data volume per day (~ 3.2 TB/day total)
o + 400% new dashboards (1000 total)
o increase to ~ 1 000 000 queries/day
… more clients mean new challenges!
36
Summary
MONIT is a flexible, general-purpose monitoring
infrastructure
Easily implementable at a smaller scale
An approach that might serve other use-cases
outside of the MONIT scope
37
Thank you !
Spare slides
39
Alarms
Local on the machine
o Simple Threshold / Actuators
Grafana dashboard alarms
External (Spark, Kapacitor, custom sources …)
Integration with ticketing system
o ServiceNow
40
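As an illustration of the simple local threshold / actuator idea, a hedged sketch that turns a machine-local reading into an alarm document and posts it over the same JSON/HTTP integration used for metrics. The fields, threshold and endpoint are invented.

import os
import time
import requests

GATEWAY_URL = "https://monit-gateway.example.ch/alarms"   # hypothetical alarms endpoint

def check_disk_usage(path="/", threshold=0.90):
    """Return an alarm document if disk usage on `path` exceeds the threshold, else None."""
    st = os.statvfs(path)
    used_fraction = 1 - (st.f_bavail / st.f_blocks)
    if used_fraction < threshold:
        return None
    return {
        "producer": "myproducer",
        "type": "alarm",
        "timestamp": int(time.time()),
        "alarm_name": "disk_almost_full",
        "entity": path,
        "used_fraction": round(used_fraction, 3),
    }

alarm = check_disk_usage()
if alarm:
    requests.post(GATEWAY_URL, json=[alarm], timeout=10).raise_for_status()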