Monitoring with no limits
Nikolay Tsvetkov
Nikolay Tsvetkov
Senior Software Engineer
Service Manager
At CERN since 2013
n.tsvetkov@cern.ch
1
Physics Lab
Since 1954
23 Member states
2500 Employees
CMS ATLAS
LHCb ALICE
Large Hadron Collider
Leading Physics Laboratory
2
Large Hadron Collider
~ 27 km long
~ 100 m underground
1.9 K (−271.3 °C) operating temperature
11,245 revolutions per second!
> 1 billion collisions per second
3
LHC Detectors
CMS / ATLAS / ALICE / LHCb
Heavier than the Eiffel Tower
CMS solenoid is the most powerful ever built:
o 4 tesla magnetic field, > 100 000 times the Earth’s
o Size: 6 m x 13 m
4
Detectors Data Taking
> 1 billion collisions per second
Filtered down to ~ 200 “interesting” events/s
Data flow from all 4 detectors ~ 25 GB/s
5
CERN Data Centre (DC)
o 15 000 servers
o 260 000 processor cores
o 130 000 disks and 30 000 magnetic tapes
o 340 petabytes of data permanently archived
o 115 petabytes of data written to magnetic tape in 2018 alone
6
WLCG
A community of 12,000 physicists:
• ~300,000 jobs running concurrently
• 170 sites
• 900,000 processing cores
• 700 PB storage available worldwide
• 15% of the resources are at CERN
• 20-40 Gbit/s links connect CERN to the Tier-1 sites
7
CERN IT Monitoring
Monitoring as a Service for
CERN Data Centre (DC), IT Services
and the WLCG collaboration
Collect, transport, store and process
metrics and logs for applications and
infrastructure
8
In 2016, MONIT was born to provide a better
monitoring infrastructure for CERN IT:
effective
scalable
sustainable
9
Challenges: Data rate & volume
o from ~ 40k machines
o > 3 TB/day (compressed)
o Input rate ~ 100 kHz
10
Challenges: Variety
Heterogeneous clients:
o IT Data Centre
o WLCG transfers
o Experiments
11
Challenges: Reliability
o spikes in rate and volume
o external service dependencies
12
Challenges: Non-technical
Migrate from legacy dashboards and tools
Stay up to date with upstream tools & trends
Build community, internal and external
13
Goals: Easy Data Integration
Flexible on schema requirements
JSON/HTTP gateways
o Integrate custom metrics, logs and alarms
Specific gateways
o Collectd, Prometheus, ActiveMQ, JDBC …
14
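A minimal sketch of what pushing a custom metric through a JSON/HTTP gateway could look like. The endpoint URL is invented; the "producer"/"type" fields follow the document format shown on the Connectors slide.

import requests

# Hypothetical MONIT-style HTTP gateway endpoint (not the real URL)
GATEWAY_URL = "https://monit-gateway.example.ch/metrics"

document = {
    "producer": "myproducer",   # who sends the data
    "type": "mytype",           # what kind of data it is
    "timestamp": 1567000000,    # epoch seconds
    "mymetricfield": "value",   # free-form payload: the schema is up to the producer
}

# The gateway is assumed to accept a list of JSON documents over HTTP
response = requests.post(GATEWAY_URL, json=[document], timeout=10)
response.raise_for_status()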
Goals: Data Pipeline
Schema independent
Data aggregation / enrichment functionality
Steering to the required storage backend
Fully based on open-source technologies
15
Architecture
[Architecture diagram: HTTP, JMS, JDBC and AVRO inputs flow through Kafka Connect (KC) and the processing layer into HDFS, InfluxDB and Elasticsearch (ES).]
16
Connectors
Source / sink for the data pipeline
Validation and simple data filtering
Metadata enrichment
HTTP / JMS / JDBC / AVRO
{
  "producer": "myproducer",
  "type": "mytype",
  ...
  "mymetricfield": "value"
}
17
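The validation and enrichment step can be pictured roughly like this. A minimal Python sketch, not the actual interceptor code; the required-field set and the added metadata are assumptions.

import socket
import time

REQUIRED_FIELDS = {"producer", "type"}  # assumed minimal contract, per the document format above

def validate_and_enrich(doc):
    """Drop documents missing required fields, add infrastructure metadata."""
    if not REQUIRED_FIELDS.issubset(doc):
        return None                                           # simple filtering: reject invalid records
    enriched = dict(doc)
    enriched.setdefault("timestamp", int(time.time() * 1000)) # ms since epoch, if the producer omitted it
    enriched["host"] = socket.gethostname()                   # metadata enrichment
    return enriched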
Connectors: Apache Flume
Protocol-based agents (sources and sinks):
o JDBC, JMS, HDFS, Elastic, HTTP, Kafka
Interceptor / Morphlines for event transformation
14 agent “types” in MONIT
> 200 instances
Scale horizontally
18
DC metrics producer
Running on > 40k machines in the Data Centre
Collectd daemon collects metrics / alarms locally
o Plugin-based, out-of-the-box OS monitoring
o Framework for implementing custom plugins
Local Flume agents for data buffering
19
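For the custom-plugin framework mentioned above, a minimal collectd Python plugin could look roughly like this. The metric itself (a queue-depth count) and its file path are invented for illustration; the plugin would still need to be enabled in collectd.conf.

# Minimal collectd Python plugin sketch, loaded via collectd's "python" plugin.
import collectd

def read_callback():
    # Invented example metric: number of lines in a hypothetical queue file
    try:
        with open("/var/spool/myapp/queue") as f:
            depth = sum(1 for _ in f)
    except OSError:
        depth = 0
    vals = collectd.Values(type="gauge", plugin="myapp", type_instance="queue_depth")
    vals.dispatch(values=[depth])   # hand the value to collectd's write plugins

collectd.register_read(read_callback)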
Transport layer
Backbone of our pipeline
o decouples producers / consumers
o enables stream processing
o resilient (72 hours data retention)
o reliable (3 replicas)
20
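A hedged sketch of how a producer could hand documents to the transport layer, using the confluent-kafka client; the broker addresses and topic name are made up.

import json
from confluent_kafka import Producer

# Hypothetical broker list
producer = Producer({"bootstrap.servers": "broker1:9092,broker2:9092"})

def send(doc, topic="monit.metrics.myproducer"):
    # Kafka decouples producers from consumers; delivery is asynchronous
    producer.produce(topic, value=json.dumps(doc).encode("utf-8"))

send({"producer": "myproducer", "type": "mytype", "mymetricfield": "value"})
producer.flush()  # block until outstanding messages are delivered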
Transport layer: Kafka cluster
On-premises (v1.0.2), based on OpenStack VMs
o 20 brokers
o ~ 15k partitions in total
o CEPH volume (2 TB each) as spool (be careful with storage latencies!)
o Rack-awareness: 1 replica per “availability zone”
21
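The retention and replication figures above translate into topic settings along these lines; a sketch using the confluent-kafka AdminClient, with an invented topic name and an illustrative partition count.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})   # hypothetical broker

# 3 replicas and 72 h retention, matching the transport-layer figures above
topic = NewTopic(
    "monit.metrics.myproducer",                # invented topic name
    num_partitions=12,                         # illustrative partition count
    replication_factor=3,
    config={"retention.ms": str(72 * 3600 * 1000)},
)
futures = admin.create_topics([topic])
futures["monit.metrics.myproducer"].result()   # raises if creation failed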
Processing: Processing platform
Transformation
o parsing, field extraction and filtering
Enrichment
o combine data from different sources
Correlation / Aggregation
o over time or other dimensions
o anomaly detection
22
Processing platform
[Diagram: users, a Mesos cluster (Marathon & Chronos) orchestrating the processing jobs, CERN IT Hadoop/HDFS and Gitlab CI.]
23
Processing platform
Logstash integrated for on-the-fly log transformation
Spark Structured Streaming in the lead role
o joining data streams easily
o handles late events
Running ~ 20 Spark production jobs (24/7)
24
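A hedged sketch (not the production code) of a Spark Structured Streaming job reading from Kafka, tolerating late events through a watermark and aggregating per producer over 5-minute windows. The schema, topic, broker and sink are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("monit-aggregation-sketch").getOrCreate()

# Assumed document schema; real MONIT documents are schema-free JSON
schema = StructType([
    StructField("producer", StringType()),
    StructField("type", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "monit.metrics.myproducer")     # invented topic
       .load())

docs = raw.select(from_json(col("value").cast("string"), schema).alias("doc")).select("doc.*")

# The watermark lets the engine accept events up to 10 minutes late before finalising a window
agg = (docs.withWatermark("timestamp", "10 minutes")
       .groupBy(window(col("timestamp"), "5 minutes"), col("producer"))
       .avg("value"))

query = (agg.writeStream.outputMode("append")
         .format("console")                                  # sink kept trivial for the sketch
         .option("truncate", "false")
         .start())
query.awaitTermination()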
Storage
Providing the right storage for each use-case
Integrating as data sources for visualization
Direct query access through APIs
Long-term data archive
25
Storage: InfluxDB
Time-series DB for storing metrics / alarms
o > 30 instances (due to the lack of cluster mode in the free version)
o Performance is tied to data cardinality
o Up to 15 years of retention policy (thanks to automatic down-sampling)
26
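The long retention works because old points are down-sampled; in InfluxQL terms that combination looks roughly like the statements below, issued here through the influxdb-python client. Host, database, measurement and policy names are invented.

from influxdb import InfluxDBClient   # influxdb-python client, assumed available

client = InfluxDBClient(host="influx.example.ch", port=8086, database="monit")  # hypothetical instance

# Keep raw points for 30 days, down-sampled points for ~15 years (780 weeks)
client.query('CREATE RETENTION POLICY "raw_30d" ON "monit" DURATION 30d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "downsampled_15y" ON "monit" DURATION 780w REPLICATION 1')

# Continuous query rolling raw points up into 1 h averages stored under the long retention policy
client.query(
    'CREATE CONTINUOUS QUERY "cq_1h_mean" ON "monit" BEGIN '
    'SELECT mean("value") AS "value" INTO "monit"."downsampled_15y"."mymeasurement" '
    'FROM "monit"."raw_30d"."mymeasurement" GROUP BY time(1h), * END'
)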
Storage: Elasticsearch
Distributed search and indexing engine
o 3 clusters (syslog, service logs and metrics)
o Stores TS data with high-cardinality fields
o ~ 100 TB (total storage at 1-month retention policy)
27
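A sketch of indexing a log document into a time-based index, assuming a recent elasticsearch-py (8.x) client; the cluster address, index naming and fields are invented, and the real clusters and mappings differ.

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://es-logs.example.ch:9200"])   # hypothetical cluster endpoint

doc = {
    "producer": "myproducer",
    "type": "logs",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "message": "something happened",     # high-cardinality fields are fine here
}

# Monthly indices make the ~1-month retention easy to enforce by dropping old indices
index_name = "monit-logs-{:%Y-%m}".format(datetime.now(timezone.utc))
es.index(index=index_name, document=doc)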
Storage: HDFS
Long-term data archive platform for Big Data analysis
o Kept forever (or per GDPR agreement)
o Compressed JSON / Parquet
o Partitioned by “date / producer / type”
28
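The date / producer / type partitioning could be produced with Spark roughly like this; the paths and the timestamp column are illustrative, not the actual archive layout.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("monit-archive-sketch").getOrCreate()

# Read a day's worth of raw compressed JSON (path is invented)
docs = spark.read.json("hdfs:///monit/raw/2019-06-01/*.json.gz")

# Derive the partition column and write compressed Parquet, partitioned as on the slide
(docs.withColumn("date", to_date(col("timestamp")))
     .write.mode("append")
     .partitionBy("date", "producer", "type")
     .option("compression", "snappy")
     .parquet("hdfs:///monit/archive"))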
Kafka Connect
Kafka framework for exchanging data with other systems
o Supports a variety of connector types
o HDFS, S3, Elasticsearch, Influx, …
o A single KC cluster handles different connectors
o Resilient & scalable
29
Kafka Connect cluster:
o 10 VMs, 44 topics, 880 tasks
o Writing to HDFS directly in Parquet (converting records from JSON)
o One connector per topic distributes well
o Compaction required afterwards (the files written are too small, since buffering a full block in memory is not practical)
30
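Connectors are registered through the Kafka Connect REST API; a hedged sketch of submitting one HDFS sink per topic. The option names follow the Confluent HDFS sink connector, but the exact keys depend on the connector version, and every name and URL here is invented.

import requests

CONNECT_URL = "http://kafka-connect.example.ch:8083"   # hypothetical KC REST endpoint

def register_hdfs_sink(topic):
    connector = {
        "name": f"hdfs-sink-{topic}",                  # one connector per topic, as on the slide
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "topics": topic,
            "tasks.max": "20",
            "hdfs.url": "hdfs://namenode.example.ch:9000",
            "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
            "flush.size": "10000",                     # small flushes -> many small files -> compaction later
        },
    }
    resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
    resp.raise_for_status()

register_hdfs_sink("monit.metrics.myproducer")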
Visualization
Grafana is a “first-class citizen”
• ~ 1000 dashboards over > 20 organizations
• Users are in charge of creating their own dashboards
Kibana data exploration
• Secured private endpoints for sensitive logs
SWAN for data-analysis (notebooks)
31
Monitoring of the Monitoring (MOM)
Second data pipeline for monitoring our own infrastructure
All MONIT metrics and logs are sent to both flows
Data de-duplicated and merged at the storage level
Using more external services for MOM to avoid replicating configuration problems (Kafka)
32
MOM Data Flow
[Diagram: metrics & logs are sent to both MONIT and MOM; both flows use the same HTTP / JMS / JDBC / AVRO inputs, Kafka Connect (KC) and HDFS, with MOM relying more on external services.]
33
Lessons Learned
The pipeline approach pays back
o reliable / resilient service
o decouple / buffers / stream processing
Kafka is a solid system backbone
Connectors & storage backends are the most
operationally expensive parts
34
What are the next steps?
Extend the Kafka Connect usage
Run the connectors on Kubernetes (K8s)
Spark on K8s for processing platform
Why not also look into KSQL?
Looking into alternative time-series DBs
35
MONIT keeps growing
In the last 12 months alone:
o + 30% new data producers (180 total)
o + 20% data volume per day (~ 3.2 TB/day total)
o + 400% new dashboards (1000 total)
o increase to ~ 1 000 000 queries/day
… more clients mean new challenges!
36
Summary
MONIT is a flexible, general-purpose monitoring
infrastructure
Easily implementable at a smaller scale
An approach that might serve other use-cases
outside of the MONIT scope
37
Thank you !
Spare slides
39
Alarms
Local on the machine
o Simple Threshold / Actuators
Grafana dashboard alarms
External (Spark, Kapacitor, custom sources …)
Integration with ticketing system
o ServiceNow
40
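As an illustration of the simple local threshold / actuator idea, a hedged sketch that turns a machine-local reading into an alarm document and posts it over the same JSON/HTTP integration used for metrics. The fields, threshold and endpoint are invented.

import os
import time
import requests

GATEWAY_URL = "https://monit-gateway.example.ch/alarms"   # hypothetical alarms endpoint

def check_disk_usage(path="/", threshold=0.90):
    """Return an alarm document if disk usage on `path` exceeds the threshold, else None."""
    st = os.statvfs(path)
    used_fraction = 1 - (st.f_bavail / st.f_blocks)
    if used_fraction < threshold:
        return None
    return {
        "producer": "myproducer",
        "type": "alarm",
        "timestamp": int(time.time()),
        "alarm_name": "disk_almost_full",
        "entity": path,
        "used_fraction": round(used_fraction, 3),
    }

alarm = check_disk_usage()
if alarm:
    requests.post(GATEWAY_URL, json=[alarm], timeout=10).raise_for_status()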