Netdata
the open-source observability
platform everyone needs
Costa Tsaousis
About Netdata
● 72k GitHub stars!
Netdata is one of the most popular observability projects!
● Leading the Observability category in CNCF!
By GitHub stars, Netdata is 1st, followed by Elasticsearch with 70k stars, Grafana with 65k, Prometheus with 56k, etc.
● 1.5 million downloads every day!
Docker Hub counts 650M pulls so far.
Cloudflare reports 51 TB of transferred data per month.
GitHub URL: https://github.com/netdata/netdata
What makes Netdata unique?
● Easy
Just install it on your servers and you are done!
Dashboards and alerts are automatically created for you.
● Real-Time
Per-second resolution of all data.
Just 1 second of latency from data collection to visualization!
● A.I. everywhere
Machine Learning learns the patterns of all your data and detects anomalies without any manual intervention or configuration!
● Cost Efficient
Designed to work out-of-the-box and to significantly lower operational costs.
Netdata Promises
● Feel the pulse of your infrastructure!
High-fidelity monitoring, revealing the micro-world of your services and applications.
● 50% to 90% observability cost reduction!
Most of our users experience significant observability cost reduction, compared to the solutions they were using before Netdata.
● Half the Mean Time to Resolution!
Machine learning for all metrics helps correlate and reveal the root cause of issues, faster and more easily.
Observability Generations
● 1st generation: Checks
Check it is there and works.
Nagios, Icinga, Zabbix, Sensu, CheckMk, PRTG, SolarWinds
● 2nd generation: Metrics
Sample it periodically.
Graphite, OpenTSDB, Prometheus, InfluxDB
● 3rd generation: Logs
Collect and index its logs.
ELK, Splunk
● 4th generation: Integrated
All in one.
Datadog, Dynatrace, New Relic, Instana, Grafana
● 5th generation: Distributed
All in one, real-time, high-fidelity, AI-powered, cost-efficient.
Netdata
Blog post for more details:
The Observability Pipeline
Typical Observability Design
Servers, VMs and applications ship observability data (metrics, logs, traces) into central observability databases, which feed dashboards 📊 📈 and alert notifications ⚠ 🔥.
The Effects of a Traditional Observability Pipeline
Low-fidelity and expensive! Observability requires serious resources and skills, and is never enough!
● Business logic is everywhere.
What to collect, what to visualize and how, what to alert on, what to index, what to retain and for how long.
● Scalability is challenging.
As the infrastructure grows, these central databases need to scale, be highly available, and perform at scale.
● A lot of moving parts.
Database servers, application servers, a lot of integrations and protocols, and a ton of different technologies.
● Expensive.
Skills, resources, a lot of time - or a lot of money.
The Results…
● Low-Fidelity Insights
The design of most tools forces users to lower granularity and cardinality, resulting in abstract views that are not useful in revealing the true dynamics of the infrastructure.
● Inefficient
They enforce a development lifecycle for setting up and maintaining observability, requiring advanced skills, a lot of discipline, and huge amounts of time and effort.
● Limited (or no) AI and Machine Learning
Most observability solutions are pretty dumb. And even when they claim to be smarter, they usually use tricks, not true machine learning.
● Expensive
Observability is frequently more expensive than the infrastructure being monitored.
The Current State of Observability
● Too little observability
Traditional check-based systems, like Nagios, Icinga, Zabbix, Sensu, PRTG, CheckMk, SolarWinds, etc., have evolved around the idea of checks or sensors. Each check has a status, possibly annotated with some text and a few metrics. Checks are executed every minute.
This is the equivalent of “traffic lights” monitoring. Reliable and robust, but not enough for today's needs, suffering from extensive “blind spots”.
● Too complex observability
DIY platforms, like the Grafana ecosystem, have evolved around a much better design (metrics, logs, traces), and although they are more powerful and highly customizable, they introduce way too many moving parts, require significant skills and expertise, have a steep learning curve, and quickly become overcomplicated to maintain and scale.
● Too expensive observability
Commercial vendors, like Datadog, Dynatrace, New Relic and Instana, each with their own strengths and weaknesses, are sometimes easier and better (to a different extent each), but they are unrealistically expensive.
The Design of Observability keeps Fidelity low
● What affects fidelity?
Granularity (the resolution of the data) and cardinality (the number of entities monitored) control the amount of data a monitoring system must ingest, process, store and retain.
Low granularity = blurry data, not that detailed
Low cardinality = blind spots, not everything is monitored
Low granularity + low cardinality = an abstract view of the infrastructure, lacking detail and coverage
● Why not have high fidelity?
Centralization is the key reason for keeping granularity and cardinality low.
Example: a system monitoring 3000 metrics every second (3k samples/s) has to process, store and query 450x more data than a system monitoring 100 metrics every 15 seconds (<7 samples/s).
Centralization makes fidelity and cost proportional to each other; increasing fidelity results in higher costs, and reducing costs leads to a decrease in fidelity.
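For clarity, the 450x figure is simply the ratio of the two ingest rates:

```latex
\frac{3000\ \text{samples/s}}{100\ \text{metrics} / 15\ \text{s}}
= \frac{3000\ \text{samples/s}}{6.67\ \text{samples/s}} \approx 450
```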
Netdata’s Design
Decentralized Design For High Fidelity
● Keep data at the edge
By keeping the data at the edge:
○ Compute and storage resources are already available and spare
○ No need for network resources
○ The work to be done is small and can be optimized, so that monitoring is a “polite citizen” to production applications
● Multiple independent centralization points
Mini centralization points may exist, as required for operational needs:
○ Ephemeral nodes that may vanish at any point in time
○ High availability of observability data
○ Offloading “sensitive” production systems from observability work
● Unify and integrate everything at query time
To provide unified, infrastructure-wide views, query the edge systems (or the mini centralization points), aggregate their responses, and provide high-resolution, real-time dashboards and alerts; a sketch of this fan-out follows below.
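A rough sketch of this query-time fan-out, not a definitive client for Netdata's API: it assumes each agent listens on the default port 19999 and answers an HTTP query with JSON; the hostnames, endpoint path and response shape are illustrative assumptions.

```python
# Fan a tiny query out to every edge agent in parallel, then merge the
# responses on the fly - no central database involved.
import concurrent.futures
import json
import urllib.request

NODES = ["node1.example.com", "node2.example.com"]  # hypothetical hosts

def query_node(host: str):
    # Each agent answers only for its own data, so the query is cheap.
    # Endpoint path and response shape are assumptions for this sketch.
    url = f"http://{host}:19999/api/v1/data?chart=system.cpu&after=-60"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return host, json.load(resp)

with concurrent.futures.ThreadPoolExecutor() as pool:
    results = dict(pool.map(query_node, NODES))

# Example aggregation, assuming each response carries a "data" list of
# [timestamp, value] rows: average the latest value across all nodes.
latest = [r["data"][-1][1] for r in results.values() if r.get("data")]
print("infrastructure-wide average:", sum(latest) / max(len(latest), 1))
```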
Netdata’s Decentralized Design: Smart Agent
Instead of centralizing the data, Netdata distributes the code!
All servers, VMs and applications provide dashboards and alerts; all data remain at the edge!
Netdata’s Decentralized Design: Netdata Parents
Multiple independent centralization points (e.g. Netdata Parent A in Data Center A, Parent B in Cloud Provider B, Parent C in Cloud Provider C) for high availability, each providing aggregated dashboards and alerts!
A Netdata Parent is the open-source Netdata Agent software, configured to accept data from other Netdata Agents.
Netdata’s Decentralized Design: Netdata Cloud
Netdata Cloud is a thin layer, a proxy, that distributes queries to Netdata Agents and merges their responses on the fly, to provide aggregated views.
Netdata Cloud does not store any observability data. It is available as SaaS or On-Prem.
Hybrid & Multi-Cloud Support
Netdata Cloud communicates only with the Netdata Parents (e.g. Parent A in Data Center A, Parent B in Cloud Provider B, Parent C in Cloud Provider C) to provide infrastructure-level views.
The pipeline inside every Netdata Agent
(diagram: the metrics and logs pipeline inside every Netdata Agent)
Common Concerns about Decentralized Designs
● The agent will be heavy
No! The Netdata agent, a complete monitoring pipeline in a box processing many thousands of metrics per second (vs. others that process just a few hundred every minute), is one of the lightest observability agents available.
● Queries will influence production systems
No! Each agent serves only its own data. Querying such a small dataset is lightweight and does not influence operations. For very sensitive or weak production systems, a mini centralization point next to these systems will isolate them from queries (and also offload them from ingestion, processing, storage and retention).
● Queries will be slower
No! They are actually faster! Distributing tiny queries in parallel to multiple systems provides an aggregate compute power many times higher than what any single system can provide.
● It will require more bandwidth
No! Querying is selective; most of the observability data are never queried unless required for exploration or troubleshooting. And even then, just a small portion of the data is examined.
So, the overall bandwidth used is a tiny fraction compared to centralized systems.
Edge Resources Required (per node)

| Resource                   | Dynatrace  | Datadog    | Instana   | Grafana   | Netdata    |
|----------------------------|------------|------------|-----------|-----------|------------|
| Granularity                | 1 minute   | 15 seconds | 1 second  | 1 minute  | 1 second   |
| Technology coverage        | Average    | High       | Low       | Average   | Excellent  |
| CPU usage (100% = 1 core)  | 12%        | 14%        | 6.7%      | 3.3%      | 3.6%       |
| Memory usage               | 1400 MB    | 972 MB     | 588 MB    | 414 MB    | 181 MB     |
| Disk space                 | 2 GB       | 1.2 GB     | 0.2 GB    | -         | 3 GB       |
| Disk read rate             | -          | 0.2 KB/s   | -         | -         | 0.3 KB/s   |
| Disk write rate            | 38.6 KB/s  | 8.3 KB/s   | -         | 1.6 KB/s  | 4.8 KB/s   |
| Egress internet bandwidth  | 11.4 GB/mo | 11.1 GB/mo | 5.4 GB/mo | 4.8 GB/mo | 0.01 GB/mo |

The Netdata agent is a full monitoring pipeline in a box, and is still one of the most lightweight agents among commercial observability offerings! Full comparison URL:
Netdata as a time-series DB vs Prometheus
We stress-tested a Netdata Parent and a Prometheus with 500 servers, 40k containers, at 2.7 million metrics/s:
● -35% CPU utilization
Netdata: 1.8 CPU cores per million metrics/s
Prometheus: 2.9 CPU cores per million metrics/s
● -49% peak memory consumption
Netdata: 49 GiB
Prometheus: 89 GiB
● -12% bandwidth
Netdata: 227 Mbps
Prometheus: 257 Mbps
● -98% disk I/O
Netdata: 3 MiB/s (no reads, 3 MiB/s writes)
Prometheus: 129 MiB/s (73.6 MiB/s reads, 55.2 MiB/s writes)
● -75% storage footprint
On 3 TiB of storage:
Netdata: 10 days per-second, 43 days per-minute, 467 days per-hour
Prometheus: 7 days per-second
Full comparison URL:
Netdata is the most lightweight platform!
In December 2023, the University of Amsterdam published a study on the impact of monitoring tools on Docker-based systems, aiming to answer two questions:
● What is the impact of monitoring tools on the energy efficiency of Docker-based systems?
● What is the impact of monitoring tools on the performance of Docker-based systems?
They found that:
- Netdata is the most efficient tool, requiring significantly fewer system resources than the others.
- Netdata is excellent in terms of performance impact, allowing containers and applications to run without any measurable impact due to observability.
Netdata outperforms other monitoring solutions in edge resource efficiency!
Full comparison URL:
Netdata
Challenge 1:
Automated data collection,
visualization and alerting!
We have a lot in common
Since we have so much in common, why does it take so long to set up a monitoring solution?
● Similar physical or virtual hardware
We all use a finite set of physical and virtual hardware. This hardware may differ in performance and capacity, but the technologies involved are standardized.
● Similar operating systems
We all use flavors of a small set of operating systems, exposing a finite set of metrics covering the monitoring of all system components and operating system layers.
● Packaged applications
Most of our infrastructure is based on packaged applications, like web servers, database servers, message brokers, caching servers, etc.
● Standard libraries
Even our custom applications usually rely on packaged libraries that expose telemetry in a finite and predictable way.
NIDL, the model for rapid deployment
NIDL stands for Nodes, Instances, Dimensions, Labels. The name comes from the slicing controls on all Netdata charts.
Netdata attaches NIDL metadata to time-series data, allowing the identification of the infrastructure components (instances) monitored (a sketch of such a sample follows after this list).
This framework enables:
● Automated data collection
Netdata auto-discovers and collects all data sources on the nodes it runs on.
● Automated visualization
Netdata dashboards are generated by an algorithm. They are not pre-configured or hard-coded. Each Netdata dashboard is unique and is driven by the available metrics.
● Automated alerts
Alerts are pre-configured templates that are automatically attached to their relevant components (disks, network interfaces, systems, databases, processes, web servers, etc.).
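A minimal sketch of a sample carrying NIDL metadata; the field names and types are illustrative, not Netdata's internal representation:

```python
# Every sample is fully identified: which node collected it (N), which
# component it belongs to (I), which metric within that component (D),
# plus arbitrary key/value metadata (L).
from dataclasses import dataclass, field

@dataclass
class Sample:
    node: str                                   # N: the collecting server
    instance: str                               # I: disk, container, app, ...
    dimension: str                              # D: the metric itself
    labels: dict = field(default_factory=dict)  # L: key/value metadata
    timestamp: int = 0
    value: float = 0.0

s = Sample(node="web-01", instance="disk:/dev/sda", dimension="reads",
           labels={"device_type": "ssd"}, timestamp=1700000000, value=12.5)
# With every sample identified this way, dashboards can be generated by an
# algorithm, and alert templates can attach themselves to matching instances.
```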
Mission accomplished!
Designed to be installed mid-crisis!
● Just install it!
One moving part: Netdata. Batteries included!
(i.e., data collection plugins and all needed modules ship with Netdata).
● Battle-tested out-of-the-box alerts!
Netdata's stock alerts detect common misconfigurations, errors, and issues.
● Troubleshoot in seconds!
Data collection does not require configuration, unless the monitored data are password protected (Netdata needs the password).
Data collection plugins provide metrics with the NIDL framework embedded into them.
Netdata
Challenge 2:
Get rid of the query language for slicing and dicing data
Slice and dice from the UI
Netdata collects a vast number of metrics you will probably see for the first time.
● Since users haven't configured the metrics themselves, can we provide a UI that explains what users see?
● How will users be able to slice and dice the data on any chart, in the way that makes sense to them?
A Netdata Chart
Netdata Cloud Live Demo URL:
Netdata Parent URL:
Info Button: Help about this chart
The info button includes links to relevant documentation and/or a helpful message about the metrics on each chart.
A Netdata Chart - controls
Each chart provides:
● An anomaly rate ribbon
● NIDL controls, to review data sources and slice/filter them (NIDL = Nodes, Instances, Dimensions, Labels)
● Aggregation across time
● Aggregation across metrics
● An info ribbon
● Controls to dice the data
A Netdata Chart - anomaly rate per node
Clicking on “Nodes” reveals, for each node contributing to the chart:
● The instances it contributes
● The unique time-series it contributes
● The visible volume it contributes
● The anomaly rate it contributes
● The minimum, average and maximum values across all metrics it contributes
● A filter to select the nodes contributing data to the chart
Similar analysis is available per instance (“application” in this chart), dimension, and label.
Dicing any chart, without queries
Result: the chart grouped by dimension, device_type.
Info Ribbon: Missing data collections
A missed data collection is a gap, because something is wrong!
Netdata does not smooth out the data.
Mission accomplished!
All this additional information is available on every query, every chart, every metric (an illustrative response shape follows below).
The Netdata query engine does all the calculations for all drop-down menus and ribbons in one go, and returns everything in a single query response.
All queries include all the information needed:
- Per Node
- Per Instance (disk, container, database, etc.)
- Per Dimension
- Per Label Key
- Per Label Value
Providing:
- Availability of the samples (gaps) over time
- Minimum, average and maximum values
- Anomaly rate for the visible timeframe
- Volume contributed to the chart
- Number of Nodes, Instances, Dimensions, Label Keys and Label Values matched
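As a rough illustration, such a single-response payload might look like the following; the structure and field names are hypothetical, not Netdata's actual API:

```python
# One query, one response: chart points plus every per-node/instance/
# dimension/label summary the UI needs for its menus and ribbons.
response = {
    "view": {"min": 0.1, "avg": 3.2, "max": 9.8, "anomaly_rate": 1.4},
    "nodes": [
        {"name": "web-01", "instances": 4, "time_series": 12,
         "volume_pct": 61.0, "anomaly_rate": 2.1, "gaps": 0},
        {"name": "web-02", "instances": 3, "time_series": 9,
         "volume_pct": 39.0, "anomaly_rate": 0.4, "gaps": 2},
    ],
    "dimensions": [{"name": "reads", "min": 0.1, "avg": 1.9, "max": 6.2}],
    "labels": {"device_type": ["ssd", "hdd"]},       # keys and values matched
    "data": [[1700000000, 3.1], [1700000001, 3.3]],  # the chart points
}
# The UI never issues follow-up queries to fill drop-down menus or ribbons.
```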
Netdata
Challenge 3:
Make machine learning and anomaly detection useful for observability
AI for observability is tricky
Google: All of Our ML Ideas Are Bad (and We Should Feel Bad)
Todd Underwood, Google - Wednesday, 2 October 2019
“The vast majority of proposed production engineering uses of Machine Learning (ML) will never work. They are structurally unsuited to their intended purposes. There are many key problem domains where SREs want to apply ML, but most of them do not have the right characteristics to be feasible in the way that we hope. After addressing the most common proposed uses of ML for production engineering and explaining why they won't work, several options will be considered, including approaches to evaluating proposed applications of ML for feasibility. ML cannot solve most of the problems most people want it to, but it can solve some problems. Probably.”
URL:
So what can ML do for us?
Using ML, we can have a simple and effective way to learn the behavior of our servers.
ML is probably the simplest way to model the behavior of individual metrics.
So, given enough past values of a metric, ML can tell us whether the value we just collected is an outlier or not.
We call this Anomaly Detection.
It is just a bit: true or false.
Over a period of time, we calculate the Anomaly Rate, i.e., the percentage of samples found to be anomalous.
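As a rough illustration of the idea (not Netdata's actual implementation), here is outlier detection for one metric using a k-means model over short lagged windows; it assumes scikit-learn is available, and the window size, cluster count and threshold are made-up values for this sketch:

```python
# Train on recent history, then flag new samples whose feature vector is
# unusually far from every learned cluster center.
import numpy as np
from sklearn.cluster import KMeans

def make_features(series, lags=6):
    # Each feature vector is the last `lags` values, so the model learns
    # short-term patterns, not just individual values.
    return np.array([series[i - lags:i] for i in range(lags, len(series))])

history = np.sin(np.linspace(0, 20, 600)) + np.random.normal(0, 0.05, 600)
X = make_features(history)
model = KMeans(n_clusters=2, n_init=10).fit(X)

# Threshold: the 99th percentile of training distances to the nearest center.
train_dist = model.transform(X).min(axis=1)
threshold = np.percentile(train_dist, 99)

def anomaly_bit(recent_window) -> bool:
    # True/False: is the newest sample an outlier given its recent context?
    d = model.transform(recent_window.reshape(1, -1)).min()
    return bool(d > threshold)

print(anomaly_bit(history[-6:]))          # normal tail -> likely False
print(anomaly_bit(np.array([5.0] * 6)))   # never-seen level -> likely True
```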
Objectives for ML in Netdata
Netdata offers Anomaly Detection for all metrics, all charts, all dashboards, and it just works, totally unsupervised.
● Unsupervised and reliable
It should work by itself to detect anomalies, reliably.
● Real-time
Anomalies should be detected in real-time, as metrics are collected.
● Lightweight
Training at the edge should not affect production applications.
● Stored in the database
The anomaly bit of each sample should be part of the sample for its lifetime, so that we can query for past anomaly rates.
● Integrated everywhere
Anomaly information should be an integral part of the platform.
Rolling Unsupervised Anomaly Detection
Machine learning needs to forget the past, otherwise anomalies will be “business as usual” forever.
● Netdata trains an ML model per metric, every 3 hours, using the last 6 hours of data of that metric. The models overlap.
● It maintains 18 ML models per metric.
● Every 3 hours, a new model is generated and the oldest is removed. So, Netdata remembers the last 57 hours (2 days, 9 hours).
● All available ML models for a metric need to agree that a collected sample is an outlier for Netdata to consider it an anomaly (see the sketch after this list).
This reduces the noise significantly.
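A minimal sketch of the rolling consensus, assuming each trained model exposes a hypothetical is_outlier() predicate:

```python
# Keep only the 18 most recent models (18 models x 3h spacing covers the
# last ~57 hours of behavior) and flag a sample only when every model agrees.
from collections import deque

models = deque(maxlen=18)   # appending the 19th drops the oldest

def on_training_cycle(new_model):
    # Called every 3 hours with a model trained on the last 6 hours of data.
    models.append(new_model)

def is_anomalous(sample_features) -> bool:
    # Consensus cuts noise: a single disagreeing model vetoes the anomaly.
    return bool(models) and all(m.is_outlier(sample_features) for m in models)
```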
Lightweight
Sliced training, and careful consideration of the metrics that benefit from ML, allow Netdata to be lightweight.
● Many metrics are usually zero.
Like hardware errors, rare exceptions, etc. These are usually covered by alerts that check for non-zero values. So, such metrics do not need to be trained.
● Other metrics are usually constant.
Like the available memory of a server, or the connection pool of a database server. These metrics do not need to be trained either.
● Sliced training for the rest.
For the rest of the metrics, we need to train a model on just 6 hours of their data, every 3 hours.
This usually amounts to less than the training of 1 metric per second.
Netdata needs ~2 CPU cores to train ML models and detect anomalies for 1 million unique metrics/s.
Stored in the database
Netdata stores anomalies together with the samples, so anomaly-based queries are possible.
● The anomaly bit is stored in the DB.
Anomaly information is stored in the database together with each sample collected.
We developed a custom floating point format that includes the anomaly bit (much like IEEE 754 stores the sign of floating point numbers), ensuring that there is no overhead at all (a bit-packing sketch follows below).
● Anomaly rate is calculated on the fly.
The Netdata query engine calculates the anomaly rates for all metrics, on the fly, in one go.
● Aggregated anomaly rate.
The Netdata query engine calculates aggregated anomaly rates when combining multiple metrics in the same query, providing a high-level anomaly rate for each chart.
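To illustrate the packing idea only (Netdata's actual on-disk format differs), here is an anomaly flag tucked into the least significant mantissa bit of a 32-bit float, costing at most one ulp of precision and zero extra bytes:

```python
# Pack an anomaly flag into the lowest mantissa bit of an IEEE 754
# single-precision float: same storage size, no extra bytes per sample.
import struct

def pack(value: float, anomalous: bool) -> int:
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (bits & ~1) | int(anomalous)   # lowest bit carries the flag

def unpack(bits: int) -> tuple[float, bool]:
    value = struct.unpack("<f", struct.pack("<I", bits & ~1))[0]
    return value, bool(bits & 1)

stored = pack(3.14, True)
print(unpack(stored))   # (~3.14, True) - value distorted by at most one ulp
```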
What can it detect? (1/5)
Point Anomalies or Strange Points: single points that represent very big or very small values, not seen before (in some statistical sense).
Examples:
● A sudden, extreme spike in the number of failed transactions for your database server.
● An unexpected moment of high CPU utilization or a sudden memory spike for your application server.
What can it detect? (2/5)
Contextual Anomalies or Strange Patterns: not strange points on their own, but unexpected sequences of points, given the history of the time-series.
Examples:
● A regular database job, or a backup, that did not run.
● A cap on the number of web requests received.
What can it detect? (3/5)
Collective Anomalies or Strange Multivariate Patterns: neither strange points nor strange patterns, but in a global sense something looks off.
Examples:
● A network issue that introduces a lot of retransmits and lowers the throughput of the web server or the workload on the database server.
What can it detect? (4/5)
Concept Drifts or Strange Trends: a slow and steady drift to a new state.
Examples:
● A memory leak in an application.
● An attack that gradually ramps up to its full load.
● A gradual increase in response latency.
What can it detect? (5/5)
Change Point Detection or Strange Steps: a shift occurred and gradually a new normal is established.
Examples:
● A faulty deployment that does not serve all the workload.
A Netdata dashboard
One fully automated dashboard, with infinite scrolling, presenting and grouping all available metrics.
Quick access to all sections using the index on the right.
Multi-dimensional data on every chart, using the chart controls to slice and dice any dataset.
AI assisting at every step.
A Netdata Dashboard - what is anomalous?
The dashboard provides a time-frame picker, an Anomaly Rate button, and the anomaly rate per section for the selected time-frame.
Anomaly Advisor
The Anomaly Advisor assists in finding the needle in the haystack.
● It uses the Host Anomaly Rate to identify durations of interest.
● The Host Anomaly Rate is the percentage of a host's metrics that were found to be anomalous concurrently.
● So, a 10% host anomaly rate means that 10% of all the metrics the host exposes were anomalous at the same time, showing the spread of an anomaly (a minimal computation follows below).
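A minimal sketch of that computation, given one anomaly bit per metric at a single moment:

```python
# Host anomaly rate: the share of a host's metrics whose anomaly bit is
# set at the same timestamp.
def host_anomaly_rate(anomaly_bits: list[bool]) -> float:
    return 100.0 * sum(anomaly_bits) / len(anomaly_bits)

# 2 of 20 metrics anomalous concurrently -> 10.0 (%)
print(host_anomaly_rate([True, True] + [False] * 18))
```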
Anomaly Advisor - starting point
The starting point is a chart plotting the Host Anomaly Rate (as a percentage) and the number of metrics that are concurrently anomalous.
Anomaly Advisor - triggering the analysis
Highlighting an area on the chart triggers the analysis.
Anomaly Advisor - the analysis
The Anomaly Advisor presents a list of all metrics, sorted by their anomaly rate during the highlighted time-frame (a minimal sketch follows below).
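A minimal sketch of that ranking, assuming per-metric anomaly bits over the highlighted window:

```python
# Rank metrics by their anomaly rate inside a highlighted window of samples.
def rank_metrics(bits_by_metric: dict[str, list[bool]], start: int, end: int):
    rates = {metric: 100.0 * sum(bits[start:end]) / (end - start)
             for metric, bits in bits_by_metric.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

bits = {"disk.io": [True] * 8 + [False] * 2, "cpu.user": [False] * 10}
print(rank_metrics(bits, 0, 10))   # disk.io first, at 80.0%
```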
Mission accomplished!
Unsupervised Anomaly Detection is an advisor!
Netdata turns AI into a consultant that helps you spot what is interesting, what is related, and what needs your attention.
● Unsupervised
There are plenty of settings, but it just works behind the scenes, learning how metrics behave and providing an anomaly score for them.
● It is just another attribute of each of your metrics
The anomaly rate is stored in the metrics database together with every sample collected, making it possible to query the past for anomalies.
● It can detect the spread of an anomaly across systems and applications.
● It can assist in finding the aha! moment while troubleshooting.
Netdata
Challenge 4:
Make logs exploration and analytics easy and affordable.
systemd-journald
systemd-journald is a hidden gem that already lives in our systems! It:
● Is available everywhere!
We use it already, even when we don't realize it.
● Is secure by design!
○ FSS (Forward Secure Sealing) to seal the logs
○ Survives disk failures (uses tmpfs)
○ Its file format is designed for minimal data loss on disk corruption
● Is unique!
○ Supports any number of fields, even per log entry (think huge cardinality)
○ Indexes all fields provided
○ Queries on any combination of fields
○ Is maintenance free - it just works!
● Has amazing ingestion performance!
● Can build log centralization points.
It provides all the tools and processes to centralize all the logs of an infrastructure in a central place.
Netdata systemd-journal Logs Explorer
systemd-journald: is it slow to query?
systemd-journald is not slow when used with Netdata.
● Yes and no.
The query performance issues are simple implementation glitches, easy to fix.
● We submitted patches to systemd.
We analyzed journalctl and found several issues that, once fixed, improve query performance 14x.
We submitted these patches to systemd.
● Netdata systemd-journal Explorer
We managed to bypass all the performance issues systemd-journal has, independently of the version of systemd installed on a system.
Netdata is fast when querying systemd-journal logs on all systems, even with a slow systemd-journal and journalctl.
systemd-journald: does it lack integrations?
The value of a logging system depends on its integrations.
● Yes, it did.
Generally, very few tools are available to push structured logs to systemd-journald (see the sketch below).
● Netdata log2journal
We released log2journal, a powerful command line tool to convert any kind of log into structured systemd-journal entries. Think of it as the equivalent of promtail.
For JSON and logfmt formatted logs, almost zero configuration is needed.
● Netdata systemd-cat-native
We released systemd-cat-native, a tool similar to the standard systemd-cat, which however allows sending a stream of entries formatted in the systemd-journal native format to a local or remote systemd-journald.
URL for log2journal:
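For context, pushing structured entries to the journal from code is already easy where bindings exist; a minimal sketch with the Python systemd bindings (the field names here are arbitrary examples):

```python
# Requires the Python systemd bindings (e.g. the `systemd-python` package).
# Every uppercase keyword becomes an indexed field on the journal entry,
# which is what makes huge per-entry cardinality possible.
from systemd import journal

journal.send("payment failed",
             PRIORITY="3",          # journald priority 3 = error
             SUBSYSTEM="billing",   # arbitrary custom fields...
             CUSTOMER_ID="12345",   # ...all indexed and queryable
             RETRIES="3")
```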
Mission accomplished!
Netdata provides the easiest and most efficient way to access your logs, by utilizing resources and tools you already use today.
Netdata provides all the tools and dashboards to explore and analyze your system and application logs, without requiring a dedicated logs database server.
Despite the storage requirements of systemd-journald, the tool is amazing, especially for developers, since it provides great flexibility and troubleshooting features.
Even if you don't want to push your traefik, haproxy or nginx access logs to it due to its storage requirements, we strongly recommend using it for application error logs and exceptions. Your troubleshooting efforts can become a lot simpler with this environment.
Netdata
Challenge 5:
Observability is more than metrics, logs
and traces. What is missing?
Challenge
To completely understand or effectively troubleshoot an issue, metrics, logs and traces may not be enough. We need more!
What if we need to examine:
● the slow queries on a database,
● the list of network connections an application has,
● the files in a filesystem,
● … and the plethora of non-metric, non-log, non-tracing information available?
Most monitoring systems give up. You have to use the console of your database server, ssh to the server, or (for others :) restart the problematic component or application and hope the issue goes away…
Can a monitoring system help?
Netdata Functions
(diagram: a user accesses a function exposed by a data collection plugin on node B5; the request is routed through Netdata Parents and a Netdata Grandparent spanning Data Center 1, Data Center 2 and Cloud Provider 1)
Example: Network Connections Explorer
Mission accomplished!
Functions are data collection plugin features to query non-metric data of any kind.
● Data collection plugins expose Functions.
Functions have a name and parameters, accept a payload, return a payload, and require certain permissions to be accessed. All of these can be custom for each and every function (a sketch follows below).
● Parents are aware of their children's Functions.
Parents are updated in real-time about changes to Functions, so that all nodes involved in a streaming and replication chain are always up to date about the available functions of the entire infrastructure behind them.
● Dashboards provide the list of Functions.
● The Netdata UI supports widgets for Functions.
We are standardizing a set of UI widgets capable of presenting different kinds of data, depending on the most appropriate way for each to be presented.
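A hypothetical sketch of what exposing and calling such a Function could look like; every name and field here is illustrative, not Netdata's plugin API:

```python
# A Function: a named, permissioned endpoint for non-metric data.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Function:
    name: str                          # e.g. "network-connections"
    params: dict                       # accepted parameters and defaults
    required_permissions: set          # who may call it
    handler: Callable[[dict], dict]    # accepts a payload, returns a payload

REGISTRY: dict = {}

def register(fn: Function):
    # In Netdata, parents learn about this in real time, so the whole
    # streaming chain knows which functions exist behind it.
    REGISTRY[fn.name] = fn

def call(name: str, payload: dict, user_permissions: set) -> dict:
    fn = REGISTRY[name]
    if not fn.required_permissions <= user_permissions:
        raise PermissionError(name)
    return fn.handler(payload)

register(Function(name="network-connections", params={"sort": "bytes"},
                  required_permissions={"view-functions"},
                  handler=lambda payload: {"connections": []}))  # stub
print(call("network-connections", {}, {"view-functions"}))
```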
Netdata
Monetization Strategy
Monetization through SaaS
Netdata Cloud (NC) complements Netdata:
● Horizontal scalability
NC provides unified dashboards and alerts, and dispatches alerts centrally, without the need to centralize all data on one server. Behind the scenes, it queries multiple Netdata agents and aggregates their responses on the fly.
● Role-Based Access Control (RBAC)
NC allows grouping infrastructure and users into “war rooms”, limiting and controlling users' access to the infrastructure.
NC also acts as a Single Sign-On provider for all your Netdata agents, limiting what users can see even when they access Netdata directly.
● Access from anywhere
NC allows accessing your Netdata servers from anywhere, without the need for a VPN.
● Mobile app for notifications
NC enables the use of the Netdata Mobile App (iOS, Android) for receiving alert notifications.
● Persisted customizations and dynamic configuration
NC enables dynamic configuration and stores user settings, custom dashboards, personalized views and related settings and options, per node, user, room, and space.
IMPORTANT
Netdata Cloud does not centralize your data.
Your data are always, and exclusively, on-prem, inside the Netdata agents you install.
Netdata Cloud queries your Netdata agents in real-time to present dashboards and alerts.
Thank You!
Costa Tsaousis
GitHub URL: https://github.com/netdata/netdata

More Related Content

PDF
stackconf 2024 | Netdata: Open Source, Distributed Observability Pipeline – J...
PDF
Network Observability – 5 Best Platforms for Observability
PDF
The Present and Future of Serverless Observability
PDF
The present and future of Serverless observability (Serverless Computing London)
PDF
初探 OpenTelemetry - 蒐集遙測數據的新標準
PPTX
DockerCon SF 2019 - Observability Workshop
PDF
How to build observability into a serverless application
PDF
How to build observability into a serverless application
stackconf 2024 | Netdata: Open Source, Distributed Observability Pipeline – J...
Network Observability – 5 Best Platforms for Observability
The Present and Future of Serverless Observability
The present and future of Serverless observability (Serverless Computing London)
初探 OpenTelemetry - 蒐集遙測數據的新標準
DockerCon SF 2019 - Observability Workshop
How to build observability into a serverless application
How to build observability into a serverless application

Similar to OSMC 2024 | Netdata: Open Source, Distributed Observability Pipeline – Journey and Challenge.pdf (20)

PDF
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
PDF
The present and future of serverless observability
PPTX
PPTX
What is Platform Observability? An Overview
PPTX
ADDO Open Source Observability Tools
PPTX
Prometheus - Open Source Forum Japan
PPTX
Evolution of Monitoring and Prometheus (Dublin 2018)
PDF
How to build observability into Serverless (BuildStuff 2018)
PDF
How to build observability into a serverless application
PDF
stackconf 2025 | Evolving Shift Left: Integrating Observability into Modern S...
PDF
Yan Cui - How to build observability into a serverless application - Codemoti...
DOCX
Observability A Critical Practice to Enable Digital Transformation
PDF
The present and future of serverless observability (QCon London)
PDF
The present and future of Serverless observability
PDF
The present and future of Serverless observability
PPTX
Observability for Application Developers (1)-1.pptx
PDF
SRE Certification and SRE Courses Online in India – Visualpath.pdf
PDF
Final observability starts_with_data
PDF
Cloud Observability in Action MEAP V06 Michael Mh9 Hausenblas
PPTX
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
The present and future of serverless observability
What is Platform Observability? An Overview
ADDO Open Source Observability Tools
Prometheus - Open Source Forum Japan
Evolution of Monitoring and Prometheus (Dublin 2018)
How to build observability into Serverless (BuildStuff 2018)
How to build observability into a serverless application
stackconf 2025 | Evolving Shift Left: Integrating Observability into Modern S...
Yan Cui - How to build observability into a serverless application - Codemoti...
Observability A Critical Practice to Enable Digital Transformation
The present and future of serverless observability (QCon London)
The present and future of Serverless observability
The present and future of Serverless observability
Observability for Application Developers (1)-1.pptx
SRE Certification and SRE Courses Online in India – Visualpath.pdf
Final observability starts_with_data
Cloud Observability in Action MEAP V06 Michael Mh9 Hausenblas
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Ad

Recently uploaded (20)

PPTX
Sustainable Forest Management ..SFM.pptx
PPTX
water for all cao bang - a charity project
PDF
Presentation1 [Autosaved].pdf diagnosiss
PPTX
_ISO_Presentation_ISO 9001 and 45001.pptx
PDF
Nykaa-Strategy-Case-Fixing-Retention-UX-and-D2C-Engagement (1).pdf
PPTX
Human Mind & its character Characteristics
PPTX
fundraisepro pitch deck elegant and modern
DOCX
"Project Management: Ultimate Guide to Tools, Techniques, and Strategies (2025)"
PDF
Swiggy’s Playbook: UX, Logistics & Monetization
PPTX
Tour Presentation Educational Activity.pptx
PPTX
lesson6-211001025531lesson plan ppt.pptx
PPTX
nose tajweed for the arabic alphabets for the responsive
PPTX
Relationship Management Presentation In Banking.pptx
PPTX
Project and change Managment: short video sequences for IBA
PDF
oil_refinery_presentation_v1 sllfmfls.pdf
PPTX
INTERNATIONAL LABOUR ORAGNISATION PPT ON SOCIAL SCIENCE
PPTX
2025-08-10 Joseph 02 (shared slides).pptx
PPT
First Aid Training Presentation Slides.ppt
PPTX
Tablets And Capsule Preformulation Of Paracetamol
PPTX
An Unlikely Response 08 10 2025.pptx
Sustainable Forest Management ..SFM.pptx
water for all cao bang - a charity project
Presentation1 [Autosaved].pdf diagnosiss
_ISO_Presentation_ISO 9001 and 45001.pptx
Nykaa-Strategy-Case-Fixing-Retention-UX-and-D2C-Engagement (1).pdf
Human Mind & its character Characteristics
fundraisepro pitch deck elegant and modern
"Project Management: Ultimate Guide to Tools, Techniques, and Strategies (2025)"
Swiggy’s Playbook: UX, Logistics & Monetization
Tour Presentation Educational Activity.pptx
lesson6-211001025531lesson plan ppt.pptx
nose tajweed for the arabic alphabets for the responsive
Relationship Management Presentation In Banking.pptx
Project and change Managment: short video sequences for IBA
oil_refinery_presentation_v1 sllfmfls.pdf
INTERNATIONAL LABOUR ORAGNISATION PPT ON SOCIAL SCIENCE
2025-08-10 Joseph 02 (shared slides).pptx
First Aid Training Presentation Slides.ppt
Tablets And Capsule Preformulation Of Paracetamol
An Unlikely Response 08 10 2025.pptx
Ad

OSMC 2024 | Netdata: Open Source, Distributed Observability Pipeline – Journey and Challenge.pdf

  • 1. Netdata the open-source observability platform everyone needs Costa Tsaousis
  • 2. Page: 2 About Netdata ● 72k Github Stars! Netdata is one of the most popular observability projects! ● Leading the Observability category in CNCF! In terms of Github stars: Netdata is 1st, then was Elasticsearch with 70k stars, Grafana has 65k, Prometheus has 56k, etc. ● 1.5 million downloads every day! Docker hub is counting 650M pulls so far. Cloudflare reports 51TB of transferred data per month. GitHub URL:
  • 3. Page: 3 What makes Netdata unique? ● Easy Just install it on your servers and you are done! Dashboards and alerts are automatically created for you. ● Real-Time Per second resolution of all data. Just 1-second data collection to visualization latency! ● A.I. everywhere Machine Learning learns the patterns of all your data and detects anomalies without any manual intervention or configuration! ● Cost Efficient Designed to be used out-of-the-box and significantly lower operational costs.
  • 4. Page: 4 Netdata Promises ● Feel the pulse of your infrastructure! High fidelity monitoring, revealing the micro-world of your services and applications. ● 50% to 90% observability cost reduction! Most of our users experience significant observability cost reduction, compared to the solutions they were using before Netdata. ● Half Mean Time to Resolution! Machine learning for all metrics helps correlating and revealing the root cause of issues, faster and easier.
  • 5. Page: 5 Observability Generations 1st generation Checks 2nd generation Metrics 3rd generation Logs 5th generation Distributed Check it is there and works 4rd generation Integrated Sample it periodically Collect and index its logs All in One All in One, Real-Time, High-Fidelity, AI-Powered, Cost-Efficient Nagios, Icinga, Zabbix, Sensu, CheckMk, PRTG, SolarWinds Graphite, OpenTSDB, Prometheus, InfluxDB ELK Splunk Datadog, Dynatrace, New Relic, Instana, Grafana Netdata Blog post for more details:
  • 7. Page: 7 Typical Observability Design Servers, VMs, Applications Central Observability Databases Observability Data Metrics, Logs, Traces Dashboards 📊 📈 Alert Notifications ⚠ 🔥
  • 8. Page: ● Business logic is everywhere. What to collect, what to visualize and how, what to alert, what to index, what to retain and for how long. ● Scalability is challenging As the infra grows, these central DBs need to scale, be highly available and perform at scale. ● A lot of moving parts Database servers, application servers, a lot of integrations and protocols, and a tone of different technologies. ● Expensive Skills, resources, a lot of time - or a lot of money. Low-fidelity and expensive! Observability requires serious resources and skills, and is never enough! 8 The Effects of a Traditional Observability Pipeline
  • 9. Page: 9 The Results… ● Low Fidelity Insights The design of most tools force users to lower granularity and cardinality, resulting in abstract views, that are not useful in revealing the true dynamics of the infrastructure. ● Inefficient They enforce a development lifecycle in setting up and maintaining observability, requiring advanced skills, a lot of discipline and huge amounts of time and effort. ● Limited (or no) AI and Machine Learning Most observability solutions are pretty dump. And even when they say they are smarter, they usually use tricks, not true machine learning. ● Expensive Observability is frequently more expensive than the cost of the infrastructure that is being monitored.
  • 10. Page: 10 The Current State of Observability ● Too little observability Traditional check-based systems, like Nagios, Icinga, Zabbix, Sensu, PRTG, CheckMk, SolarWinds, etc. have evolved around the idea of checks or sensors. Each check has a status, probably annotated with some text and a few metrics. Checks are executed every minute. Equivalent to “traffic lights” monitoring. Reliable and robust, but not enough for today's needs, suffering from extensive “blind spots”. ● Too complex observability DIY platforms, like the Grafana ecosystem, have evolved around a much better design (metrics, logs, traces) and although they are more powerful and highly customizable, they introduce way too many moving parts, require significant skills and expertise, have a steep learning curve, and they quickly become overcomplicated to maintain and scale. ● Too expensive observability Commercial vendors, like Datadog, Dynatrace, NewRelic, Instana, each with its own strengths and weaknesses, although sometimes they are easier and better (to a different extend each), they are unrealistically expensive.
  • 11. Page: 11 The Design of Observability keeps Fidelity low ● What affects fidelity? Granularity (the resolution of the data) and cardinality (the number of entities monitored), control the amount of data a monitoring system must ingest, process, store and retain. Low granularity = blurry data, not that detailed Low cardinality = blind spots, not everything is monitored Low granularity + low cardinality = abstract view of the infrastructure, lacking detail and coverage ● Why not having high fidelity? Centralization is the key reason for keeping granularity and cardinality low. Example: a system monitoring 3000 metrics every second (3k samples/s), has to process, store and query 450x more data compared to a system monitoring 100 metrics every 15 seconds (<7 samples/s). Centralization makes fidelity and cost proportional to each other; increasing fidelity results in higher costs, and reducing costs leads to a decrease in fidelity.
  • 13. Page: 13 Decentralized Design For High Fidelity ● Keep data at the edge By keeping the data at the edge: ○ Compute & storage resources are already available and spare ○ No need for network resources ○ The work to be done is small and it can be optimized, so that monitoring is a “polite citizen” to production applications ● Multiple independent centralization points Mini centralization points may exist, as required for operational needs: ○ Ephemeral nodes, that may vanish at any point in time ○ High availability of observability data ○ Offloading “sensitive” production systems from observability work ● Unify and integrate everything at query time To provide unified infrastructure-wide views, query edge systems (or the mini centralization points), aggregate their responses and provide high-resolution, real-time dashboards and alerts.
  • 14. Page: 14 Netdata’s Decentralized Design: Smart Agent Servers, VMs, Applications Instead of centralizing the data, Netdata distributes the code! All servers provide dashboards and alerts, all data remain at the edge!
  • 15. Page: 15 Netdata’s Decentralized Design: Netdata Parents Data Center A Netdata Parent A Cloud Provider B Cloud Provider C Multiple independent centralization points for high availability, each providing, aggregated dashboards and alerts! A Netdata Parent is the open-source Netdata Agent software, configured to accept data from other Netdata Agents. Netdata Parent B Netdata Parent C
  • 16. Page: 16 Netdata’s Decentralized Design: Netdata Cloud Servers, VMs, Applications Netdata Cloud A thin layer, a proxy, that distributes queries to Netdata agents and merges their responses on the fly, to provide aggregated views. Netdata Cloud does not store any observability data. It is available as SaaS or On-Prem. Netdata Cloud
  • 17. Page: 17 Hybrid & Multi-Cloud Support Data Center A Netdata Parent A Cloud Provider B Cloud Provider C Netdata Cloud communicates only with Netdata Parents to provide infrastructure level views. Netdata Parent B Netdata Parent C Netdata Cloud
  • 18. Page: 18 The pipeline inside every Netdata Agent Metrics Logs
  • 19. Page: 19 Common Concerns about Decentralized Designs ● The agent will be heavy No! The Netdata agent, that is a complete monitoring pipeline in a box, processing may thousands of metrics per second (vs. others that process just a few hundreds every minute), is one of the lightest observability agents available. ● Queries will influence production systems No! Each agent serves only its own data. Querying such a small dataset is lightweight and does not influence operations. For very sensitive or weak production systems, a mini-centralization point next to these systems will isolate them from queries (and also offload them from ingestion, processing, storage and retention). ● Queries will be slower No! They are actually faster! Distributing tiny queries in parallel to multiple systems, provides an aggregate compute power that is many times higher to what any single system can provide. ● Will require more bandwidth No! Querying is selective, most of the observability data are never queried unless required for exploration or troubleshooting. And even then, just a small portion of the data is examined. So, the overall bandwidth used is a tiny fraction compared to centralized systems.
  • 20. Page: 20 Edge Resources Required (per node) Resource Dynatrace Datadog Instana Grafana Netdata Granularity 1-minute 15-seconds 1-second 1-minute 1-second Technology Coverage Average High Low Average Excellent CPU Usage (100% = 1 core) 12% 14% 6.7% 3.3% 3.6% Memory Usage 1400 MB 972 MB 588 MB 414 MB 181 MB Disk Space 2 GB 1.2 GB 0.2 GB - 3 GB Disk Read Rate - 0.2 KB/s - - 0.3 KB/s Disk Write Rate 38.6 KB/s 8.3 KB/s - 1.6 KB/s 4.8 KB/s Egress Internet Bandwidth 11.4 GB/mo 11.1 GB/mo 5.4 GB/mo 4.8 GB/mo 0.01 GB/mo The Netdata agent is a full monitoring in a box, and still is one of the most lightweight agents among commercial observability offerings! full comparison URL:
  • 21. Page: ● -35% CPU Utilization Netdata: 1.8 CPU cores per million of metrics/s Prometheus: 2.9 CPU cores per million of metrics/s ● -49% Peak Memory Consumption Netdata: 49 GiB Prometheus: 89 GiB ● -12% Bandwidth Netdata: 227 Mbps Prometheus: 257 Mbps ● -98% Disk I/O Netdata: 3 MiB/s (no reads, 3 MiB/s writes) Prometheus: 129 MiB/s (73.6 MiB/s reads, 55.2 MiB/s writes) ● -75% Storage Footprint On 3 TiB of storage: Netdata: 10 days per-sec, 43 days per-min, 467 days per-hour Prometheus: 7 days per-sec Stress tested a Netdata parent and a Prometheus with 500 servers, 40k containers, at 2.7 million metrics/s 21 Netdata as a time-series db vs Prometheus :full comparison URL
  • 22. Page: In December 2023, University of Amsterdam published a study related to the impact of monitoring tools for Docker based systems, aiming to answer 2 questions: ● The impact of monitoring tools on the energy efficiency of Docker-based systems ● The impact of monitoring tools on the performance of Docker-based systems They found that: - Netdata is the most efficient tool, Requiring significantly less system resources than the others. - Netdata is excellent in terms of performance impact, Allowing containers and applications to run without any measurable impact due to observability. Outperforming other monitoring solutions in edge resources efficiency! 22 Netdata is the most lightweight platform! :full comparison URL
  • 23. Netdata Challenge 1: Automated data collection, visualization and alerting!
  • 24. Page: ● Similar physical or virtual hardware We all use a finite set of physical and virtual hardware. This hardware may be different in terms of performance and capacity, but the technologies involved are standardized. ● Similar operating systems We all use flavors of a small set of operating systems, exposing a finite set of metrics covering the monitoring of all system components and operating system layers. ● Packaged applications Most of our infrastructure is based on packaged applications, like web servers, database servers, message brokers, caching servers, etc. ● Standard libraries Even for our custom applications, we usually rely on packaged libraries that expose telemetry in a finite and predictable way. Since we have so much in common, why it takes so long to set up a monitoring solution? 24 We have a lot in common
  • 25. Page: Netdata attaches NIDL metadata to time-series data, allowing the identification of the infrastructure components (instances) monitored. This framework enables: ● Automated data collection Netdata auto-discovers and collects all data sources on the nodes it runs. ● Automated visualization Netdata dashboards are generated by an algorithm. They are not pre-configured or hard-coded. Each Netdata dashboard is unique and is driven by the available metrics. ● Automated alerts Alerts are pre-configured templates that are automatically attached to their relative components (disks, network interfaces, systems, databases, processes, web servers, etc). NIDL stands for: - Nodes - Instances - Dimensions - Labels The name comes from the slicing controls on all Netdata charts. 25 NIDL, the model for rapid deployment
  • 26. Page: ● Just install it! One moving part: Netdata. Batteries included! (i.e. data collection plugins and all needed modules are shipped with Netdata). ● Battle-tested out-of-the-box alerts! Netdata stock alerts detect common misconfigurations, errors, and issues. ● Troubleshoot in seconds! Data collection does not require configuration, unless the monitored data are password protected (Netdata needs the password). Data collection plugins provide metrics with the NIDL framework embedded into them. Designed to be installed mid-crisis! 26 Mission accomplished!
  • 27. Netdata Challenge 2: get rid of the query language for slicing and dicing data
  • 28. Page: ● Since users haven’t configured the metrics themselves, can we provide a UI that can explain what users see? ● How users will be able to slice and dice the data on any chart, the way it makes sense for them? Netdata collects a vast number of metrics you will probably see for the first time 28 Slice and dice from the UI
  • 29. Page: 29 A Netdata Chart Netdata Cloud Live Demo URL: :Netdata Parent URL
  • 30. Page: 30 Info Button: Help about this chart Info button includes links to relevant documentation and/or some helpful message about the metrics on each chart.
  • 31. Page: 31 A Netdata Chart - controls Anomaly rate ribbon NIDL Controls - review data sources and slice/filter them (NIDL = Nodes, Instances, Dimensions, Labels) Aggregation across time Aggregation across metrics Info ribbon Dice the data
  • 32. Page: 32 A Netdata Chart - anomaly rate per node Instances per Node contributing to this chart Unique time-series per Node contributing to this chart The visible volume each Node is contributing to this chart The anomaly rate each Node contributes to this chart Clicked on Nodes The minimum, average and maximum values across all metrics this Node contributes Similar analysis is available per Instance (“application” in this chart), dimensions, and labels. Filter Nodes contributing data to this chart
  • 33. Page: 33 Dicing any chart, without queries Result: dimension,device_type
  • 34. Page: 34 Info Ribbon: Missing data collections A missed data collection is a gap, because something is wrong! Netdata does not smooth out the data.
  • 35. Page: The Netdata query engine, does all the calculations, for all drop down menus and ribbons in one go and returns everything in a single query response. All queries, include all information needed: - Per Node - Per Instance (disk, container, database, etc) - Per Dimension - Per Label Key - Per Label Value Providing: - Availability of the samples (gaps), over time - Min, Average and Maximum values - Anomaly Rate for the visible timeframe - Volume contributing to the chart - Number of Nodes, Instances, Dimensions, Label Keys, Label Values matched All this additional information is available on every query, every chart, every metric! 35 Mission accomplished!
  • 36. Netdata Challenge 3: make machine learning and anomaly detection useful for observability
  • 37. Page: Wednesday, 2 October, 2019 Todd Underwood, Google The vast majority of proposed production engineering uses of Machine Learning (ML) will never work. They are structurally unsuited to their intended purposes. There are many key problem domains where SREs want to apply ML but most of them do not have the right characteristics to be feasible in the way that we hope. After addressing the most common proposed uses of ML for production engineering and explaining why they won't work, several options will be considered, including approaches to evaluating proposed applications of ML for feasibility. ML cannot solve most of the problems most people want it to, but it can solve some problems. Probably. Google: All of Our ML Ideas Are Bad (and We Should Feel Bad) 37 AI for observability is tricky :URL
  • 38. Page: ML is probably the simplest way to model the behavior of individual metrics. So, given enough past values of a metric, ML can tell us if the value we just collected is an outlier or not. We call this Anomaly Detection. It is just a bit. True or False. Over a period of time, we calculate the Anomaly Rate, ie the % of samples that found to be anomalous. Using ML we can have a simple and effective way to learn the behavior of our servers. 38 So what can ML do for us?
  • 39. Page: ● Unsupervised and Reliable It should work by itself to detect anomalies, reliably. ● Real-time Anomalies should be detected in real-time, as metrics are collected. ● Lightweight Training at the edge should not affect production applications. ● Stored in the database The anomaly bit of each sample, should be part of the sample for its lifetime, so that we can query for past anomaly rates. ● Integrated everywhere Anomaly information should be an integral part of the platform. Netdata offers Anomaly Detection for all metrics, all charts, all dashboards, and it just works, totally unsupervised. 39 Objectives for ML in Netdata
  • 40. Page: ● Netdata trains a ML model per metric, every 3 hours, using the last 6 hours of data of each metric. The models are overlapping. ● It maintains 18 ML models per metric. ● Every 3 hours, a new model is generated and the oldest is removed, So, Netdata remembers the last 57 hours (2days, 9 hours). ● All available ML models for a metric need to agree that a collected sample is an outlier, for Netdata to consider it an Anomaly. This reduces the noise significantly. Machine Learning needs to forget the past, otherwise anomalies will be “business as usual” forever. 40 Rolling Unsupervised Anomaly Detection
  • 41. Page: ● Many metrics are usually zero. Like hardware errors, rare exceptions, etc. These are usually covered by alerts that check for non-zero values. So, all such metrics do not need to be trained. ● Other metrics are usually constant. Like the available memory of a server, or the pool of connections of a database server. These metrics do not need to be trained either. ● Sliced training for the rest. For the rest of the metrics, we need to train a model for just 6 hours of their data, trained every 3 hours. This is usually less than training 1 metric per second. Netdata needs ~2 CPU cores, for training ML models and detecting anomalies, for 1 million unique metrics/s. Sliced training and careful consideration of the metrics that benefit from ML, allows Netdata to be lightweight. 41 Lightweight
  • 42. Page: ● The anomaly bit is stored in the db. Anomaly information is stored in the database together with each sample collected. We developed a custom floating point number, which includes the anomaly bit (much like IEEE 745 stores the sign of floating point numbers), ensuring that there is no overhead at all. ● Anomaly rate is calculated on the fly. The Netdata query engine calculates the anomaly rates for all metrics, on the fly, in one go. ● Aggregated anomaly rate. The Netdata query engine calculates aggregated anomaly rates when combining multiple metrics in the same query, providing a high level anomaly rate for each chart. Netdata stores anomalies together with the samples, so anomaly based queries are possible. 42 Stored in the database
  • 43. Page: Point Anomalies or Strange Points: Single points that represent very big or very small values, not seen before (in some statistical sense). 43 What it can detect? (1/5) Examples: ● A sudden, extreme spike in the number of failed transactions for your database server. ● An unexpected, moment of high CPU utilization or sudden memory spike for your application server.
  • 44. Page: Contextual Anomalies or Strange Patterns: Not strange points in their own, but unexpected sequences of points, given the history of the time-series. 44 What it can detect? (2/5) Examples: ● A regular database job, or a backup that did not run. ● A cap on the number of web requests received.
  • 45. Page: Collective Anomalies or Strange Multivariate Patterns: Neither strange points nor strange patterns, but in global sense something looks off. 45 What it can detect? (3/5) Examples: ● A network issue that introduces a lot of retransmits, lowers the throughput of the web server or the workload on the database server.
  • 46. Page: Concept Drifts or Strange Trends: A slow and steady drift to a new state. 46 What it can detect? (4/5) Examples: ● A memory leak in an application. ● An attack that is gradually increased to its full load. ● A gradual increase in response time latency.
  • 47. Page: Change Point Detection or Strange Step: A shift occurred and gradually a new normal is established. 47 What it can detect? (5/5) Examples: ● A faulty deployment that does not serve all the workload.
  • 48. Page: 48 A Netdata dashboard One fully automated dashboard, with infinite scrolling, presenting and grouping all metrics available. Quick access to all sections using the index on the right. Multi-dimensional data on every chart, using chart controls to slice and dice any dataset. AI assisting on every step.
  • 49. Page: 49 A Netdata Dashboard - what is anomalous? Time-frame picker Anomaly rate per section for the time-frame Anomaly Rate button
  • 50. Page: ● Uses Host Anomaly Rate to identify durations of interest. ● Host Anomaly Rate is the percentage of the metrics of a host, that were found to be anomalous concurrently. ● So, 10% host anomaly rate, means that 10% of all the metrics the host exposes, were anomalous at the same time, showing the spread of an anomaly. Anomaly advisor assists in finding the needle in the haystack. 50 Anomaly Advisor
• 51. Page: Anomaly Advisor - starting point
(Screenshot callouts: the host anomaly rate percentage, and the number of metrics concurrently anomalous.)
• 52. Page: Anomaly Advisor - triggering the analysis
Highlighting an area on the chart triggers the analysis.
• 53. Page: Anomaly Advisor - the analysis
The Anomaly Advisor presents a list of all metrics, sorted by their anomaly rate during the highlighted time-frame.
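A minimal sketch of that ordering, with illustrative types: each metric's anomaly rate over the window is the share of its samples whose anomaly bit is set, and metrics are sorted descending:

```c
// Anomaly Advisor ranking sketch: score every metric over the
// highlighted window, then sort by anomaly rate, highest first.
#include <stdlib.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    const bool *bits;    // anomaly bit per sample in the window
    size_t n_samples;
    double anomaly_rate; // filled in by score(), 0..100
} metric_window_t;

static void score(metric_window_t *m) {
    size_t hits = 0;
    for (size_t i = 0; i < m->n_samples; i++) hits += m->bits[i];
    m->anomaly_rate = m->n_samples ? 100.0 * hits / m->n_samples : 0.0;
}

static int by_rate_desc(const void *a, const void *b) {
    double ra = ((const metric_window_t *)a)->anomaly_rate;
    double rb = ((const metric_window_t *)b)->anomaly_rate;
    return (ra < rb) - (ra > rb);   // descending order
}

static void rank_metrics(metric_window_t *metrics, size_t n) {
    for (size_t i = 0; i < n; i++) score(&metrics[i]);
    qsort(metrics, n, sizeof(*metrics), by_rate_desc);
}
```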
• 54. Page: Mission accomplished!
Unsupervised Anomaly Detection is an advisor!
Netdata turns AI into a consultant that helps you spot what is interesting, what is related, and what needs your attention.
● Unsupervised
There are plenty of settings, but it just works behind the scenes, learning how metrics behave and providing an anomaly score for them.
● It is just another attribute of each of your metrics.
The anomaly rate is stored in the metrics database together with every sample collected, making it possible to query the past for anomalies.
● Can detect the spread of an anomaly across systems and applications.
● Can assist in finding the aha! moment while troubleshooting.
• 55. Netdata Challenge 4: Make logs exploration and analytics easy and affordable.
• 56. Page: systemd-journald
systemd-journald is a hidden gem that already lives in our systems!
● It is available everywhere!
We use it already, even when we don't realize it.
● It is secure by design!
○ FSS (Forward Secure Sealing) to seal the logs.
○ Survives disk failures (can use tmpfs).
○ Its file format is designed for minimal data loss on disk corruption.
● It is unique!
○ Supports any number of fields, even per log entry (think huge cardinality).
○ Indexes all fields provided.
○ Queries on any combination of fields (see the query sketch below).
○ Maintenance free - it just works!
● Amazing ingestion performance!
● Can build logs centralization points.
It provides all the tools and processes to centralize all the logs of an infrastructure to a central place.
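To show how queryable journald already is, a minimal sketch using the sd-journal C API (part of libsystemd; build with -lsystemd), filtering for error-priority entries:

```c
// Query the local journal for error-level entries and print their
// MESSAGE field. sd_journal_get_data returns "FIELD=value" pairs.
#include <systemd/sd-journal.h>
#include <stdio.h>

int main(void) {
    sd_journal *j;
    if (sd_journal_open(&j, SD_JOURNAL_LOCAL_ONLY) < 0)
        return 1;

    // Match on an indexed field: PRIORITY=3 is syslog "err" severity.
    sd_journal_add_match(j, "PRIORITY=3", 0);

    SD_JOURNAL_FOREACH(j) {                  // iterate matching entries
        const void *data;
        size_t length;
        if (sd_journal_get_data(j, "MESSAGE", &data, &length) >= 0)
            printf("%.*s\n", (int)length, (const char *)data);
    }

    sd_journal_close(j);
    return 0;
}
```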
• 58. Page: systemd-journald: is it slow to query?
systemd-journal is not slow when used with Netdata.
● Yes and no.
The query performance issues are simple implementation glitches, easy to fix.
● We submitted patches to systemd.
We analyzed journalctl and found several issues that, once fixed, improve query performance 14x. We submitted these patches to systemd.
● Netdata systemd-journal Explorer
We managed to bypass all the performance issues systemd-journal has, independently of the version of systemd installed on a system. Netdata is fast when querying systemd-journal logs on all systems, even with a slow systemd-journal and journalctl.
• 59. Page: systemd-journald: it lacks integrations
The value of a logging system depends on its integrations.
● Yes, it did.
Generally, very few tools are available to push structured logs to systemd-journald.
● Netdata log2journal
We released log2journal, a powerful command line tool that converts any kind of log into structured systemd-journal entries. Think of it as the equivalent of promtail. For json and logfmt formatted logs, almost zero configuration is needed.
● Netdata systemd-cat-native
We released systemd-cat-native, a tool similar to the standard systemd-cat, which however allows sending a stream of entries, formatted in the systemd-journal native format, to a local or remote systemd-journald.
URL for log2journal:
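As an illustration, the journal export style these tools deal in is a stream of FIELD=VALUE lines, with a blank line terminating each entry. A minimal sketch that emits such a stream (values containing newlines need a binary encoding, omitted here; MESSAGE, PRIORITY and SYSLOG_IDENTIFIER are standard journal fields, the sample values are made up):

```c
// Emit journal export-format entries on stdout, the kind of stream a
// tool like systemd-cat-native can forward to a local or remote journald.
#include <stdio.h>

static void emit_entry(const char *message, const char *ident, int priority) {
    printf("MESSAGE=%s\n", message);
    printf("PRIORITY=%d\n", priority);      // 0..7, syslog-style severity
    printf("SYSLOG_IDENTIFIER=%s\n", ident);
    printf("\n");                           // blank line = end of entry
}

int main(void) {
    emit_entry("payment failed: card declined", "shop-backend", 3);
    emit_entry("retrying payment", "shop-backend", 5);
    return 0;
}
```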
• 60. Page: Mission accomplished!
Netdata provides the easiest and most efficient way to access your logs, by utilizing resources and tools you already use today.
Netdata provides all the tools and dashboards to explore and analyse your system and application logs, without requiring a dedicated logs database server.
Despite the storage requirements of systemd-journald, the tool is amazing, especially for developers, since it provides great flexibility and troubleshooting features. Even if you don't want to push your traefik, haproxy or nginx access logs to it due to its storage requirements, we strongly recommend using it for application error logs and exceptions. Your troubleshooting efforts can become a lot simpler in this environment.
  • 61. Netdata Challenge 5: Observability is more than metrics, logs and traces. What is missing?
• 62. Page: Challenge
To completely understand, or effectively troubleshoot an issue, metrics, logs and traces may not be enough. What if we need to examine:
● the slow queries on a database,
● the list of network connections an application has,
● the files in a filesystem,
● … and the plethora of non-metric, non-log, non-tracing information available?
Most monitoring systems give up. You have to use the console of your database server, ssh to the server, or simply restart the problematic component or application and hope the issue goes away…
Can a monitoring system help?
To completely understand, or effectively troubleshoot an issue, we need more!
• 63. Page: Netdata Functions
(Diagram: nodes A1-A5, B1-B5 and C1-C5, spread across Data Center 1, Data Center 2 and Cloud Provider 1, stream to Netdata Parents PA, PB and PC, and onwards to an alerting Netdata Grandparent GP. A user accesses a function exposed by a data collection plugin on node B5; the request is routed through the chain of parents to that plugin.)
• 65. Page: Mission accomplished!
Functions are data collection plugin features that query non-metric data of any kind.
● Data collection plugins expose Functions.
Functions have a name, some parameters, accept a payload, return a payload, and require certain permissions to access them. All of these can be custom for each and every function. (A sketch of the concept follows.)
● Parents are aware of their children's Functions.
Parents are updated in real-time about changes to Functions, so that all nodes involved in a streaming and replication chain are always up to date about the available Functions of the entire infrastructure behind them.
● Dashboards provide the list of Functions.
● The Netdata UI supports widgets for Functions.
We are standardizing a set of UI widgets capable of presenting different kinds of data, depending on which is the most appropriate way for them to be presented.
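A hypothetical sketch of the concept; the types, names and JSON shape below are illustrative only, not Netdata's actual plugin protocol:

```c
// A plugin registers a named, permissioned callable that returns a
// non-metric payload on demand, for the UI widgets to render.
#include <stdio.h>

typedef struct {
    const char *name;                 // e.g. "mysql-slow-queries"
    const char *help;                 // shown where Functions are listed
    const char *required_permission;  // who may invoke it
    // Executes the function; returns a JSON payload for the UI.
    const char *(*execute)(const char *params);
} netdata_function_t;

static const char *slow_queries(const char *params) {
    (void)params;                     // e.g. time-frame, limit, sort order
    return "{\"queries\":[{\"sql\":\"SELECT ...\",\"duration_ms\":5400}]}";
}

// The plugin announces its functions; parents learn about them in
// real-time, so any node up the streaming chain can route a user's
// request back to the one plugin that can answer it.
static netdata_function_t functions[] = {
    { "mysql-slow-queries", "List the current slow queries",
      "view-database", slow_queries },
};

int main(void) {
    printf("%s: %s\n", functions[0].name, functions[0].execute(""));
    return 0;
}
```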
• 67. Page: Netdata Cloud (NC) complements Netdata
Monetization through SaaS
● Horizontal Scalability
NC provides unified dashboards and alerts, and dispatches alerts centrally, without the need to centralize all data on one server. Behind the scenes, it queries multiple Netdata Agents and aggregates their responses on the fly (see the sketch below).
● Role Based Access Control (RBAC)
NC allows grouping infrastructure and users into "war rooms", limiting and controlling users' access to the infrastructure. NC also acts as a Single Sign-On provider for all your Netdata Agents, limiting what users can see even when they access Netdata directly.
● Access from anywhere
NC allows accessing your Netdata servers from anywhere, without the need for a VPN.
● Mobile App for Notifications
NC enables the use of the Netdata Mobile App (iOS, Android) for receiving alert notifications.
● Persisted Customizations and Dynamic Configuration
NC enables dynamic configuration and stores user settings, custom dashboards, personalized views and related options, per node, user, room, and space.
IMPORTANT: Netdata Cloud does not centralize your data. Your data is always, and exclusively, on-prem, inside the Netdata Agents you install. Netdata Cloud queries your Netdata Agents in real-time, to present dashboards and alerts.
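A minimal sketch of the fan-out idea behind that horizontal scalability, with illustrative types: each agent is queried for the same time window, and the aligned points are merged on the fly instead of centralizing raw samples:

```c
// Scatter-gather aggregation sketch: merge per-agent responses for the
// same window by combining aligned time slots (sum here; could be
// avg/min/max depending on the query).
#include <stddef.h>

// One agent's response: a value per time slot of the queried window.
typedef struct {
    const double *points;
    size_t n_points;
} agent_response_t;

static void merge_sum(const agent_response_t *responses, size_t n_agents,
                      double *out, size_t n_points) {
    for (size_t t = 0; t < n_points; t++) {
        out[t] = 0.0;
        for (size_t a = 0; a < n_agents; a++)
            if (t < responses[a].n_points)   // tolerate short responses
                out[t] += responses[a].points[t];
    }
}
```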
• 68. Thank You! Costa Tsaousis
GitHub: https://github.com/netdata/netdata