Tim E. Hall @thallinflux
VP, Products InfluxData
Monitoring InfluxEnterprise
Discussion Topics
• Background
• Gathering Metrics...and Logs
• Visualization, Monitoring, and Alerting
• Troubleshooting Scenarios
From development to production
• Change is required
• Establish monitoring baselines
• Ensure visibility into health of the system
• Notifications for most common issues,
before they become outages
From OSS to Enterprise
[Diagram: a single InfluxDB OSS instance compared with an InfluxDB Enterprise cluster made up of three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2)]
https://docs.influxdata.com
Gathering Metrics…and Logs
Deploy Telegraf on all nodes (meta and data)
Enabling these plugins measures the KPIs routinely associated with infrastructure and database
performance and provides a good starting point for monitoring.
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load, uptime, and number of users logged in
3. Processes: gathers the number of processes, grouped by status
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: Network related metrics
8. http_response: set up a local ping
9. filestat: Files to gather stats about (meta node only)
10. InfluxDB: gather stats from the InfluxDB Instance. (data node only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gather Telegraf related stats
4. Docker: if deployed in containers
But where should these metrics land?
• You’ve got lots of options
– Typical recommendation: use an Open Source instance as the “watcher
of the watchers”
• If there are a small number of clusters that need to be monitored this is the easiest,
simplest way to go
– Other options that can be considered:
• 2 instances -- monitor each other
• Separate by environment -- and eliminate the environment global tag in the Telegraf
config
• Unleash your creativity…
Key Point
– Production InfluxDB instances
should not monitor themselves
– WHY?
• Because…visibility is lost if the
database is unreachable, for any
reason.
[monitor]
store-enabled = false
Telegraf Configuration: Global
[global_tags]
cluster_id = "$CLUSTER_ID"
environment = "$ENVIRONMENT"
[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
metric_batch_size = 1000
collection_jitter = "0s"
flush_interval = "30s"
flush_jitter = "30s"
debug = false
hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators can enable or disable plugins and options by
adding or commenting out the corresponding sections.
Global tags can be specified in the [global_tags]
section of the config file in key="value" format. Use
a GUID that uniquely identifies each cluster, and
ensure that the CLUSTER_ID and ENVIRONMENT environment
variables are set consistently on all hosts (meta and data).
Optionally, add other tags if desired. Example: dev, prod for environment.
Agent Configuration: recommended config settings
for InfluxDB data collection. Adjust the interval and
flush_interval based on:
● how quickly you need to observe changes (“speed of observability”)
● retention policy for the data
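As a rough illustration of how the agent settings above interact, the following Python sketch estimates how many points each series produces per day at a given interval, and how long the agent's buffer can absorb an unreachable output before the oldest metrics are dropped. The figure of 400 metrics per collection interval is a made-up assumption for one host, not a measured value, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the [agent] settings shown above.
# The 400 metrics-per-interval figure used below is an assumption for
# illustration only; measure your own hosts before relying on it.

SECONDS_PER_DAY = 86400


def points_per_day(interval_seconds):
    """Points written per series per day at the given collection interval."""
    return SECONDS_PER_DAY // interval_seconds


def buffer_headroom_seconds(metric_buffer_limit, metrics_per_interval, interval_seconds):
    """Seconds of output outage the buffer can absorb before it is full."""
    metrics_per_second = metrics_per_interval / interval_seconds
    return metric_buffer_limit / metrics_per_second


if __name__ == "__main__":
    # interval = "10s", metric_buffer_limit = 10000 (from the slide)
    print(points_per_day(10))                       # 8640
    print(buffer_headroom_seconds(10000, 400, 10))  # 250.0
```

Under these assumptions, halving the interval doubles the stored points and halves the outage the buffer can ride out, which is one way to reason about the trade-off between speed of observability and retention.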
Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = ["usage_idle",
"usage_user", "usage_system",
"usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
Input Configuration items include grabbing metrics
from the various infrastructure, database, and
system components in play.
For the other plug-ins, default config is sufficient.
Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
interval = "15s"
urls = ["http://<localhost>:8086/debug/vars"]
timeout = "15s"
[[inputs.http_response]] #DATA
address = "http://<localhost>:8086/ping"
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
The influxdb input collects all metrics from the
exposed /debug/vars endpoint.
http_response allows you to ping
individual data nodes and track
response output.
You can also set up a separate Telegraf
agent elsewhere within your
infrastructure to ping the available
cluster(s) through the load balancer.
disk allows you to configure the
various volumes/mount points on
disk -- locations of data, wal, hinted
handoff -- and root. (default config
options shown)
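To make the ping check concrete, here is a minimal Python sketch of the same idea outside Telegraf. It assumes the data node answers /ping with HTTP 204 (No Content), as InfluxDB 1.x does; the helper names and the 5-second timeout are hypothetical choices, and the URL is whatever address you would put in the http_response input.

```python
import urllib.error
import urllib.request


def ping_ok(status_code):
    """A data node is treated as healthy when /ping returns 204 No Content."""
    return status_code == 204


def ping(url, timeout=5.0):
    """Return True if the node answers /ping with 204, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ping_ok(resp.status)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    # Replace with the address of a data node or the load balancer.
    print(ping("http://localhost:8086/ping"))
```

Running a check like this from a host outside the cluster mirrors the suggestion to ping through the load balancer, since it exercises the same path clients use.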
Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] #META
address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
md5 = false
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping
individual meta nodes and track response
output.
filestat allows you to monitor metadata
snapshots.
disk allows you to configure the
various volumes/mount points on
disk -- locations of meta store -- and
root. (default config options shown)
Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
urls = [ "<target URL of DB>" ]
database = "telegraf"
retention_policy = "autogen"
timeout = "10s"
username = "<uname>"
password = "<pword>"
content_encoding = "gzip"
Output Configuration tells Telegraf which
output sink to send the data to. Multiple
output sinks can be specified in the
configuration file.
** NOTE: This should point to the load
balancer if you are storing the metrics in a
cluster.
Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
# Drop all measurements that start with "syslog"
namedrop = [ "syslog*" ]
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
retention_policy = "14days"
# Only accept syslog data:
namepass = [ "syslog*" ]
Output Configuration: use
namepass/namedrop to
direct metrics and logs to
different db.rp targets.
** NOTE: This should point to
the load balancer if you are
storing the metrics in a
cluster.
Input Configuration: add the
syslog input plug-in.
Review the settings for
your environment.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
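The routing effect of namepass and namedrop can be sketched in a few lines of Python. This is an illustration of the filtering idea, not Telegraf's implementation: it assumes glob-style matching, applies namepass before namedrop, and uses hypothetical target labels for the two outputs in the config above.

```python
from fnmatch import fnmatchcase

# Hypothetical labels for the two [[outputs.influxdb]] sections above.
OUTPUTS = {
    "telegraf (default RP)": {"namedrop": ["syslog*"]},
    "telegraf.14days": {"namepass": ["syslog*"]},
}


def accepts(name, namepass=(), namedrop=()):
    """Return True if a measurement name passes the output's name filters."""
    # With namepass set, only names matching one of its patterns are kept.
    if namepass and not any(fnmatchcase(name, p) for p in namepass):
        return False
    # Any name matching a namedrop pattern is discarded.
    return not any(fnmatchcase(name, p) for p in namedrop)


def route(name):
    """List the output targets that would receive this measurement."""
    return [target for target, filters in OUTPUTS.items() if accepts(name, **filters)]


if __name__ == "__main__":
    for measurement in ("cpu", "disk", "syslog"):
        print(measurement, "->", route(measurement))
```

Because the two filters are complementary, every measurement lands in exactly one target: logs go to the short retention policy and metrics go to the default one.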
Visualization, Monitoring, Alerting
We’ve gathered a wide variety of metrics...so now what?
• Dashboards!
Alerting: Common Metrics to Watch
• Disk Usage
• Hinted Handoff Queue
• No metrics…. aka Deadman
Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
|query('''
SELECT last(used_percent) AS used_percent
FROM "telegraf"."autogen"."disk"
WHERE ("host" =~ /prod-.*/)
AND ("path" = '/var/lib/influxdb/data'
OR "path" = '/var/lib/influxdb/wal'
OR "path" = '/var/lib/influxdb/hh'
OR "path" = '/')
''')
.period(5m)
.every(10m)
.groupBy('host', 'role', 'environment', 'device')
Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
|alert()
.id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
.message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: {{ index .Fields "used_percent" }}%')
.warn(lambda: "used_percent" > warn_threshold)
.crit(lambda: "used_percent" > critical_threshold)
.slack()
.channel('#monitoring')
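For readers who want to sanity-check the thresholds on a single host, the following Python sketch mirrors the warn/crit logic of the alert. It is an approximation under stated assumptions: the percentage is computed as used/total from the standard library, which may differ slightly from the used_percent value Telegraf reports, and the function names are hypothetical.

```python
import shutil

WARN_THRESHOLD = 85      # same value as warn_threshold in the TICKscript
CRITICAL_THRESHOLD = 95  # same value as critical_threshold


def used_percent(path):
    """Approximate disk usage of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def level(percent):
    """Map a usage percentage to an alert level using strict '>' comparisons."""
    if percent > CRITICAL_THRESHOLD:
        return "CRITICAL"
    if percent > WARN_THRESHOLD:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    pct = used_percent("/")
    print("/ is at %.1f%% -> %s" % (pct, level(pct)))
```

Note that a value of exactly 85 stays at OK, because both the sketch and the alert lambdas compare with a strict greater-than.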
Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
|query('''
SELECT max(queueBytes) as "max"
FROM "telegraf"."autogen"."influxdb_hh_processor"
WHERE ("host" =~ /prod-.*/)
''')
.groupBy('host', 'cluster_id')
.period(5m)
.every(10m)
|eval(lambda: "max" / 1048576.0)
.as('queue_size_mb')
Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
|alert()
.id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
.message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
.details('')
.warn(lambda: "queue_size_mb" > warn_threshold)
.crit(lambda: "queue_size_mb" > crit_threshold)
.stateChangesOnly()
.slack()
.pagerDuty()
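The two ideas in this task, converting queueBytes to megabytes and notifying only when the level changes, can be sketched in Python as follows. This is an illustration rather than Kapacitor's implementation; it assumes the alert starts in the OK state, and the sample queue sizes in the usage block are invented.

```python
BYTES_PER_MB = 1048576.0  # same divisor as the eval node in the TICKscript
WARN_MB = 3500
CRIT_MB = 5000


def queue_size_mb(queue_bytes):
    """Convert a hinted-handoff queue size from bytes to megabytes."""
    return queue_bytes / BYTES_PER_MB


def level(size_mb, warn=WARN_MB, crit=CRIT_MB):
    """Map a queue size in MB to an alert level."""
    if size_mb > crit:
        return "CRITICAL"
    if size_mb > warn:
        return "WARNING"
    return "OK"


def state_changes_only(levels, initial="OK"):
    """Yield a level only when it differs from the previous one."""
    previous = initial
    for current in levels:
        if current != previous:
            yield current
        previous = current


if __name__ == "__main__":
    # Invented samples: 1024 MB, 4000 MB, 4000 MB, 6000 MB, 100 MB
    samples = [int(mb * BYTES_PER_MB) for mb in (1024, 4000, 4000, 6000, 100)]
    levels = [level(queue_size_mb(b)) for b in samples]
    print(list(state_changes_only(levels)))
```

Suppressing repeated levels keeps Slack and PagerDuty quiet while a queue stays above a threshold, and still reports the recovery when it drains.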
Deadman Batch Task: TICKscript
// Ensure hosts are running. If no CPU usage statistics can be retrieved
// We assume the host has locked up, disappeared or is otherwise unreachable
var cpu_stats = batch
|barrier().idle(5m)
|query('''
SELECT count(usage_system)
FROM "telegraf"."autogen"."cpu"
WHERE ("host" =~ /prod-.*/)
''')
.period(5m)
.every(10m)
.groupBy('cluster_id', 'host')
Deadman Alert: TICKscript
var trigger = cpu_stats
|deadman(0.0, 10m)
.id('Host: {{ index .Tags "host" }}, Cluster ID: {{ index .Tags "cluster_id" }}')
.message('Alert: Kapacitor Deadman, Level: {{ .Level }}, {{ .ID }}')
.idTag('alertID')
.messageField('message')
.durationField('duration')
.levelTag('level')
.stateChangesOnly()
.slack()
.channel('#monitoring')
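The deadman condition itself is simple: a host is considered dead when the number of points seen in the interval is at or below the threshold. The Python sketch below illustrates that rule with a threshold of 0.0, as in deadman(0.0, 10m). The list of expected hosts and the host names are hypothetical additions for the example; they are not part of the TICKscript.

```python
def deadman_level(points_in_interval, threshold=0.0):
    """CRITICAL when throughput is at or below the threshold, otherwise OK."""
    return "CRITICAL" if points_in_interval <= threshold else "OK"


def check_hosts(counts_by_host, expected_hosts, threshold=0.0):
    """Evaluate the deadman rule for every expected host.

    A host that is missing from `counts_by_host` is treated as having
    reported zero points in the interval.
    """
    return {
        host: deadman_level(counts_by_host.get(host, 0), threshold)
        for host in expected_hosts
    }


if __name__ == "__main__":
    # Hypothetical point counts per host over the last interval.
    counts = {"prod-data-1": 30, "prod-data-2": 0}
    expected = ["prod-data-1", "prod-data-2", "prod-meta-1"]
    print(check_hosts(counts, expected))
```

Treating a missing host the same as a host that reported zero points is the design choice that makes the check useful: silence is the failure signal.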
Deadman Evaluate & Visualize Alert in Chronograf: TICKscript
trigger
|eval(lambda: "emitted")
.as('value')
.keep('value', 'message', 'duration')
|eval(lambda: float("value"))
.as('value')
.keep()
|influxDBOut()
.create()
.database('chronograf')
.retentionPolicy('autogen')
.measurement('alerts')
.tag('alertName', 'Deadman')
.tag('triggerType', 'deadman')
For Chronograf
Troubleshooting
Common Troubleshooting Scenarios
• OOM Loop
• Runaway Series Cardinality
Common Troubleshooting Scenarios
Workload Type
• Which type are you?
– Read heavy
– Write heavy
– Mixed?
– Establish baselines and
understand “normal”
using metrics and
visualization
– Baselines allow you to
understand change over
time and help determine
when it is time to scale up
Log Analysis
• Metrics First!
– Highlights where you
should look within the
log files
• Logs allow you to pinpoint
the root cause of an
issue observed in the
metrics
– Cache max memory size
– Hinted Handoff Queue
“Blocked”
IOPS & Disk Throughput
• Understand the
capabilities of your
hardware
– We recommend SSD-based
deployments
• Deploying in an IaaS
environment?
– Understand max read
and write limits based
on machine class and
drive types – these can
change as you scale!
Recap
• Gather Metrics...and Logs
• Visualize, Monitor, and Alert… tune based on your environment
• Review Common Troubleshooting Scenarios
https://community.influxdata.com https://docs.influxdata.com
Thank You
