Finding OOMs with Telegraf
Dylan Ferreira
@dylanferreira
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Agenda
● A whole bunch of dry stuff
● Maybe snacks?
Agenda.real
● Our TICK stack layout
● Configuring Rsyslog & Telegraf
● Doing more with Telegraf plugins
● Using Kapacitor to search & sanitize log data
● Exploring this data with Grafana
How did we get here?
● Multi-dimensional TSDB.
● Very high performance.
● All data is aggregated on ingest.
● Can store raw metrics.
● Multi-dimensional TSDB.
● Complex aggregations (CQs).
● Multiple fields!
● Can store raw metrics.
● Flat structure with metadata stored in the metric name (pre-v1).
● Fixed per-series aggregations
What is Telegraf?
“Telegraf is InfluxData's open source
plugin-driven server agent for collecting
and reporting metrics.”
● Inputs
● Processors
● Aggregators
● Outputs
● Stream Buffer
● Metrics Router
TICK Stack Layout
Host Setup
Applications sending log data to syslog
Syslog relaying log data to Telegraf
Single DC Setup
Multi DC Setup
DC1
DC2
Syslog into Telegraf
The OOM Killer
/**
 * oom_badness - heuristic function to determine which candidate task to kill
 * @p: task struct of which task we should calculate
 * @totalpages: total present RAM allowed for page allocation
 * @memcg: task's memory controller, if constrained
 * @nodemask: nodemask passed to page allocator for mempolicy ooms
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible. The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */
Armouring Telegraf Against the OOM Killer
/etc/systemd/system/telegraf.service.d/oom_score_adj.conf
[Service]
OOMScoreAdjust=-600
/proc/$PID/oom_score_adj: value between -1000 and +1000
/proc/$PID/oom_adj: value between -17 and +15 (legacy interface)
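A minimal sketch, not from the talk, of checking which adjustment a running process actually received. The function name `read_oom_score_adj` and the `proc_root` parameter are hypothetical; the parameter exists only so the example can be pointed at a test directory instead of the real /proc.

```python
from pathlib import Path

# Valid range for /proc/<PID>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def read_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Return the oom_score_adj value for a PID, validating its range."""
    raw = Path(proc_root, str(pid), "oom_score_adj").read_text()
    value = int(raw.strip())
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    return value
```

With the drop-in in place, calling this with the Telegraf PID on a Linux host should report the configured value.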
Telegraf Config
[[inputs.syslog]]
server = "tcp://:6514"
[[outputs.influxdb]]
urls = [ "http://influxdb-syslog.service.consul:8086" ]
database = "syslog"
retention_policy = "autogen"
precision = "ns"
user_agent = "telegraf"
namepass = ["syslog"]
Rsyslog Config
$ActionQueueType LinkedList
$ActionQueueFileName telegraf
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
# forward over tcp with octet framing according to RFC 5425
*.* @@(o)localhost:6514;RSYSLOG_SyslogProtocol23Format
# all logs that contain the string "**WARNING**"
if ($msg contains '**WARNING**') then
@@(o)localhost:6514;RSYSLOG_SyslogProtocol23Format
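Octet framing means each message on the TCP stream is prefixed with its length in bytes and a single space. A rough Python sketch of that framing follows; the function names are hypothetical and the sample messages in the usage note are invented for illustration.

```python
from typing import List


def frame_octet_counted(message: bytes) -> bytes:
    """Prefix a syslog message with its byte length and a space."""
    return str(len(message)).encode("ascii") + b" " + message


def split_octet_counted(stream: bytes) -> List[bytes]:
    """Split a buffer of back-to-back octet-counted frames into messages."""
    messages = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)      # end of the length prefix
        length = int(stream[pos:space])      # declared message size in bytes
        start = space + 1
        messages.append(stream[start:start + length])
        pos = start + length                 # next frame starts right after
    return messages
```

Because the receiver knows the exact length, messages may contain newlines without breaking the stream, which is why this framing suits multi-line log lines better than newline-delimited forwarding.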
What do you get?
Telegraf Output
● syslog
○ tags
■ severity (string)
■ facility (string)
■ hostname (string)
■ appname (string)
○ fields
■ version (integer)
■ severity_code (integer)
■ facility_code (integer)
■ timestamp (integer): the time recorded in the syslog message
■ procid (string)
■ msgid (string)
■ sdid (bool)
■ Structured Data (string)
○ timestamp: the time the message was received
Original timestamp
Timestamp given by Telegraf
Timestamps & Ingest Latency
var data = stream
|from()
.database('syslog')
.retentionPolicy('autogen')
.measurement('syslog')
|groupBy('hostname')
|window()
.align()
.period(1m)
.every(1m)
|eval(lambda: unixNano("time") - "timestamp")
.as('timestamp_lag_ns')
var mean_latency = data
|mean('timestamp_lag_ns')
.as('val')
mean_latency
|log()
Typical Latency: under 1ms
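The same lag calculation, sketched in Python for readers who do not run Kapacitor: subtract the timestamp carried in the syslog message from the time Telegraf assigned, then average per host. The function name and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_lag_ns_by_host(points):
    """Average ingest lag per host.

    points: iterable of (hostname, received_ns, message_timestamp_ns).
    """
    lags = defaultdict(list)
    for hostname, received_ns, message_ns in points:
        lags[hostname].append(received_ns - message_ns)
    return {host: mean(values) for host, values in lags.items()}
```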
Getting more from Telegraf
Enable Counters
# Create per-minute counts of syslog data
[[aggregators.basicstats]]
period = "1m"
drop_original = false
stats = ["count"]
name_suffix = "_counts"
# filtering
namepass = ["syslog"]
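To illustrate what a per-minute count means, here is a small Python sketch that buckets nanosecond timestamps into 1-minute windows. It only shows the idea; it does not mirror how the Telegraf aggregator is implemented, and `count_per_minute` is a hypothetical name.

```python
from collections import Counter

NS_PER_MINUTE = 60 * 1_000_000_000


def count_per_minute(timestamps_ns):
    """Count points per 1-minute window, keyed by window start in ns."""
    return Counter((ts // NS_PER_MINUTE) * NS_PER_MINUTE for ts in timestamps_ns)
```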
Adding More Tags From Your Log Data
Telegraf Config
[[processors.parser]]
  parse_fields = ["message"]
  drop_original = false
  merge = "override"
  data_format = "json"
  tag_keys = ["method","uri"]
  namepass = ["syslog"]
  [processors.parser.tagpass]
    appname = ["crazy-cool-api"]
Log Message
{
  "hostname": "ab704bf3336f",
  "ip": "10.0.0.30",
  "method": "GET",
  "msg": "request completed",
  "start": "2018-10-23T07:00:03Z",
  "status": 200,
  "time": "2018-10-23T07:00:03.942327Z",
  "uri": "/health",
  "uri_query": "",
  "user_agent": "Consul Health Check"
}
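A simplified Python model of what this processor step does to one metric: parse the JSON held in the message field, promote the chosen keys to tags, and merge the rest in as fields. This is a sketch under stated assumptions; it does not reproduce Telegraf's handling of value types, and `promote_json_keys` is a hypothetical name.

```python
import json


def promote_json_keys(tags, fields, parse_field="message", tag_keys=("method", "uri")):
    """Return new (tags, fields) after parsing a JSON string field."""
    parsed = json.loads(fields[parse_field])
    new_tags = dict(tags)
    new_fields = dict(fields)  # the original field is kept
    for key, value in parsed.items():
        if key in tag_keys:
            new_tags[key] = str(value)  # tags are always strings
        else:
            new_fields[key] = value     # parsed values override existing fields
    return new_tags, new_fields
```

Keeping tag_keys short matters in practice: every distinct tag value creates a new series, so high-cardinality keys such as a full query string are better left as fields.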
Data Sanitization
Rewriting your data with regexReplace
IP Addresses
var ipv4_address = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
|eval(lambda: regexReplace(ipv4_address, "message", '<ipv4-address>'))
    .as('message')
    .keep()
Email Addresses
var email_address = /\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*/
|eval(lambda: regexReplace(email_address, "message", '<email-address>'))
    .as('message')
    .keep()
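The same two substitutions can be tried outside Kapacitor. The following Python sketch applies equivalent patterns with the standard `re` module; the function name `sanitize` and the sample message in the usage note are invented for illustration.

```python
import re

IPV4_ADDRESS = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
EMAIL_ADDRESS = re.compile(r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")


def sanitize(message: str) -> str:
    """Replace IPv4 and email addresses with fixed placeholders."""
    message = IPV4_ADDRESS.sub("<ipv4-address>", message)
    return EMAIL_ADDRESS.sub("<email-address>", message)
```

For example, `sanitize("login from 10.0.0.30 by jane.doe@example.com failed")` yields a message with both values replaced. The IPv4 pattern is deliberately loose and will also match out-of-range octets such as 999.1.1.1, which is usually acceptable for scrubbing.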
Email Address Regex (RFC 5322)
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[
\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e
-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*
[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|
2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e
-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Email Address Regex / Pain Amplifier
[The slide shows a screen-filling excerpt of a much longer email-address regular expression. The extracted text lost its escape characters and is not reproduced here.]
Finding Events
With Kapacitor
Tick Template
var where_filter = lambda: TRUE
var appname_lambda lambda
var event_type string
var data = stream
|from()
.database('syslog')
.retentionPolicy('autogen')
.measurement('syslog')
|where(lambda: "facility" == 'kern')
|where(where_filter)
|eval(appname_lambda)
.as('appname')
.tags('appname')
.keep()
|eval(lambda: regexReplace(/\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*/, "message", '<email-address>'))
.as('message')
.keep()
|eval(lambda: unixNano("time") - "timestamp")
.as('timestamp_lag_ns')
.keep()
|log()
/etc/kapacitor/load/templates/template-syslog_kern_service_events.tick
Task Definition : OOM
template-id: template-syslog_kern_service_events
vars:
  where_filter:
    type: lambda
    value: >
      "message" =~ /.+ Memory cgroup out of memory: Kill process .+/
  appname_lambda:
    type: lambda
    value: >
      regexReplace(/^\[.+\] Memory cgroup out of memory: Kill process .+? \((.+?)\) score .+/, "message", '$1')
  event_type:
    type: string
    value: "OOM"
/etc/kapacitor/load/tasks/syslog_kern_service_events-OOM.yaml
Task Definition : Segfault
/etc/kapacitor/load/tasks/syslog_kern_service_events-segfault.yaml
template-id: template-syslog_kern_service_events
vars:
  where_filter:
    type: lambda
    value: >
      "message" =~ /.+: segfault at .+/
  appname_lambda:
    type: lambda
    value: >
      regexReplace(/^\[.+\] (.+?)\[.+\]: segfault at .*/, "message", '$1')
  event_type:
    type: string
    value: "segfault"
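To check extraction patterns like these before loading a task, they can be exercised in plain Python. The sketch below assumes kernel messages start with a bracketed timestamp such as `[12345.678901]`; that prefix, the sample messages in the usage note, and the function name `extract_appname` are assumptions for illustration.

```python
import re

OOM_PATTERN = re.compile(
    r"^\[.+\] Memory cgroup out of memory: Kill process .+? \((.+?)\) score .+"
)
SEGFAULT_PATTERN = re.compile(r"^\[.+\] (.+?)\[.+\]: segfault at .*")


def extract_appname(message: str):
    """Return (event_type, appname) for a kernel message, or None."""
    candidates = (("OOM", OOM_PATTERN), ("segfault", SEGFAULT_PATTERN))
    for event_type, pattern in candidates:
        match = pattern.match(message)
        if match:
            return event_type, match.group(1)
    return None
```

Feeding it `"[12345.678901] myapp[1234]: segfault at 0 ip 0000000000400500 sp 00007ffc00000000 error 4"` returns the event type together with the process name taken from before the bracketed PID.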
The Kernel & Process Names
Memory cgroup out of memory: Kill process 3056 (upstart-socket-) score 1057 or sacrifice child
The kernel truncates process names to 15 characters (TASK_COMM_LEN is 16, minus 1 for the terminating NUL)
and stores the result in /proc/<PID>/comm
include/linux/sched.h
/* Task command name length: */
#define TASK_COMM_LEN 16
e.g. upstart-socket-bridge becomes upstart-socket-
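The truncation is easy to reproduce. This Python sketch approximates the stored name by cutting the encoded name at 15 bytes; `kernel_comm` is a hypothetical helper, and it ignores how the kernel derives the name from the executable in the first place.

```python
TASK_COMM_LEN = 16  # from include/linux/sched.h; includes the trailing NUL


def kernel_comm(name: str) -> str:
    """Approximate the name the kernel stores in /proc/<PID>/comm."""
    raw = name.encode("utf-8")[: TASK_COMM_LEN - 1]
    # Drop a multi-byte character that was cut in half by the truncation.
    return raw.decode("utf-8", errors="ignore")
```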
Linkage Workarounds
Prometheus
metric_relabel_configs:
  - source_labels: ['name']
    regex: '([[:ascii:]]{1,15}).*'
    target_label: 'name_short'
    replacement: '$1'
Telegraf
[[processors.regex]]
  [[processors.regex.tags]]
    key = "<tag>"
    pattern = "^([[:ascii:]]{1,15}).*$"
    replacement = "${1}"
    result_key = "short_<tag>"
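Once both sides carry a 15-character short name, events and process metrics can be joined on it. A rough Python sketch of that join follows; Python's `re` has no `[[:ascii:]]` class, so `[\x00-\x7f]` is swapped in, and both function names are hypothetical.

```python
import re

# Same idea as the relabel pattern: keep at most the first 15 ASCII characters.
SHORT_NAME = re.compile(r"^([\x00-\x7f]{1,15}).*$")


def short_name(name: str) -> str:
    """Return the first 15 ASCII characters, or the name unchanged if no match."""
    match = SHORT_NAME.match(name)
    return match.group(1) if match else name


def link_events_to_processes(event_names, process_names):
    """Map each truncated event name to the full process names it could refer to."""
    index = {}
    for full in process_names:
        index.setdefault(short_name(full), []).append(full)
    return {event: index.get(event, []) for event in event_names}
```

The result is a list per event because truncation is lossy: two processes whose names share the first 15 characters cannot be told apart from the kernel log alone.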
Dashboards
Ad-hoc Variable
Group By Using A Custom Variable
GROUP BY time(1m), "[[group]]" fill(0)
Regex Search Using A Constant
WHERE ("message" =~ / $search/)
Annotations
Questions
Dylan Ferreira
@dylanferreira
