Observability tips for HAProxy
Willy Tarreau
(willy@haproxy.org)
Dotscale 2018
2
Definition
Observability:
"In control theory, observability is a measure of how well
internal states of a system can be inferred from
knowledge of its external outputs" (Wikipedia)
In short, for us: WTF is going on?
3
Observability vs monitoring
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
4
Observability is important
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve user experience where it matters
5
The LB as an observation tower
● central place
● in distributed systems, one LB per level; even better
when deployed as a sidecar
● sees multiple targets, eases comparisons
=> draw references
● trusted low-level component with excellent transparency
● many LB decisions actually depend on metrics related
to performance and observability
● already produces logs, which provide long-term references
6
Why not look from other vantage points?
● you should! Especially in distributed systems
(microservices, etc.)!
● but often it's too late when the first incident happens!
● with the LB's existing logs, it's already possible to do a lot
Note: see OpenTracing and Prometheus
7
What does the LB see?
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning
(e.g. conntrack)
● connection slowdowns caused by inefficient firewall
policies (number of rules)
...
8
What does the LB see (…)?
● client-side issues (bandwidth limitations)
● per-URL processing time (application issues, service
partners)
● per-node vs per-cluster variations
=> narrow down to an individual node or a shared resource
● deployment issues: a new occasional error on a specific
page can be addressed before going full-scale
9
Accessing metrics in HAProxy
● Logs:
    ● halog, ELK, Prometheus, …
    ● provide a unique-id for tracing/event correlation
● Stats:
    ● stats page, CLI, hatop
● Stick-tables (per arbitrary key such as IP, URL, cookie):
    ● byte count, cumulative/concurrent conns, errors, …
10
Sequence of events on HAProxy
[Figure built up across slides 10 to 25; image not included in this transcript]
26
More timers to come in HAProxy 1.9
● HAProxy now supports heavier per-request workloads
(Lua, device identification, …)
● Processing times over 200 µs can become noticeable
Actions:
● log per-request total CPU time spent in analysers
● log per-request total CPU time spent in TLS handshake
● log per-request total latency added by other tasks
● Ability to kill offending tasks
● Ability to alert on high latencies
=> make HAProxy as observable as other components
27
Event timing reports
● Timers are averaged in the stats
● Each timer appears in the logs
● Halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1"
Timers: 10/0/30/69/109
Term code: SD
Cookie code: NN
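To make the layout of that log line concrete, here is a minimal Python sketch that pulls the timers, termination code and cookie code out of it. The regular expression only follows the default "option httplog" field order; the function name parse_httplog and the timer names Tq/Tw/Tc/Tr/Tt (recent HAProxy versions document the first and last as TR and Ta) are choices made for this sketch, and it does not handle logasap "+" prefixes or custom log-format strings.

```python
import re

# Sample line copied from the slide (wrapped there, joined here).
LINE = (
    'haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in '
    'static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 '
    '{haproxy.org} {} "GET /index.html HTTP/1.1"'
)

# Field order of the default "option httplog" format (assumed, not custom).
HTTPLOG = re.compile(
    r'(?P<client>\S+) \[(?P<date>[^\]]+)\] (?P<frontend>\S+) '
    r'(?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>-?\d+(?:/-?\d+){4}) '
    r'(?P<status>\d{3}|-1) (?P<bytes>\d+) '
    r'(?P<req_cookie>\S+) (?P<resp_cookie>\S+) '
    r'(?P<term>\S{2})(?P<cookie>\S{2}) '
    r'(?P<conns>\d+(?:/\d+){4}) (?P<queues>\d+/\d+)'
)

TIMER_NAMES = ("Tq", "Tw", "Tc", "Tr", "Tt")


def parse_httplog(line):
    """Return a dict of fields for one httplog line, or None if it does not match."""
    match = HTTPLOG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Split the five slash-separated timers (milliseconds) into named values.
    values = [int(v) for v in fields["timers"].split("/")]
    fields["timers"] = dict(zip(TIMER_NAMES, values))
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    return fields


if __name__ == "__main__":
    parsed = parse_httplog(LINE)
    print(parsed["timers"], parsed["term"], parsed["cookie"])
```

For anything beyond a quick look, halog remains the faster tool; a parser like this is mostly useful for feeding the timers into a graphing or alerting pipeline.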
28
Termination codes
● Distinguish between timeout and abort
● Indicate who (client, server, haproxy, kill, ...)
● Indicate when (req, queue, connect, response, ...)
● Complemented by persistence cookie indications
● Filtered and sorted by halog :
# halog -tcn|-TCN ... # for filtering
# halog -tc # for sorting
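The two-character termination code can be decoded with a small lookup table. The Python sketch below is deliberately partial: the meanings are paraphrased from the session-state section of the HAProxy documentation, only the most common letters are listed, and the names WHO, WHEN and describe are invented for this example.

```python
# First character: who ended the session, and whether it was an abort or a timeout.
WHO = {
    "C": "aborted by the client",
    "S": "aborted or refused by the server",
    "P": "aborted by the proxy (haproxy)",
    "K": "killed by an operator",
    "c": "client-side timeout",
    "s": "server-side timeout",
    "-": "normal completion",
}

# Second character: the step the session was in when it ended.
WHEN = {
    "R": "waiting for the request",
    "Q": "waiting in the queue",
    "C": "connecting to the server",
    "H": "waiting for response headers",
    "D": "during data transfer",
    "L": "sending last data to the client",
    "-": "after full completion",
}

UNKNOWN = "other (see the HAProxy documentation)"


def describe(term_code):
    """Describe the first two characters of a termination state such as 'SD'."""
    who = WHO.get(term_code[0], UNKNOWN)
    when = WHEN.get(term_code[1], UNKNOWN)
    return f"{who}, {when}"


if __name__ == "__main__":
    for code in ("SD", "cR", "sH", "--"):
        print(code, "=>", describe(code))
```

Upper-case C and S denote aborts while lower-case c and s denote timeouts, which is how the code distinguishes the two cases named on the slide.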
29
Other relevant metrics : HTTP status distribution
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations
between application deployments
30
Other relevant metrics : queue length
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
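One possible reading of "tells you how many extra servers you need", sketched in Python: if every server is already at its maxconn, each queued request is one unit of missing capacity, so dividing the queue length by the per-server maxconn gives a rough server count. This is a back-of-the-envelope heuristic assumed for illustration, not a formula from the talk, and the function name is hypothetical.

```python
import math


def extra_servers_needed(queued_requests, server_maxconn):
    """Rough number of additional servers needed to absorb the current queue.

    Assumes all existing servers are saturated at server_maxconn and that new
    servers would be given the same maxconn.
    """
    if server_maxconn <= 0:
        raise ValueError("server_maxconn must be positive")
    return math.ceil(queued_requests / server_maxconn)


if __name__ == "__main__":
    # Hypothetical numbers: 250 requests queued, maxconn 100 per server.
    print(extra_servers_needed(250, 100))
```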
31
Other relevant metrics : LB fairness
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates
abnormally slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
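The two fairness checks above can be sketched in Python as a comparison against the farm average. The function names, the input dictionaries and the factor values are assumptions for illustration; in practice the talk's advice is to graph the values and look for servers that stand out rather than to rely on a fixed threshold.

```python
from statistics import mean


def slow_servers_roundrobin(concurrency, factor=1.5):
    """With roundrobin, request counts are equal, so a slow server shows up as
    higher than average concurrency. Returns servers above factor * average."""
    average = mean(concurrency.values())
    return sorted(s for s, c in concurrency.items() if c > factor * average)


def slow_servers_leastconn(request_counts, factor=0.5):
    """With leastconn, load is equal, so a slow server shows up as a lower than
    average request count. Returns servers below factor * average."""
    average = mean(request_counts.values())
    return sorted(s for s, n in request_counts.items() if n < factor * average)


if __name__ == "__main__":
    # Hypothetical per-server values, as could be read from the stats page.
    print(slow_servers_roundrobin({"srv1": 10, "srv2": 11, "srv3": 40}))
    print(slow_servers_leastconn({"srv1": 1000, "srv2": 980, "srv3": 300}))
```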
32
Other relevant metrics : error rate
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
33
Useful entries in log-format
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1"
Timers: 10/0/30/69/109
HTTP status: 200
Byte count: 2750
Term code: SD
Cookie code: NN
Conn count: 1/1/1/1/0
Queue length: 0/0
34
Tips: sampling : why / when
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s
without sweating
● that's 1.7B events/day or 350GB of uncompressed
haproxy logs/day
● compresses to 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic / not interested in this level of detail ?
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
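The storage figures on this slide can be checked with a few lines of Python. The inputs are the slide's own numbers; the 30-day month is an assumption, and the per-event size and compression ratio are simply what those numbers imply.

```python
EVENTS_PER_SECOND = 20_000              # from the slide
SECONDS_PER_DAY = 86_400
UNCOMPRESSED_BYTES_PER_DAY = 350e9      # 350 GB/day, from the slide
COMPRESSED_BYTES_PER_MONTH = 1e12       # 1 TB/month, from the slide
DAYS_PER_MONTH = 30                     # assumption
SAMPLING_PERCENT = 5                    # matches rand(100) -lt 5

# 20k events/s over a full day: about 1.7 billion events.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY

# Average size of one log line implied by 350 GB/day.
bytes_per_event = UNCOMPRESSED_BYTES_PER_DAY / events_per_day

# Compression ratio implied by 1 TB/month of compressed logs.
uncompressed_per_month = UNCOMPRESSED_BYTES_PER_DAY * DAYS_PER_MONTH
compression_ratio = uncompressed_per_month / COMPRESSED_BYTES_PER_MONTH

# Daily uncompressed volume if only 5% of requests are logged.
sampled_bytes_per_day = UNCOMPRESSED_BYTES_PER_DAY * SAMPLING_PERCENT / 100

if __name__ == "__main__":
    print(f"events/day:        {events_per_day:,}")
    print(f"bytes/event:       {bytes_per_event:.0f}")
    print(f"compression ratio: {compression_ratio:.1f}:1")
    print(f"5% sample:         {sampled_bytes_per_day / 1e9:.1f} GB/day")
```

So the slide's numbers imply roughly 200 bytes per log line and about 10:1 compression, and 5% sampling brings the uncompressed volume down to 17.5 GB/day.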
35
Tips: selective logging: why / when
● you only want to catch suspicious events
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
● WARNING: you'll lose any valid reference
36
Tips: other halog goodies
● Poorly documented, use halog --help
● response time per url: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB per second)
=> Use it in production to figure out the relevant metrics
37
Success stories
Customer spotting a broken fiber between two core switches
● Tc from HA1 to srv 1,2,3,5 always low; srv 4,6 high at the 99th percentile
● Tc from HA2 to srv 1,2,4,6 always low; srv 3,5 high at the 99th percentile
=> neither haproxy nor the servers are at fault
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently the cause, but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, the origin of the problem could even be identified
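The kind of comparison used in this story, the 99th percentile of Tc per (haproxy, server) path, can be sketched in Python as below. The function names and the input shape are assumptions, the percentile uses the simple nearest-rank method, and the sample values in the usage part are synthetic, not the customer's data.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, pct given as an integer."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tc_p99_by_path(samples):
    """Group (lb, server, tc_ms) samples by path and return the p99 of Tc for each."""
    grouped = defaultdict(list)
    for lb, server, tc_ms in samples:
        grouped[(lb, server)].append(tc_ms)
    return {path: percentile(tcs, 99) for path, tcs in grouped.items()}


if __name__ == "__main__":
    # Synthetic data: one healthy path, one path where 2% of connects are slow.
    samples = (
        [("HA1", "srv1", 1)] * 100
        + [("HA1", "srv4", 1)] * 98
        + [("HA1", "srv4", 3000)] * 2
    )
    for path, p99 in sorted(tc_p99_by_path(samples).items()):
        print(path, p99)
```

Averages would hide a path where only a small share of connections is affected, which is why the percentile per path, rather than per server alone, is what exposes a pattern like the one above.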
38
Success stories
Customer tracking down a web server misconfigured to use
/dev/random
● Tc abnormally high, with many random values of up to several
seconds, and only for TLS
● the timer also covers the TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> the system was regularly running out of entropy because it
mistakenly used /dev/random as the random source for SSL
Conclusion
● exploit your stats
● enable logs on LBs, no excuse for not doing it!
● process them automatically, and manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are noticed
● profit :-)
40
Interesting reading
● https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
● https://www.vividcortex.com/blog/monitoring-isnt-observability
● http://opentracing.io/documentation/
● https://prometheus.io/

