WWW.HAPROXY.COM
Observability with HAProxy
● Introduction to “observability”
● Deep dive in HAProxy
● Tips (to get more from HAProxy)
● RoX (Return On eXperience)
● Conclusion (because we need one)
Agenda
Introduction
Observability:
"In control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs."
(Wikipedia)
In short, for us: WTF is going on?
Definition
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
Observability vs monitoring
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve user experience where it matters
Observability is crucial
● central place
● in distributed systems, one LB per layer; even better as a
sidecar
● sees multiple targets, eases comparisons
=> draw references
● Trusted low level component & excellent transparency
● many LB decisions actually depend on metrics related to
performance and observability
● again, logs provide long-term references
The LB as an observation tower
● Maintains two types of TCP connections
○ From client to HAProxy (1)
○ From HAProxy to server (2)
○ Only has access to the data payload (not the “packet”)
LB in Reverse-proxy mode
Client --(1)--> HAProxy --(2)--> Server
● you should! Especially in distributed systems (microservices,
etc)!
● but often it's too late when the first incident happens!
● with existing LB's logs, it's already possible to do a lot
Note: check OpenTracing and Prometheus
Why not look from other vantage points?
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning (e.g.
conntrack)
● connection slowdowns caused by inefficient firewall policies
(number of rules)
● Non-exhaustive list...
What does the LB see?
● client-side issues (bandwidth limitations)
● per-URL processing time (application issues, service partners)
● per-node vs per-cluster variations
=> narrow down to individual node or shared resource
● deployment issues : new occasional error on a specific page,
can be addressed before going full-scale
What does the LB see?
Deep dive in HAProxy
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1"
I need help !!!
● Logs :
○ halog, ELK, Splunk, Datadog…
○ Provides unique-id for tracing/event correlation
● Statistics :
○ Stats page, CLI, hatop
○ Prometheus, statsd, ...
● Stick-tables (per arbitrary key like IP, URL, cookie, ...) :
○ Byte count, cumulated/concurrent conns, errors, rates…
Accessing metrics in HAProxy
Sequence of events in HAProxy
● HAProxy now supports heavier per-request workloads (Lua,
device identification, …)
● Processing times over 200 µs can become noticeable
● Actions:
○ log per-request total CPU time spent in analysers
○ log per-request total CPU time spent in TLS handshake
○ log per-request total latency added by other tasks
○ Ability to kill offending tasks
○ Ability to alert on high latencies
=> make HAProxy as observable as other components
And more to come in HAProxy 1.9 (Nov. 2018)
● Timers are averaged in the stats
● Each timer appears in the logs
● halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1"
Highlighted fields: timers (10/0/30/69/109), term code (SD), cookie code (NN)
Event timing reports
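To make the highlighted fields concrete, here is a minimal Python sketch that splits the sample line into its parts. The regular expression is fitted to the sample on the slide and is an assumption, not HAProxy's own parser; the timer names follow the request/queue/connect/response order that the halog description suggests, with the fifth value taken as the total. `parse_httplog` is a hypothetical helper name.

```python
import re

# Pattern fitted to the sample line on the slide (default HTTP log layout).
# A custom log-format would need a different pattern.
LINE_RE = re.compile(
    r'(?P<client>\S+) \[(?P<date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>-?\d+(?:/-?\d+){4}) (?P<status>-?\d+) (?P<bytes>\d+) '
    r'\S+ \S+ (?P<term>\S{4}) '
    r'(?P<conns>\d+(?:/\d+){4}) (?P<queues>\d+/\d+) '
    r'(?:\{[^}]*\} )*"(?P<request>[^"]*)"'
)

# Assumed meaning of the five slash-separated timers, in milliseconds.
TIMER_NAMES = ("request", "queue", "connect", "response", "total")


def parse_httplog(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    timers = dict(zip(TIMER_NAMES, map(int, m["timers"].split("/"))))
    return {
        "client": m["client"],
        "frontend": m["frontend"],
        "backend": m["backend"],
        "server": m["server"],
        "timers_ms": timers,
        "status": int(m["status"]),
        "bytes": int(m["bytes"]),
        "term_code": m["term"][:2],    # first two characters
        "cookie_code": m["term"][2:],  # last two characters
        "request": m["request"],
    }


SAMPLE = ('haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in '
          'static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 '
          '{haproxy.org} {} "GET /index.html HTTP/1.1"')

fields = parse_httplog(SAMPLE)
print(fields["server"], fields["timers_ms"], fields["term_code"])
```

For real volumes halog is the better tool; a parser like this is mostly useful for feeding a few fields into a notebook or a graph.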
● Who, when, why and how the session has been terminated
● Distinguish between timeout and abort
● Indicate who (client, server, haproxy, kill, ...)
● Indicate when (req, queue, connect, response, data...)
● Completed by persistence cookie indications
● Filtered and sorted by halog :
halog -tcn|-TCN ... # for filtering
halog -tc # for sorting
=> Graphing the number of errors reported per second helps
detect that an issue is in progress somewhere
Termination codes
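As an illustration of the "who" and "when" reading, the following Python sketch decodes the first two characters of a termination state such as SDNN. The lookup tables are a partial subset written from memory of the HAProxy documentation, so treat them as an assumption and check the "session state at disconnection" section for your version; `describe_termination` is a hypothetical helper name.

```python
# Partial tables: first character = who/what ended the session,
# second character = the step the session was in at that moment.
CAUSE = {
    "C": "aborted by the client",
    "S": "aborted or refused by the server",
    "P": "aborted by the proxy",
    "K": "killed by an admin",
    "c": "client-side timeout",
    "s": "server-side timeout",
    "-": "normal completion",
}
PHASE = {
    "R": "waiting for the request",
    "Q": "waiting in the queue",
    "C": "connecting to the server",
    "H": "waiting for response headers",
    "D": "data transfer",
    "-": "normal completion",
}


def describe_termination(state):
    """Decode the first two characters of a termination state."""
    if len(state) < 2:
        raise ValueError("termination state needs at least two characters")
    who, when = state[0], state[1]
    if who in ("c", "s"):
        kind = "timeout"   # lowercase letters denote an expired timer
    elif who == "-":
        kind = "normal"
    else:
        kind = "abort"
    return {
        "cause": CAUSE.get(who, "unknown cause"),
        "phase": PHASE.get(when, "unknown phase"),
        "kind": kind,
    }


print(describe_termination("SDNN"))
```

Counting the `kind` values per second gives the error graph suggested on the slide.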
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations between
application deployments
HTTP status code distribution
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
Other relevant metrics: queues
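One way to read "tells you how many extra servers you need" is to divide the queue length by the per-server maxconn. The sketch below does only that arithmetic; the rule of thumb is an assumption about what the slide means, and `extra_servers_needed` is a hypothetical helper name.

```python
import math


def extra_servers_needed(queued_requests, server_maxconn):
    """Servers required to absorb the current queue at maxconn each."""
    if server_maxconn <= 0:
        raise ValueError("server_maxconn must be positive")
    return math.ceil(queued_requests / server_maxconn)


# Example: 250 requests queued in a backend whose servers have maxconn 100.
print(extra_servers_needed(250, 100))
```

This treats a single queue snapshot as steady state; averaging the queue over a period is likely to give a more useful figure.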
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates abnormally
slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
Other relevant metrics: LB fairness
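The roundrobin case can be sketched in a few lines of Python: with equal request counts, a server holding far more concurrent connections than its peers is probably slow. The factor of 2 over the farm median is an arbitrary assumption for illustration, and `abnormally_busy` is a hypothetical helper name.

```python
from statistics import median


def abnormally_busy(concurrency, factor=2.0):
    """Servers whose concurrent connections exceed factor x the farm median.

    concurrency maps server name -> current concurrent connections.
    """
    baseline = median(concurrency.values())
    return sorted(s for s, c in concurrency.items() if c > factor * baseline)


farm = {"srv1": 10, "srv2": 12, "srv3": 11, "srv4": 40}
print(abnormally_busy(farm))
```

For leastconn the same comparison would be made on request counts instead, flagging the server that falls well below the median.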
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
Other relevant metrics: Error rate
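"No threshold, watch for variations" can be sketched as comparing per-key error rates between two time windows. In this Python sketch the definition of an error (status 500 or above) is an assumption, not necessarily what halog -e counts, and both function names are hypothetical.

```python
from collections import Counter


def error_rate_by_key(records, is_error=lambda status: status >= 500):
    """records: iterable of (key, http_status); returns key -> error fraction."""
    totals, errors = Counter(), Counter()
    for key, status in records:
        totals[key] += 1
        if is_error(status):
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def biggest_change(previous, current):
    """Key whose error rate moved the most between two windows."""
    keys = set(previous) | set(current)
    return max(keys, key=lambda k: abs(current.get(k, 0.0) - previous.get(k, 0.0)))


before = error_rate_by_key([("srv1", 200), ("srv1", 200), ("srv2", 200), ("srv2", 503)])
after = error_rate_by_key([("srv1", 503), ("srv1", 503), ("srv2", 200), ("srv2", 503)])
print(before, after, biggest_change(before, after))
```

The key can be a server, a client IP or a URL, matching the halog breakdowns listed on the slide.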
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1"
Highlighted fields: timers, HTTP status, byte count, term code, cookie code, conn count, queue length
Useful entries in log-format
Tips !!!!
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s without
sweating
● that's 1.7B events/day or 350GB of uncompressed haproxy
logs/day
● compresses to about 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic, or not interested in this level of detail? Sample:
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
Sampling: why / When
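The figures on this slide can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted above, plus one assumption: a 30-day month. The derived values (bytes per event, compression ratio) are implied by the slide rather than stated on it.

```python
EVENTS_PER_SECOND = 20_000                  # from the slide
SECONDS_PER_DAY = 24 * 60 * 60
UNCOMPRESSED_BYTES_PER_DAY = 350 * 10**9    # 350 GB, from the slide
COMPRESSED_BYTES_PER_MONTH = 10**12         # 1 TB, from the slide
DAYS_PER_MONTH = 30                         # assumption

events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
bytes_per_event = UNCOMPRESSED_BYTES_PER_DAY / events_per_day
compression_ratio = (UNCOMPRESSED_BYTES_PER_DAY * DAYS_PER_MONTH
                     / COMPRESSED_BYTES_PER_MONTH)
# With the 5% sampling rule shown above, only this many events are logged:
sampled_events_per_day = events_per_day * 5 // 100

print(events_per_day, round(bytes_per_event), compression_ratio,
      sampled_events_per_day)
```

So the slide assumes roughly 200 bytes per log line and about a 10:1 compression ratio.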
● you only want to catch suspicious events
● enable logs per URL, source IP, etc.
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● enable on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
WARNING: you'll lose any valid reference for comparison
Selective logging: why / When
● Poorly documented; use halog --help
● response time per URL: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB of log file per second)
=> Use it in production to figure out the relevant metrics
Other halog goodies
RoX
(Return On eXperience)
● Many ops and devs do this every day all around the world
● Users are complaining that the application is slow
● The DevOps team checks HAProxy logs (using halog or ELK) in order to isolate
potential sources of the issue:
○ Check server Tc
○ Sort servers by response time
○ Sort URLs by response time
● HAProxy won’t fix the problem (in most cases), but will drastically reduce the
troubleshooting time by isolating the potential issues
Web application is slow!!!!!!!!
● Retail customer, with shops everywhere in France
● Cash register reports the error message “Unable to reach central
system” when searching for people in the loyalty card database, after waiting up to
10s.
● Argument between customer and “central system” software provider:
○ Customer: your software is slow
○ Provider: your network triggers this error
● After installing HAProxy, its logs reported the following:
○ Time to get the query from the client: 300ms (from North of France to Paris)
○ Server response time: 50s
● Conclusion: provider turned off debug mode in the “central system” software
Unavailable central system ?!?
● Customer with big TLS traffic, mostly API based
● Each HAProxy server can’t go higher than 20K HTTPS req/s (in 2015), even though
there seem to be a lot of unused resources on the box
● At first, this was qualified as an “HAProxy” issue (including the HW and OS
itself)
● Hard tuning the box did not improve anything
● halog Tc percentiles reported higher values when the load increased (up to
30ms, on the LAN!!!)
● After asking the customer to double-check the Spanning Tree, it appeared that a small
24-port ToR switch had become the root bridge
● Fixing spanning tree also fixed HAProxy’s TLS performance
HAProxy TLS performance issue
● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct
● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct
=> both haproxy and servers ruled out as the cause
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently the cause, but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, origin could even be identified
Spotting a broken fiber channel between 2
core switches
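The comparison in this case boils down to a 99th percentile of Tc per (load balancer, server) pair. A minimal Python sketch, assuming the connect times have already been extracted from the logs as (lb, server, tc_ms) tuples; the nearest-rank percentile method and both function names are choices made for this illustration, not what halog does internally.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def p99_connect_time(samples):
    """samples: iterable of (lb, server, tc_ms); returns pair -> p99 Tc."""
    grouped = defaultdict(list)
    for lb, server, tc_ms in samples:
        grouped[(lb, server)].append(tc_ms)
    return {pair: percentile(tcs, 99) for pair, tcs in grouped.items()}


# Made-up data: one healthy pair, one pair with occasional 3 s connects.
samples = ([("HA1", "srv1", 1)] * 100
           + [("HA1", "srv4", 1)] * 98 + [("HA1", "srv4", 3000)] * 2)
print(p99_connect_time(samples))
```

Laying the results out as a load balancer by server grid makes the pattern on the slide visible: the slow pairs differ per load balancer, which points at the path between them rather than at either end.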
● Tc abnormally high, with lots of random values of up to several
seconds, and only for TLS
● Tc timer also covers TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> system was regularly running out of entropy due to
mistakenly using /dev/random as a random source for SSL
TIP: use /dev/urandom ;)
Wrong web server configuration using
/dev/random
Conclusion
● exploit your stats
● Choose the right LB
● enable logs on LBs, no excuse for not doing it!
● process them automatically, and manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are witnessed
● Enjoy :-)
Conclusion
Questions & Answers
