a talk
Nelson Elhage, @nelhage
Operating Consul
As an Early Adopter
This Talk
• consul @ Stripe
• War Stories
• Lessons Learned
Consul at Stripe
The Good, The Bad, The Outages
Why Consul?
• Early 2014
• Stripe Infra gaining complexity
• Nightmarish in-house service registry
• Host lists distributed via puppet
Why Consul?
• Wanted a better service/host store
• consul had everything baked in
• Decided to do some test deployments
Initial Rollout
• Rolled out across all servers
• (started with bake-in in QA)
• No clients at all
What Could Go Wrong?
• We worried about memory leaks
Our First Production Issue
• Noticed one node taking >100M RAM
• (others all <50M)
• Reached out to armon for advice
• bug in the stats framework:
• https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141
Started Adding Clients
• Hooked into our deploy tool
• kept a manual emergency fallback
• Generated LB config from consul
• Noticed a surprising rate of errors
Raft Instability
• Seeing >1 failover/minute
• Reached out to Armon
• “Try 0.3”
• “consul is not optimized for spinning disk”
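A failover rate like the one above can be estimated from periodic samples of the current leader. The following is a minimal sketch, not the tooling used in the talk: it assumes the samples were gathered by polling something like Consul's `/v1/status/leader` endpoint, and the function names are hypothetical. Polling can miss elections that happen between samples, so the result is a lower bound.

```python
from datetime import datetime, timedelta


def count_leader_changes(samples):
    """Count leader changes between consecutive samples.

    `samples` is a list of (timestamp, leader_address) tuples, oldest first.
    An empty leader string (no leader elected) counts as its own state.
    """
    changes = 0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current != previous:
            changes += 1
    return changes


def changes_per_minute(samples):
    """Return the observed leader changes per minute over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = (samples[-1][0] - samples[0][0]).total_seconds()
    if elapsed <= 0:
        return 0.0
    return count_leader_changes(samples) * 60.0 / elapsed


if __name__ == "__main__":
    start = datetime(2014, 6, 16, 16, 0, 0)
    # Hypothetical samples taken every 30 seconds.
    observed = [
        (start + timedelta(seconds=30 * i), leader)
        for i, leader in enumerate(["a:8300", "b:8300", "b:8300", "a:8300", "a:8300"])
    ]
    print(changes_per_minute(observed))
```

Graphing this number over time makes it easier to tell whether a change such as an upgrade or a move to SSDs actually reduced the election rate.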
Rolling out 0.3
• Roll to QA first
• Nothing works!
• Check logs: TLS verification errors
Rolling out 0.3
• 0.3 changed TLS verification to check the cert name
• Change our SSL issuing to add SANs
• 2014/06/16 16:52:57 [ERR] raft: Failed to make RequestVote RPC to 10.100.29.175:8300: x509: certificate is valid for [remote host], not [local host]
0.3 TLS Woes
• Whoops! consul was checking the remote cert against the local node name
• armon> we just use "demo.consul.io" as the CN for all of them
• 0.3 essentially completely broke TLS
0.3.1
• I wrote a patch, which was merged, to restore the 0.2 behavior
• Rolled forward to 0.3.1
• Upgraded to SSD-backed servers
Increasing Rollout
• Switched various operational tools from flat files to consul
• Main app started using consul at startup
Consensus is Hard
consul-template
• Generating haproxy config using consul-template
• https://github.com/hashicorp/consul-template/issues/168 – `consul-template` takes O(N²) time with N services
consul-template
• Got that fixed, turned it on
• consul immediately fell over
• multiple elections/minute
• 2M allocations/minute
consul-template
• Service watches churn when any service changes health state
• Watching services on a large cluster → self-DDoS
consul-template
• We use `consul-template -once` in cron now
• Worse latency, but it works reliably
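The cron approach above can be sketched as a small wrapper that runs one render and then exits. This is an illustrative sketch only: the template and output paths and the reload command are hypothetical, and the timeout value is an assumption chosen so that a hung run cannot overlap the next cron invocation. It relies on the `-once` flag and the `-template "source:destination:command"` form of consul-template.

```python
import subprocess


def build_command(template_path, output_path, reload_command=None,
                  binary="consul-template"):
    """Build the argv for a single, non-watching consul-template run."""
    spec = f"{template_path}:{output_path}"
    if reload_command:
        # The optional third field is a command run after the file is rendered.
        spec += f":{reload_command}"
    return [binary, "-once", "-template", spec]


def run_once(template_path, output_path, reload_command=None, timeout_seconds=50):
    """Render once; return True on success, False on failure or timeout."""
    command = build_command(template_path, output_path, reload_command)
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical paths and reload command, for illustration only.
    print(build_command("/etc/haproxy/haproxy.ctmpl", "/etc/haproxy/haproxy.cfg",
                        "service haproxy reload"))
```

Because each run opens its queries, renders, and exits, there are no long-lived watches on the cluster; the cost is that changes are picked up only at the next cron interval.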
consul for leader election
• Our data team wanted a leader-election primitive
• Built on top of consul, cribbing example code
Sometime Later…
goroutine leak
• consul would rapidly eat all memory
• larger heap -> large GC pauses -> raft instability
• manually restarted cluster 1/day
goroutine leak
• Reached out to Armon
• Very helpful in debugging
• Found several unrelated memory leaks
goroutine leak
• Tried to figure out what changed
• Eventually correlated to a session leak in our leader election code
goroutine leak
• Fixed our leader-election code
• New policy: No non-discovery uses of consul
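One way to avoid the kind of session leak described above is to tie the session's lifetime to a scope, so it is destroyed even when the work inside fails. The sketch below is not the code from the talk. It uses Consul's HTTP session and KV endpoints (`/v1/session/create`, `/v1/kv/<key>?acquire=`, `/v1/session/destroy/<id>`); the `ConsulHTTP` and `leadership` names, the key, and the TTL value are hypothetical.

```python
import json
import urllib.request
from contextlib import contextmanager


class ConsulHTTP:
    """Tiny PUT-only client for a local agent (address is an assumption)."""

    def __init__(self, base_url="http://127.0.0.1:8500"):
        self.base_url = base_url

    def put(self, path, body=None):
        data = json.dumps(body).encode() if body is not None else b""
        request = urllib.request.Request(self.base_url + path, data=data, method="PUT")
        with urllib.request.urlopen(request, timeout=5) as response:
            return json.loads(response.read().decode())


@contextmanager
def leadership(client, key, name="leader-election", ttl="30s"):
    """Yield True if this process acquired `key`, and always destroy the session."""
    # The TTL is a backstop in case the process dies before cleanup runs.
    session_id = client.put("/v1/session/create", {"Name": name, "TTL": ttl})["ID"]
    try:
        acquired = client.put(f"/v1/kv/{key}?acquire={session_id}") is True
        yield acquired
    finally:
        # Destroying the session also releases any lock it holds.
        client.put(f"/v1/session/destroy/{session_id}")


if __name__ == "__main__":
    # Requires a reachable agent; the key name is hypothetical.
    with leadership(ConsulHTTP(), "service/data/leader") as is_leader:
        print("leader" if is_leader else "follower")
```

A long-running leader would also need to renew the session before the TTL expires; that loop is omitted here to keep the cleanup path easy to see.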
consul DNS
• Increasingly reliant on consul for internal discovery
• Unhappy at exposure to periodic instability
• Still have fallbacks, but outages remain painful
consul DNS
• Solution: Use consul-template to compile consul DNS to a zone file
• Serve that out of a normal DNS server
• Refresh every 15s
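The idea of compiling service data into a zone file can be illustrated with a short stand-in. The talk uses consul-template for this step; the Python below is only a sketch of the same shape, with a hypothetical origin, name-server address, and record layout. The atomic rename is an added assumption, intended to keep the DNS server from ever reading a half-written file.

```python
import os
import tempfile


def render_zone(services, origin="consul.example.internal.", ttl=15, serial=1,
                ns_address="127.0.0.1"):
    """Render a zone file with one A record per service address.

    `services` maps a service name to a list of IPv4 address strings.
    """
    lines = [
        f"$ORIGIN {origin}",
        f"$TTL {ttl}",
        f"@ IN SOA ns.{origin} hostmaster.{origin} {serial} 3600 600 86400 {ttl}",
        f"@ IN NS ns.{origin}",
        f"ns IN A {ns_address}",
    ]
    # Sort for stable output so unchanged data renders an unchanged file.
    for name in sorted(services):
        for address in sorted(services[name]):
            lines.append(f"{name}.service IN A {address}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical service data, for illustration only.
    print(render_zone({"web": ["10.0.0.2", "10.0.0.1"]}, serial=7), end="")
```

With a refresh interval of 15 seconds, a matching record TTL keeps resolvers from caching answers much longer than the data itself is allowed to be stale.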
Current Status
• Run consul everywhere
• Register all services
• Request-path lookups hit cached DNS
• Operational tools use HTTP interface
• Also generate config from consul-template
Final Stability Note
• consul 0.5.2 fixed our memory leaks
• consul has been quite stable for us of late
• consul-template watches still don’t scale
• 0.6 should help
Lessons Learned
being an early adopter without bringing down the site
(too many times)
Expect It To Be Rough
Monitoring, Monitoring, Monitoring
(graph all the things)
Incremental Rollout
Limit Scope
Isolation
Upgrade Aggressively
Get To Know Upstream
Be Willing to Dive In
Questions?