puppet @ 100,000+ agents 
John Jawed (“JJ”) 
eBay/PayPal
but I don’t have 100,000 agents 
the issues ahead were encountered at <1,000 agents
me 
responsible for Puppet/Foreman @ eBay 
how I got here: 
engineer -> engineer with root access -> system/infrastructure engineer
free time: PuppyConf
puppet @ eBay, quick facts 
-> perhaps the largest Puppet deployment 
-> more definitively the most diverse 
-> manages core security 
-> trying to solve the “p100k” problems
#’s 
• 100K+ agents 
– Solaris, Linux, and Windows 
– Production & QA 
– Cloud (openstack & VMware) + bare metal 
• 32 different OS versions, 43 hardware configurations 
– Over 300 permutations in production 
• Countless apps from C/C++ to Hadoop 
– Some applications over 15+ years old
currently 
• 3-4 puppet masters per data center 
• foreman for ENC, statistics, and fact collection 
• 150+ puppet runs per second 
• separate git repos per environment, common core 
modules 
– caching git daemon used by ppm’s
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
nodes growing, sometimes violently 
(chart: node count over time with a linear growth trendline)
setup puppetmasters 
set up a puppet master; it’s the CA too 
sign and run 400 agents concurrently; that’s less than 
half a percent of all the nodes you need to get 
through.
not exactly puppet issues 
entropy unavailable 
crypto is CPU heavy (heavier than you’d ever believe) 
passenger children are all busy
OK, let’s set up separate hosts which only function as a 
CA
multiple dedicated CAs 
much better: distributed the CPU and I/O load and 
helped the entropy problem. 
the PPMs can handle actual puppet agent runs 
because they aren’t tied up signing. Great!
wait, how do the CAs know about each other’s certs? 
some sort of network file system (NFS sounds okay).
shared storage for CA cluster 
-> Get a list of pending signing requests (should be small!) 
# puppet cert list 
… 
wait 
… 
wait 
…
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
optimize CAs for a large # of certs 
Traversing a large # of certs is too slow over NFS. 
-> Profile 
-> Implement optimization 
-> Get patch accepted (PUP-1665, 8x improvement)
<3 puppetlabs team
optimizing foreman 
- read heavy is fine, DBs do it well. 
- read heavy in a write heavy environment is more challenging. 
- foreman writes a lot of log, fact, and report data post puppet run. 
- majority of requests are to get ENC data 
- use makara with PG read slaves 
(https://github.com/taskrabbit/makara) to scale ENC requests 
- needs updates to foreigner (gem) 
- if ENC requests are slow, puppetmasters fall over.
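makara does its routing by wrapping the ActiveRecord adapter, so the split lives in database.yml. A sketch of what that might look like, assuming a PostgreSQL setup; the hostnames are placeholders and exact keys can differ between makara versions:

```yaml
production:
  adapter: postgresql_makara
  database: foreman
  makara:
    sticky: true                 # keep a session on master right after it writes
    connections:
      - role: master             # fact/report/host writes land here
        host: pg-master.example.com
      - role: slave              # ENC reads load balanced across the slaves
        host: pg-slave1.example.com
      - role: slave
        host: pg-slave2.example.com
```

With `sticky: true`, a request that writes keeps reading from master for the rest of the session, which avoids reading stale data from a lagging replica.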
optimizing foreman 
ENC requests load balanced to read slaves 
fact/report/host info write requests sent to master 
makara knows how to arbitrate the connection (great 
job TaskRabbit team!)
more optimizations 
make sure RoR cache is set to use dalli 
(config.cache_store = :dalli_store), see foreman wiki 
fact collection optimization (already in upstream); 
without this, reporting facts back to foreman can kill a 
busy puppetmaster! (if you care: 
https://github.com/theforeman/puppet-foreman/pull/145)
<3 the foreman team
let’s add more nodes 
Adding another 30,000 nodes (that’s 30% coverage). 
Agent setup: pretty standard stuff, puppet agent as a 
service.
results 
average puppet run: 29 seconds. 
not horrible. but the average is a lie, because it 
usually represents the arithmetic mean (sum of values / N). 
the actual puppet run graph looks more like…
curve impossible 
No one in operations or infrastructure ever wants a service runtime graph like this. 
(chart: spiky puppet runtime curve, with the mean average line far below the peaks)
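A quick illustration of how the mean hides the spikes; the run times below are made up for illustration, not measured:

```python
# 90 fast runs plus a spiky tail of slow ones -- hypothetical numbers
runs = [5] * 90 + [250] * 10

mean = sum(runs) / len(runs)               # 29.5s: looks like the "29 second" average
p95 = sorted(runs)[int(0.95 * len(runs))]  # 250s: what the tail actually experiences
```

The mean says everything is fine; the percentile shows one in ten runs is nearly ten times slower.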
PPM running @ medium load 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby 
… system processes
60 seconds later…idle 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
17343 puppet 20 0 344m 77m 3828 S 11.6 0.1 74:47.23 ruby 
31152 puppet 20 0 203m 9048 2568 S 11.3 0.0 0:03.67 httpd 
29435 puppet 20 0 203m 9208 2668 S 10.9 0.0 0:05.46 httpd 
16220 puppet 20 0 337m 74m 3828 S 10.3 0.1 70:07.42 ruby 
16354 puppet 20 0 339m 75m 3816 S 10.3 0.1 62:11.71 ruby 
… system processes
120 seconds later…thrashing 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 94.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 93.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 92.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 90.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 89.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 89.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 88.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 87.8 0.1 64:44.62 ruby 
17152 puppet 20 0 353m 75m 3984 S 86.3 0.1 64:44.62 ruby 
17153 puppet 20 0 353m 75m 3984 S 85.3 0.1 64:44.62 ruby 
17151 puppet 20 0 353m 75m 3984 S 82.9 0.1 64:44.62 ruby 
… more ruby processes
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
what we really want 
A flat, consistent runtime curve; this is important for any production service. 
Without predictability there is no reliability!
consistency @ medium load 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby 
… system processes
hurdle: runinterval 
near impossible to get a flat curve because of uneven 
and chaotic agent run distribution. 
runinterval is non-deterministic; even if you manage 
to sync up service start times, the timing eventually 
becomes nebulous again.
the puppet agent daemon approach is not going to 
work.
plan A: puppet via cron 
generate the run time based on some deterministic agent 
data point (IP, MAC address, hostname, etc.). 
i.e., if you wanted a puppet run every 30 minutes, your 
crontab might look like: 
08 * * * * puppet agent -t 
38 * * * * puppet agent -t
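The deterministic offset can come from hashing the data point. A minimal sketch; the md5-of-hostname scheme is an assumption, any stable hash works:

```python
import hashlib

def cron_minutes(hostname: str, interval: int = 30) -> list:
    """Hash a hostname to a stable minute offset, then emit one crontab
    minute field per run in the hour (e.g. [8, 38] for a 30-minute interval)."""
    offset = int(hashlib.md5(hostname.encode()).hexdigest(), 16) % interval
    return list(range(offset, 60, interval))
```

Each host gets the same crontab every time it is generated, while different hosts land on different minutes.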
plan A yields 
Fewer and predictable spikes
Improved. 
But it does not scale: cronjobs make run times 
deterministic but lack even distribution.
eliminate all masters? masterless puppet 
kicking the can down the road: somewhere, 
infrastructure still has to serve the files and catalog to 
agents. 
masterless puppet creates a whole host of other 
issues (file transfer channels, catalog compiler host).
eliminate all masters? masterless puppet 
…and the same issues exist, albeit in different 
forms. 
it shifts the problems to “compile interval” and 
“manifest/module push interval”.
plan Z: increase your runinterval 
Z, the zombie apocalypse plan (do not do this!). 
delaying failure till you are no longer responsible for it 
(hopefully).
alternate setups 
SSL termination on load balancer – expensive 
- LBs are difficult to deploy, cost more (you still 
need failover, otherwise it’s a SPoF!) 
caching – cache is meant to make things faster, not 
required for things to work. if cache is required to make 
services functional, you’re solving the wrong problem.
zen moment 
maybe the issue isn’t about timing the agent from 
the host. 
maybe the issue is that the agent doesn’t know when 
there’s enough capacity to reliably and predictably run 
puppet.
enforcing states is delayed 
runinterval/cronjobs/masterless setups still render 
puppet a suboptimal solution in a state-sensitive 
environment (customer and financial data). 
the problem is not unique to puppet; salt, CoreOS, et 
al. are susceptible.
security trivia 
web service REST3DotOh just got compromised and 
allows a sensitive file managed by puppet to be 
manipulated. 
Q: how/when does puppet set the proper state?
the how; sounds awesome 
A: every puppet run ensures that a file is in its 
intended state and records the previous state if it was 
not.
the when; sounds far from awesome 
A: whenever puppet is scheduled to run next. up to 
runinterval minutes from the compromise, masterless 
push, or cronjob execution.
smaller intervals help but… 
all the strategies have one common issue: 
puppet masters do not scale with smaller intervals, 
which exacerbates spikes in the runtime curve.
this needs to change
pvc 
“pvc” – open source & lightweight process for a 
deterministic and evenly distributed puppet service 
curve… 
…and reactive state enforcement puppet runs.
pvc 
a different approach that executes puppet runs based on 
available capacity and local state changes. 
pings from an agent to check if it’s time to run puppet. 
file monitoring to force puppet runs when important files 
change outside of puppet (think /etc/shadow, 
/etc/sudoers).
pvc 
basic concepts: 
- Frequent pings to determine when to run puppet 
- Tied in to backend PPM health/capacity 
- Frequent fact collection without needing to run puppet 
- Sensitive files should be subject to monitoring 
- on change or updates outside of puppet, immediately run 
puppet! 
- efficiency is an important factor.
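pvc uses inotify for efficiency; as a rough, portable illustration of the reactive idea only (this is a polling sketch, not how pvc actually hooks the kernel):

```python
import hashlib

def file_digest(path):
    """Content hash of a file, or None if it is missing/unreadable."""
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except OSError:
        return None

def changed_paths(paths, previous):
    """Return the files whose digest differs from the previous snapshot,
    plus the new snapshot to carry into the next poll. A non-empty first
    element is the trigger for an immediate forced puppet run."""
    current = {p: file_digest(p) for p in paths}
    return [p for p in paths if current[p] != previous.get(p)], current
```

inotify delivers the same signal without rereading files on a timer, which is why the real implementation prefers it.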
pvc advantages 
-> variable puppet agent run timing 
- allows the flat and predictable service curve (what we 
want). 
- more frequent puppet runs when capacity is available, 
less frequent runs when less capacity is available.
pvc advantages 
-> improves security (kind of a big deal these days) 
- puppet runs when state changes rather than waiting to 
run. 
- efficient, uses inotify to monitor files. 
- if a file being monitored is changed, a puppet run is 
forced.
pvc advantages 
- orchestration between foreman & puppet 
- controlled rollout of changes 
- upload facts between puppet runs into foreman
pvc – backend 
3 endpoints – all get the ?fqdn=<certname> parameter 
GET /host – should pvc run puppet or facter? 
POST /report – raw puppet run output, files monitored 
were changed 
POST /facts – facter output (puppet facts in JSON)
pvc – /host 
> curl http://hi.com/host?fqdn=jj.e.com 
< PVC_RETURN=0 
< PVC_RUN=1 
< PVC_PUPPET_MASTER=puppet.vip.e.com 
< PVC_FACT_RUN=0 
< PVC_CHECK_INTERVAL=60 
< PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"
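The response is plain KEY=VALUE lines, so client-side handling stays trivial. A minimal parse might look like this; only the field names come from the slide, the parsing logic is a sketch:

```python
def parse_pvc_response(body: str) -> dict:
    """Turn the KEY=VALUE lines from the pvc /host endpoint into a dict."""
    settings = {}
    for line in body.splitlines():
        line = line.strip()
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key] = value.strip('"')
    return settings

body = '''PVC_RETURN=0
PVC_RUN=1
PVC_PUPPET_MASTER=puppet.vip.e.com
PVC_FACT_RUN=0
PVC_CHECK_INTERVAL=60
PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"'''

settings = parse_pvc_response(body)
run_puppet = settings["PVC_RUN"] == "1"
monitored = settings["PVC_FILES_MONITORED"].split()
```

The agent then runs puppet only when `PVC_RUN=1`, pointing at the master the backend handed it.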
pvc – /facts 
allows collecting facts outside of the normal puppet 
run, useful for monitoring. 
set PVC_FACT_RUN to report facts back to the pvc 
backend.
pvc – git for auditing 
push actual changes between runs into git 
- branch per host, parentless branches & commits 
are cheap. 
- easy to audit fact changes (fact blacklist to 
prevent spam) and changes between puppet runs. 
- keeping puppet reports between runs is not 
helpful.
pvc – incremental rollouts 
select candidate hosts based on your criteria and set an environment variable 
via the /host endpoint output: 
FACTER_UPDATE_FLAG=true 
in your manifest, check: 
if $::UPDATE_FLAG { 
… 
}
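Candidate selection can be made deterministic the same way as run scheduling. A hedged sketch of percentage-based selection; the hashing scheme is an assumption, not pvc’s actual mechanism:

```python
import hashlib

def in_rollout(certname: str, percent: int) -> bool:
    """Place each host in a stable bucket 0-99 and include it
    when its bucket falls inside the rollout percentage."""
    bucket = int(hashlib.md5(certname.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

The /host backend would emit `FACTER_UPDATE_FLAG=true` only for hosts where this holds, widening `percent` as confidence in the change grows; the same hosts stay selected at every wider percentage.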
example pvc.conf 
host_endpoint=http://jj.e.com/host 
report_endpoint=http://jj.e.com/report 
facts_endpoint=http://jj.e.com/facts 
info=1 
warnings=1
pvc – available on github 
$ git clone https://github.com/johnj/pvc 
make someone happy, achieve:
wishlist 
stuff pvc should probably have: 
• authentication of some sort 
• a more general backend, currently tightly integrated 
into internal PPM infrastructure health 
• whatever other users wish it had
misc. lessons learned 
your ENC has to be fast, or your puppetmasters fail 
without ever doing anything. 
upgrade ruby to 2.x for the performance improvements. 
serve static module files with a caching http server 
(nginx).
contact 
@johnjawed 
https://github.com/johnj 
jj@x.com

Editor’s Notes 
#25: Greg, Dominic, Ohad