Test driven infrastructure development

Tomas (t0m) Doran
<tomas.doran@timgroup.com>
@bobtfish
https://github.com/bobtfish
https://github.com/youdevise
Thursday, 14 March 13

‘Real men’ develop in production!

• Edit / Commit / Push
• Update puppetmaster
• puppet agent -t
• Repeat

Repeat again and again. Development cycle SLOOOW.



This is insane!

• Try it on an 8 person team.
• ‘LOL - I broke puppet’

CHAOS and FAIL result when you break each other. Or, MORE likely (this happens twice a
day!)

OI!!!
OI t0m!!!!
You broke
puppet!
AARRRGGH!!!


Let’s fix this!

• First, a glossary:
• mco - mcollective
• ENC - External node classifier


We can do better

• Branch == environment
• Branch / Commit / Push
• mco puppetupdate
• puppet agent -t --environment xxx

This at least lets you develop things independently. Everyone can do dev in their own branch
and merge once they have something that doesn’t break _everything_. You can also rebase -i
(squash) all the ARGH PUPPET SYNTAX commits.


Sounds good?

• Then you’ll be wanting:
• https://github.com/youdevise/puppetupdate


It’s a bit basic, but then I ripped it out of work internal code at 8am ;)

So we fixed it?


Refactoring

• We change things to be consistent across the codebase:
    • Why did puppet just delete all the firewall rules on the production database?
• We don’t refactor:
    • Add bugs all the time due to inconsistency

Sorry Chris, but when you say ‘refactoring’ - it’s not refactoring unless you have tests.
The problem is that you can’t always remember to run the right branch on all the right nodes.
Or rather, how do you even know what all the right nodes are? And if you’re hacking on
custom functions, or anything using exported resources - WOE


Unfortunate reality:

• Hard coded IPs in 10 places
• role::oy_lb
• hiera data split by domain (colo)
• mco puppet
• 4 weeks per app per environment

So, despite our best efforts, our puppet code was SHIIIIT.
Exported resources ARE NOT a good fit for non-trivial things (like generating load balancer
configs). Ergo lots of hard coded IPs in multiple places. Ergo puppet code per site.



The state of the art

• It’s certainly in a state
• Automatic runs dangerous
• cron --noop runs
• puppet becomes an auditing system
• This isn’t what I signed up for!

Nobody does automatic runs.
Puppet becomes an auditing tool (automatic noop runs + reports).



Business says no!
• Launching new products has a long lead
time
• This is unhelpful if your company is trying
to branch out into new markets
• CI / stage environments unlike prod
• Issues when new functionality goes live
• Developers think you’re incompetent

What is wrong with this picture?

• Did you run it everywhere?
• Does it affect anything you’re not expecting?
• Can you rebuild cleanly?
• Does the code even make things reflect current state?

You just don’t know the answer to any of these questions in any reliable way...
But, generally, the answers are NO, YES, NO, NO



‘We use puppet’

• Means nothing
• State of your system is
the sum of all changes
• How do you know your
code can rebuild things?


Hint - you don’t!

It’s all mierda

• Development communities are 10 years ahead
• We don’t integration test
    • (repeatably)
• We can’t build / rebuild
    • (reliably)

We need to grow up, and raise the level of the conversation.


Infra is hard

• Infrastructure is inherently more complex
• Less control
• More moving parts
• ‘End to end’ testing
• Persistent data

Sure - it’s much much harder to get a standalone testable system in infra than it is in
development.

No excuses:
Scientific method


I do not consider this an excuse to abandon sanity.

The solution?
• Re-provision everything in tests
• N.B. Not perfect (but better!)

• Proper software engineering
• Unit and integration tests
• Build pipeline + promotion

Openstack

• Our tests spinning up 12 machines => VMs
• Openstack going to be awesome, right now:
    • Networking sucks
    • Load balancing is a shambles
    • lvs / vlans / metal / bonding - nope

So, we should use openstack, right? As of December, when we looked - 2 networks max,
inflexible. lvs not possible.



My desires:

• Reuse as much code as possible! (e.g. load balancers)
• No per colo/environment puppet code
• No IPs anywhere
• ‘DRY’
• CI pipeline to promote to production
• 1 puppet run from provisioned to working
• Repeatable and testable!

Orc

• Continuous (zero downtime) deployment
• Development / infrastructure application contract
• Model driven
• https://github.com/youdevise/orc/


Puppetroll

• Rolls out a consistent sha1 from the puppetmaster to an entire environment
• Fails if any puppet run fails
• https://github.com/youdevise/puppetroll
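
To make the ‘one sha1 everywhere, fail if anything fails’ contract concrete, here is a minimal Ruby sketch. It is not puppetroll’s actual code: `roll`, `RollResult` and the block that stands in for ‘run puppet on this node at this revision’ are all assumptions for illustration.

```ruby
# Result of a rollout: which revision was rolled, and which nodes failed.
RollResult = Struct.new(:sha1, :failed_nodes) do
  def success?
    failed_nodes.empty?
  end
end

# Hypothetical rollout loop. The caller supplies a block that runs puppet on
# one node at the given sha1 and returns true on success, false on failure.
# Every node gets the same sha1; any single failure fails the whole roll.
def roll(sha1, nodes)
  failed = nodes.reject { |node| yield(node, sha1) }
  RollResult.new(sha1, failed)
end

# Example with a fake runner in which one node fails.
result = roll('abc123', %w[web1 web2 lb1]) { |node, _sha1| node != 'lb1' }
puts "rolled #{result.sha1}: #{result.success? ? 'ok' : "FAILED on #{result.failed_nodes.join(', ')}"}"
```

Collecting every failure rather than stopping at the first one is a design choice in this sketch; it means a single report tells you all the nodes that need attention.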


Provisioning tools

• debootstrap custom gold images
• mcollective ‘computenode’ agent for kvm
• ‘provision me a machine called X, on networks Y and Z’
• Dynamic IP allocation (dnsmasq locally, DDNS for real)


stacks

• Model driven deployment
• DSL for describing groups of systems + dependencies
• rake tasks to provision / test / clean up stack + deps
• Can provision a full environment, run E2E tests, tear it down - in CI.
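
For a feel of what a ‘DSL for describing groups of systems + dependencies’ can look like, here is a small Ruby sketch. It is NOT the real stacks DSL: `stack`, `service`, `instances:` and `depends_on:` are invented names, and the machine naming scheme is an assumption. It uses the standard library’s TSort to work out an order in which dependencies get provisioned first.

```ruby
require 'tsort'

# Hypothetical model of a stack: named services, how many machines each
# needs, and which other services each depends on.
class StackModel
  include TSort

  def initialize
    @services = {} # name => { instances: Integer, depends_on: [names] }
  end

  # DSL verb: declare a service.
  def service(name, instances: 1, depends_on: [])
    @services[name] = { instances: instances, depends_on: depends_on }
  end

  # Machine names derived from the model, e.g. "lb-001" (naming is made up).
  def machines
    @services.flat_map do |name, spec|
      (1..spec[:instances]).map { |i| format('%s-%03d', name, i) }
    end
  end

  # Services ordered so that dependencies come before their dependents.
  def provision_order
    tsort
  end

  # The two hooks TSort needs: all nodes, and each node's dependencies.
  def tsort_each_node(&block)
    @services.each_key(&block)
  end

  def tsort_each_child(name, &block)
    @services.fetch(name)[:depends_on].each(&block)
  end
end

# Evaluate a block of DSL calls against a fresh model and return the model.
def stack(&block)
  model = StackModel.new
  model.instance_eval(&block)
  model
end

model = stack do
  service :lb,  instances: 2, depends_on: [:app]
  service :app, instances: 2, depends_on: [:db]
  service :db
end

puts model.provision_order.inspect
puts model.machines.inspect
```

Once the model exists as plain data, provisioning, testing and clean-up can all walk the same structure, which is what makes rake tasks over it practical.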


I want to hack on load balancers

= 4 new, independent machines


How it works?

• DSL creates model of systems
• rake task ‘launch’:
    • mco provisions boxes on compute nodes
    • each box runs puppet --waitforcert
    • mco signs cert
    • puppet runs for each box
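
The launch sequence above can be sketched as an ordered set of steps over the machines in the model. This is an illustration only: `launch` and `RecordingDriver` are hypothetical names, and the driver just records calls where the real thing would talk to mcollective.

```ruby
# Hypothetical stand-in for the mco calls; it only records what was asked.
class RecordingDriver
  attr_reader :log

  def initialize
    @log = []
  end

  def provision(machine)
    @log << [:provision, machine]
  end

  def sign_cert(machine)
    @log << [:sign_cert, machine]
  end

  # Returns true when the puppet run succeeded.
  def puppet_run(machine)
    @log << [:puppet_run, machine]
    true
  end
end

# Hypothetical 'launch' task body: provision every box, sign every cert,
# then run puppet on each box and fail loudly if any run failed.
def launch(machines, driver)
  # Once provisioned, each box boots and waits for its cert on its own
  # (the puppet --waitforcert step), so there is no driver call for that.
  machines.each { |m| driver.provision(m) }
  machines.each { |m| driver.sign_cert(m) }
  failed = machines.reject { |m| driver.puppet_run(m) }
  raise "puppet failed on: #{failed.join(', ')}" unless failed.empty?
  machines
end

driver = RecordingDriver.new
launch(%w[lb-001 lb-002], driver)
driver.log.each { |step, machine| puts "#{step} #{machine}" }
```

Keeping the driver as a separate object means the sequencing can be exercised in a test without any compute nodes at all.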

mco computenode


Puppetmaster

• Puppet code:
• Just installs things / starts services
• I.E. what it’s good at!


External node classifier
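
As a rough idea of what ‘generates an ENC for each node’ from a model involves: puppet calls the classifier with a node name and expects YAML on stdout with a top-level `classes` key. The sketch below assumes a plain hash as the model; the node name, the `role::loadbalancer` class and its `backends` parameter are invented for illustration.

```ruby
require 'yaml'

# Hypothetical model: node name => classes (with parameters) for that node.
model = {
  'lb-001.example.com' => {
    'classes' => {
      'role::loadbalancer' => { 'backends' => %w[app-001 app-002] }
    }
  }
}

# Build the ENC answer for one node as a YAML string.
# Nodes the model does not know about get an empty class list.
def classify(certname, model)
  node = model.fetch(certname) { { 'classes' => {} } }
  { 'classes' => node['classes'] }.to_yaml
end

# A real classifier script would print the answer for the node named in ARGV.
puts classify('lb-001.example.com', model)
```

Because the backends come from the model rather than from the puppet code, the manifests never need to mention an IP or a site.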


Putting it together

• Still ongoing - live production apps ETA two weeks.
• Still haven’t solved re-provisioning problem for live environments!
• Do have repeatable and testable / tested infrastructure building in CI!

So, what do we have? Well - everything I showed you already...
Building proxy server layer (by refactoring puppet code) right now. Databases to follow!




The top table is our test overview - we have two types of tests, those which are for a specific
machine (i.e. a VM) and those which are for a virtual service (backed by multiple machines).
‘behaves like’ is an rspec thing we haven’t overridden.
For each machine, we test that it’s pingable, then run every nrpe (nagios) agent and check

In the (near) future?

• Live application stack in production
• Automated ‘promotion’ of good changes to production
• Integrated environment support for dev stacks on dev branches/environments
• Open source all the things!


Thanks!
• puppet is an awesome tool.
• It doesn’t solve higher level system
modeling problems
• It shouldn’t try to!

• sysadmins need to level up
• It’s not done till you can test it still works

Photo Credits
• Escher's "Relativity" in LEGO - Andrew Lipson (http://www.andrewlipson.com)
• Manure - Flickr - chesbayprogram
• Provisions - Flickr - quinn.anya
• Stacked - Flickr - andrewrennie
• Dilbert - Flickr - osde-info
• Stacking wood - Flickr - arthuserea
• Square wheels - Flickr - vrogy
• Puppets - Flickr - SkipSteuart
• Light bulb - Flickr - bazik
• This-is-not-art - Wikimedia commons - Loran Davis
• Danger of death - Flickr - zigazou76
• Bob the Builder - Flickr - jamesclay
• Swiss roll - Flickr - add1sun
• Orc - Flickr - photo_munki
• Danger! Danger - Flickr - donsolo
• Cow of the future - Flickr - thewamphyri
• SCIENCE - Flickr - chasblackman


Links!

• http://github.com/youdevise
• http://github.com/bobtfish
• https://devblog.timgroup.com/
• (Yes, we are hiring)

