Automation at Brainly
… or how to enter the world of automation in a “different way”.
About Brainly
World’s largest homework help social network, connecting over 40 million users monthly
OPS stack:
● ~80 servers, heavy usage of LXC containers (~1000)
● 99.9% Debian, 1 Ubuntu host :)
● Nginx / Apache2, 2k reqs per sec
● 200 million page views monthly
● 700 Mbps peak traffic
● Python is dominant
DEV stack:
● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version
● Erlang
- 55k concurrent users
- 22k events per sec
● Native Apps
- iOS
- Android
● Puppet was not feasible for us
- *lots* of dependencies which make containers bigger/heavier
- problems with Puppet's declarative language
- seemed incoherent, lacking integration of orchestration
- steep learning curve
- YMMV
● "packaging as automation" as an intermediate solution
- dependency hell, installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules file
- LOTS of custom bash and PHP scripts, usually very hard to reuse
and not standardized
- this was a dead end :(
● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later
simplified and helped with lots of things
Starting point
● we decided to move forward with Ansible and use it for setting up machines as
well
● first project was nagios monitoring plugins setup
● turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt),
and a small footprint - "configured" Python modules are transferred
directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand. Tasks/playbooks map directly to actions
an ops/devops engineer would take when doing the job by hand
- compatible with "automation by packages". We were able to
migrate from the old system in small steps.
First steps with Ansible
● all policies, rules, and good practices are written down in the main directory
of the automation repo
● helps with introducing new people to the team or to the devops approach
- newbies are able to start committing to the repo quickly
- what's in GUIDELINES.md is law, and changing it requires wider
consensus
- gives examples of how to deal with certain problems in a standardized way
● a few examples:
- limit the number of tags, each of them should be self-contained
with no cross-dependencies.
- do not include roles/tasks inside other roles,
this creates hard to follow dependencies
- NEVER subset the list of hosts inside the role, do it in site.yml.
Otherwise debugging roles/hosts will become difficult
- think twice before adding a new role and esp. new groups. As infrastructure
grows, they become hard to manage and/or create "dead" code/roles
Avoiding regressions
● one of the policies introduced was storing one-off scripts in a
separate directory in our automation repo.
● most of them are Ansible playbooks used for just one particular
task (e.g. Squeeze->Wheezy migration)
● version-control everything!
● this turned out to be very useful, some of them proved useful
enough to be rewritten into a proper role or a tool
Ugly-hacks reusability
PLNOG14: Automation at Brainly - Paweł Rozlach
● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941
● “base” role:
- is reused across the 8 different production roles we currently have
- contains basic monitoring, log rotation, package installation, etc.
- includes PHP setup in modphp/prefork configuration
- PHP disabled functions control
- basic security setup
- does not include any site-specific stuff
● "site" role:
- contains all site-specific stuff and dependencies
(vhosts, additional packages, etc.)
- usually very simple
- more than one site role possible, only one base role though
● It is an example of how we make our roles reusable
Apache2 automation
● automatically sets up monitoring based on the inventory and host groups
● implements the devops approach - if a dev has root on a machine, they also have
access to all monitoring stuff related to that system
● automatic host dependencies based on host groups
● provisioning new hosts is no longer so painful ("auto-discovery")
● all service configuration is stored in YAML files and used in templates
● the role uses DNS data directly from the inventory in order to make monitoring
independent of DNS failures
Icinga
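Generating monitoring objects from the inventory could look roughly like the sketch below. It is not the role's actual code: the inventory layout, the `render_hosts` helper and the sample hosts are made up for illustration, and the output uses Nagios-style object definitions (as accepted by Icinga 1.x), which is an assumption about the Icinga version in use. The `address` is taken straight from `ansible_ssh_host`, so checks do not depend on DNS resolution.

```python
# Hypothetical inventory data: host -> connection address and group membership.
INVENTORY = {
    "web1": {"ansible_ssh_host": "10.0.0.11", "groups": ["apache", "production"]},
    "db1": {"ansible_ssh_host": "10.0.0.21", "groups": ["mysql", "production"]},
}


def render_hosts(inventory):
    """Render one Nagios-style 'define host' block per inventory host."""
    blocks = []
    for name in sorted(inventory):
        data = inventory[name]
        # The address is the IP stored in the inventory, not a DNS name.
        address = data["ansible_ssh_host"]
        # Host groups in the inventory map directly to monitoring hostgroups.
        hostgroups = ",".join(sorted(data["groups"]))
        blocks.append(
            "define host {\n"
            f"    host_name   {name}\n"
            f"    address     {address}\n"
            f"    hostgroups  {hostgroups}\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_hosts(INVENTORY))
```

Adding a host to the inventory data is then enough for it to show up in the generated monitoring configuration.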
DNS migration
● at the beginning:
- dozens of authoritative name servers, each of them having
customized configuration, running ~100 zones, all created by hand
- the main reason for that was using DNS for switching between
primary/secondary servers/services
● three phases:
- slurping configuration into Ansible
- normalizing the configuration
- improving the setup
● a Python script uses the Ansible API to fetch normalized zone configuration from
each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using the named-checkconf tool
● use slurped configuration to re-generate all configs, this time using only the data
available to Ansible's
● "push-button" migration, after all recipes were ready :)
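The "slurping" phase above can be illustrated with a small sketch. The real script fetched the data through the Ansible API and normalized it with named-checkconf; the code below only covers the last step, turning already-fetched configuration text into a per-host, per-zone hash. The sample configuration, the host names and the `parse_zones` / `collect_zones` helpers are hypothetical, and the brace matching is deliberately naive.

```python
import re

# Matches the start of a zone statement, with an optional class such as "IN".
ZONE_RE = re.compile(r'zone\s+"([^"]+)"(?:\s+\w+)?\s*\{')

SAMPLE = '''
options {
    directory "/var/cache/bind";
};
zone "example.com" {
    type master;
    file "/etc/bind/db.example.com";
    allow-transfer {
        10.0.0.2;
    };
};
zone "example.org" IN {
    type slave;
    file "db.example.org";
    masters {
        10.0.0.1;
    };
};
'''


def parse_zones(conf_text):
    """Return {zone_name: {"type": ..., "file": ...}} for one server's config."""
    zones = {}
    for match in ZONE_RE.finditer(conf_text):
        # Walk forward to the brace that closes this zone statement.
        depth, pos = 1, match.end()
        while pos < len(conf_text) and depth:
            if conf_text[pos] == "{":
                depth += 1
            elif conf_text[pos] == "}":
                depth -= 1
            pos += 1
        body = conf_text[match.end():pos - 1]
        ztype = re.search(r"\btype\s+([\w-]+)\s*;", body)
        zfile = re.search(r'\bfile\s+"([^"]+)"\s*;', body)
        zones[match.group(1)] = {
            "type": ztype.group(1) if ztype else None,
            "file": zfile.group(1) if zfile else None,
        }
    return zones


def collect_zones(config_by_host):
    """Build the per-host, per-zone hash from {host: config_text}."""
    return {host: parse_zones(text) for host, text in config_by_host.items()}


if __name__ == "__main__":
    print(collect_zones({"ns1": SAMPLE}))
```

With the data in this shape, differences between servers can be compared and normalized before any configuration is regenerated.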
● secure: all zone transfers are signed with individual keys, ACLs are tight
● playbooks use DNS data directly from the inventory
● changing/migrating slaves/masters is easy, NS records are auto-generated
● updates to zones automatically bump serial, while still preserving the
YYYYMMDDxx format
● CRM records are auto-generated as well
* see next slide about CRM automation
● DNS entries are always up to date thanks to some custom action modules
- ansible_ssh_host variables are harvested and processed into zones
- only custom entries and zone primary/secondary server names are
now stored in YAML
- new hosts are automatically added to zones, decommissioned
ones - removed
- auto-generation of reverse zones
DNS automation
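Two of the ideas above, bumping the serial while keeping the YYYYMMDDxx format and deriving forward and reverse records from `ansible_ssh_host`, can be sketched in a few lines. This is an illustration, not the custom action modules themselves: `bump_serial`, `build_records`, the sample host and the domain are hypothetical, and only IPv4 addresses are assumed.

```python
import ipaddress
from datetime import date


def bump_serial(current, today):
    """Return the next zone serial, keeping the YYYYMMDDxx layout."""
    base = int(today.strftime("%Y%m%d")) * 100  # e.g. 2015030200
    if current < base:
        # First change of the day: start a new date prefix.
        return base
    if current % 100 == 99:
        raise ValueError("no serials left under the current date prefix")
    return current + 1


def build_records(hostvars, domain):
    """Turn ansible_ssh_host values into forward (A) and reverse (PTR) records."""
    forward, reverse = [], []
    for host in sorted(hostvars):
        ip = ipaddress.ip_address(hostvars[host]["ansible_ssh_host"])
        fqdn = f"{host}.{domain}."
        forward.append(f"{fqdn} IN A {ip}")
        # reverse_pointer gives e.g. "11.0.0.10.in-addr.arpa" for 10.0.0.11.
        reverse.append(f"{ip.reverse_pointer}. IN PTR {fqdn}")
    return forward, reverse


if __name__ == "__main__":
    print(bump_serial(2015030105, date(2015, 3, 2)))
    print(build_records({"web1": {"ansible_ssh_host": "10.0.0.11"}}, "example.com"))
```

Because both record sets come from the same inventory data, a decommissioned host disappears from the forward and reverse zones at the same time.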
● we have ~130 CRM clusters
● setting them up by hand would be "difficult" at best, impossible at worst
● available on Ansible Galaxy:
- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979
● follows pattern from apache2_base
- "base" role suitable for manually set up clusters
- "cluster" role provides a service on top of base, with a few reusable snippets
and a possibility for more complex configurations
● automatic membership based on ansible inventory (no multicasts!)
● the most difficult part was providing synchronous handlers
● a few simple configurations are provided, like single service - single VIP
Corosync & Pacemaker
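Inventory-based membership without multicast can be pictured as rendering a static node list. The sketch below assumes corosync 2.x configuration syntax (`transport: udpu` plus a `nodelist` section), which may differ from what the published roles generate; `render_corosync` and the sample members are hypothetical.

```python
# Hypothetical cluster members taken from the inventory: host -> ring address.
MEMBERS = {"db1": "10.0.0.21", "db2": "10.0.0.22"}


def render_corosync(members):
    """Render totem and nodelist sections for unicast (udpu) membership."""
    lines = [
        "totem {",
        "    version: 2",
        "    transport: udpu",
        "}",
        "",
        "nodelist {",
    ]
    # Node IDs follow the sorted host names, so they stay stable only
    # as long as the set of members does not change.
    for nodeid, host in enumerate(sorted(members), start=1):
        lines += [
            "    node {",
            f"        ring0_addr: {members[host]}",
            f"        nodeid: {nodeid}",
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_corosync(MEMBERS))
```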
● initially we had neither the time nor the resources to set up a full-fledged LDAP
● we needed:
- users should be able to log in even during a network outage
- adding/removing users, ssh-keys, custom settings, etc.
all had to be supported
- it had to be reusable/accessible in other roles
(e.g. Icinga/monitoring)
- different privileges for dev, production and other environments
- UID/GID unification
● turned out to be simpler than we thought - users are managed using a few
simple tasks and group_vars data. The rest is handled via variable precedence.
● migration/standardization required some effort though
User management automation
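How variable precedence can carry per-environment differences is easy to model in plain Python. The sketch assumes Ansible's default behaviour, where a variable defined at a more specific level (group_vars for a group, then host_vars) replaces the value from group_vars/all as a whole; the variable names and users are invented for the example.

```python
def resolve_vars(*layers):
    """Merge variable layers, ordered from lowest to highest precedence.

    A later layer replaces the whole value of a variable defined earlier,
    which mirrors Ansible's default (non-merging) behaviour.
    """
    resolved = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


# Hypothetical data: defaults for every host, overrides for the dev
# environment, and one host-specific setting.
GROUP_ALL = {"users_present": ["alice", "bob"], "sudo_users": ["alice"]}
GROUP_DEV = {"sudo_users": ["alice", "bob"]}
HOST_VARS = {"users_absent": ["mallory"]}

if __name__ == "__main__":
    print(resolve_vars(GROUP_ALL, GROUP_DEV, HOST_VARS))
```

The same few tasks can then create or remove accounts everywhere, while the data decides who gets which privileges in which environment.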
● standard Ansible inventory management becomes a bit cumbersome with hundreds of
hosts:
- each host has to have ansible_ssh_host defined
- adding/removing a large number of hosts/groups required editing lots of files
and/or one-off scripts
- IP address management using Google Docs does not scale ;)
● Ansible has a well-defined dynamic inventory API, with scripts available for AWS,
Cobbler, Rackspace, Docker, and many others.
● we wrote our own, based on a YAML file version-controlled in git:
- a Python API for manipulating the inventory easily
- logic and syntax checking of the inventory
● available as open source: https://github.com/brainly/inventory_tool
Inventory management
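For readers unfamiliar with dynamic inventories: Ansible runs an executable and reads JSON describing groups and host variables. The sketch below produces that JSON shape (groups with a `hosts` list, plus `_meta.hostvars`). It is not inventory_tool itself: the `HOSTS` data and `build_inventory` helper are hypothetical, and an in-code dictionary stands in for the YAML file so that the example needs only the standard library.

```python
import json

# Hypothetical source data; in the setup described above this would
# come from a version-controlled YAML file.
HOSTS = {
    "web1": {"ip": "10.0.0.11", "groups": ["apache"]},
    "web2": {"ip": "10.0.0.12", "groups": ["apache"]},
    "db1": {"ip": "10.0.0.21", "groups": ["mysql"]},
}


def build_inventory(hosts):
    """Build the JSON structure Ansible expects from a dynamic inventory."""
    inventory = {"_meta": {"hostvars": {}}}
    for name in sorted(hosts):
        # Every host gets its connection address from the single source of truth.
        inventory["_meta"]["hostvars"][name] = {"ansible_ssh_host": hosts[name]["ip"]}
        for group in hosts[name]["groups"]:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(name)
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(HOSTS), indent=2))
```

A complete inventory script would print this structure when called with `--list` and answer `--host <name>` as well; that argument handling is left out here.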
● we are leasing our servers from Hetzner, no direct Layer 2 connectivity
● all tunnel setups are done using Ansible, a new server
is automatically added to our network
● firewalls are set up by Ansible as well:
- OPS contribute the base firewall, DEVs can open
the ports of interest for their application
- ferm at its base, for easy rule making and keeping the in-kernel firewall in sync
with on-disk rules
- rules are auto-generated based on the inventory, adding/removing hosts
automatically reconfigures the firewall
Networking
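The split between an OPS-owned base firewall and DEV-opened ports, generated from inventory data, might look like the sketch below. It is only an illustration: `render_ferm`, the sample hosts and the chosen rules are assumptions, and the output follows ferm's rule syntax as commonly documented rather than the actual templates used here.

```python
def render_ferm(trusted_hosts, dev_ports):
    """Render a ferm INPUT chain: OPS base rules plus DEV-opened ports."""
    sources = " ".join(sorted(trusted_hosts.values()))
    ports = " ".join(str(port) for port in sorted(dev_ports))
    lines = [
        "domain ip table filter chain INPUT {",
        "    policy DROP;",
        # Base rules contributed by OPS.
        "    mod state state (ESTABLISHED RELATED) ACCEPT;",
        "    interface lo ACCEPT;",
        # SSH is allowed only from hosts known to the inventory.
        f"    saddr ({sources}) proto tcp dport ssh ACCEPT;",
    ]
    if ports:
        # Application ports opened by developers.
        lines.append(f"    proto tcp dport ({ports}) ACCEPT;")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ferm({"web1": "10.0.0.11", "db1": "10.0.0.21"}, [443, 80]))
```

Regenerating this file whenever the inventory changes is what makes adding or removing a host reconfigure the firewall without manual edits.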
● based on Bareos, an open-source Bacula fork
● new hosts are automatically set up for backup,
extending storage space is no longer a problem
● authentication using certificates, a PITA without Ansible
Backups
● deployment is done by a Python script calling the Ansible API
● simple tasks are implemented using Ansible playbooks
● complex logic is implemented in Python
Deployments
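The division of labour described above (Python for the logic, playbooks for the simple steps) can be sketched as follows. Note the substitution: the original script called the Ansible Python API, whereas this sketch shells out to the `ansible-playbook` command line, because the Python API differs between Ansible versions. The playbook name, inventory path and `app_version` variable are hypothetical.

```python
import json
import subprocess


def build_playbook_command(playbook, inventory, limit=None, extra_vars=None, skip_tags=()):
    """Assemble an ansible-playbook command line as an argument list."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        cmd += ["--limit", limit]
    if extra_vars:
        # --extra-vars accepts a JSON document.
        cmd += ["--extra-vars", json.dumps(extra_vars, sort_keys=True)]
    if skip_tags:
        cmd += ["--skip-tags", ",".join(skip_tags)]
    return cmd


def deploy(version, hosts, run=subprocess.run):
    """Decide what to deploy in Python, then delegate the work to a playbook."""
    cmd = build_playbook_command(
        "deploy.yml",      # hypothetical playbook name
        "inventory.yml",   # hypothetical inventory path
        limit=hosts,
        extra_vars={"app_version": version},
    )
    return run(cmd, check=True)


if __name__ == "__main__":
    print(build_playbook_command("deploy.yml", "inventory.yml", limit="web"))
```

Passing `run` as a parameter keeps the decision logic testable without touching any real host.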
● Jinja2 template error messages are "difficult" to interpret
● templates sometimes grow to huge complexity
● Jinja2 is designed for speed, but with tradeoffs - some Python operators are
missing and creating custom plugins/filters poses some problems
● multi-inheritance, problems with 2-headed trees
● speed, improved with "pipelining=True", containerization in the long run
● some useful functionality requires a paid subscription (Ansible Tower)
- RESTful API, useful if you want to push a new application version
to production via e.g. Jenkins
- schedules - currently we need to push the changes ourselves
Not everything is perfect
● developers by default have RO access to repo, RW on case-by-case basis
● changes to systems owned by developers are done by developers,
OPS only provide the platform and tools
● all non-trivial changes require a Pull Request and a review from Ops
● encrypt mission critical data with Ansible Vault and push it directly to the repo
- *strong* encryption
- available to Ansible without the need for decryption
(password still required though)
- all security-sensitive stuff can be skipped by developers with the
"--skip-tags" option to ansible-playbook
Dev, DevOps, Ops
● some of the things we mentioned can be found on our GitHub account
● we are working on open-sourcing more stuff
https://github.com/brainly
Opensource! Opensource! Opensource!
● time needed to deploy new markets dropped considerably
● increased productivity
● better cooperation with developers
● more manpower, Devs are no longer blocked as much, and we can push
tasks to them
● infrastructure as code
● versioning
● code-reuse, less copy-pasting
Conclusions
We are hiring!
http://brainly.co/jobs/
Questions?
Thank you!
