Greenfielding Network and Systems
Automation in a Large and Highly Dynamic
Public Transit Network
Logan Best
DevOps Engineer
Transit Wireless
Share your automation story
1. How did you get started with Ansible?
2. How long have you been using it?
3. What's your favorite thing to do when you Ansible?
-vvv
Disclaimer:
This talk will be intentionally vague in
some cases due to NDA and proprietary
IP that I cannot divulge.
Any opinions expressed are my own
and not my employer's.
AnsibleFest 2019 - Greenfielding Network and Systems Automation in a Large and Highly Dynamic Public Transit Network
Core Network
Cisco
● IOS
● IOS-XE
● IOS-XR
● NX-OS
● ASA
Extreme
● NX9500
● NX9600
● VX9000
Nokia
● ALU
● ALE
Westell
Mikrotik
Digi LTE
Debian
Ubuntu
CentOS
Proxmox
Oxidized
Zabbix
The list goes on…
What does this all come down to?
● We have a massive footprint of vendors, versions, and platforms to
cover
● Almost 20,000 devices just in NYC
● network_cli just isn’t enough sometimes
● Yes, that means some things rely on telnet >.<
● LOADS of underlying groundwork required
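The speaker notes explain that where Ansible's telnet support falls short, Ansible ends up executing expect scripts. As a rough illustration of what such a script does (not the team's actual scripts, which the talk does not show), here is a minimal Python sketch of an expect-style dialog: wait for output matching a prompt regex, send a reply, repeat. `run_dialog`, `read_chunk`, `send_line`, and `FakeTelnetDevice` are hypothetical names; a real telnet or serial session would plug in where the fake device sits.

```python
import re
from typing import Callable, List, Tuple


def run_dialog(
    read_chunk: Callable[[], str],
    send_line: Callable[[str], None],
    steps: List[Tuple[str, str]],
    max_reads: int = 100,
) -> str:
    """Expect-style driver: for each (prompt_regex, reply) step, read output
    until the prompt regex matches, then send the reply. Returns all output read."""
    transcript = ""
    buffer = ""
    for pattern, reply in steps:
        prompt = re.compile(pattern)
        reads = 0
        match = prompt.search(buffer)
        while match is None:
            if reads >= max_reads:
                raise TimeoutError(f"no output matched {pattern!r}")
            chunk = read_chunk()
            buffer += chunk
            transcript += chunk
            reads += 1
            match = prompt.search(buffer)
        buffer = buffer[match.end():]  # keep any output that followed the prompt
        send_line(reply)
    return transcript


class FakeTelnetDevice:
    """Hypothetical stand-in for a telnet session on a legacy device."""

    REPLIES = {"admin": "Password: ", "secret": "sw1>", "show version": "Version 1.0\nsw1>"}

    def __init__(self) -> None:
        self.pending = ["Username: "]
        self.sent: List[str] = []

    def read_chunk(self) -> str:
        return self.pending.pop(0) if self.pending else ""

    def send_line(self, line: str) -> None:
        self.sent.append(line)
        self.pending.append(self.REPLIES.get(line, "sw1>"))


if __name__ == "__main__":
    dev = FakeTelnetDevice()
    steps = [(r"Username:", "admin"), (r"Password:", "secret"), (r">\s*$", "show version")]
    print(run_dialog(dev.read_chunk, dev.send_line, steps))
```

Keeping the transport behind two callbacks lets the same dialog logic be exercised against a fake device before it ever touches a production switch.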
So how do you even begin?
● Talk to your peers about existing pain points
● Where’s the low-hanging fruit you can get easy wins with?
● How’s the existing infrastructure set up? What’s missing?
What are the current projects?
● Find out what your team or related teams are working on
● How can those tasks be automated?
What was missing?
● Source of Truth
● Secrets Management
● CMDB
● Central Authentication
● Self Service
● DEVELOPERS DEVELOPERS DEVELOPERS
Whew…
So how do we even get started?
● Crawl, Walk, Run principle
● K.I.S.S.
● Have a BIG emphasis on team training and buy-in
● Network Audit
● Get corporate buy-in on conferences, trainings, and certifications
● Use the small initial wins as leverage
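According to the speaker notes, the network audit is where the first reports came from: used vs. unused interfaces per datacenter, and which VLANs sit on which station-switch interfaces in which admin state. A minimal sketch of that kind of report, assuming interface facts have already been collected into plain dictionaries. The field names `oper_up`, `admin_up`, and `vlans` are assumptions, as is treating "used" as "operationally up".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical interface facts gathered during an audit; field names are
# assumptions for illustration, not the output format of any particular module.
SAMPLE_FACTS = [
    {"site": "dc1", "device": "dc1-sw1", "interface": "Gi1/0/1", "oper_up": True, "admin_up": True, "vlans": [10, 20]},
    {"site": "dc1", "device": "dc1-sw1", "interface": "Gi1/0/2", "oper_up": False, "admin_up": False, "vlans": [10]},
    {"site": "st42", "device": "st42-sw1", "interface": "Gi0/1", "oper_up": False, "admin_up": True, "vlans": []},
]


def interface_usage(facts: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count used (operationally up) vs. unused interfaces per site."""
    usage: Dict[str, Dict[str, int]] = defaultdict(lambda: {"used": 0, "unused": 0})
    for fact in facts:
        usage[fact["site"]]["used" if fact["oper_up"] else "unused"] += 1
    return dict(usage)


def vlan_report(facts: List[dict]) -> Dict[str, Dict[int, List[Tuple[str, str]]]]:
    """Per device: which interfaces carry each VLAN, and their admin state."""
    report: Dict[str, Dict[int, List[Tuple[str, str]]]] = defaultdict(lambda: defaultdict(list))
    for fact in facts:
        state = "admin up" if fact["admin_up"] else "admin down"
        for vlan in fact["vlans"]:
            report[fact["device"]][vlan].append((fact["interface"], state))
    return {device: dict(vlans) for device, vlans in report.items()}


if __name__ == "__main__":
    print(interface_usage(SAMPLE_FACTS))
    print(vlan_report(SAMPLE_FACTS))
```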
Crawl
● Utilize Network Audit to gather facts about the network
● Team Education
● Monitoring
● Automation used only as needed, limited to validated, reviewed,
additive-only changes
● Start introducing input validation to reduce change risk
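One way to combine "input validation to reduce change risk" with "additive-only changes" is a pre-flight check that runs before any playbook touches a device. The sketch below is hypothetical: the VLAN name pattern and the allowlist are illustrative policy choices, not anything the talk specifies; only the usable 1-4094 VLAN ID range comes from 802.1Q.

```python
import re
from typing import List

VLAN_MIN, VLAN_MAX = 1, 4094  # usable 802.1Q VLAN IDs (0 and 4095 are reserved)
# Assumed naming policy for illustration; check your platforms' real limits.
VLAN_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,32}")
# Prefixes that remove or reset config; rejected to keep changes additive-only.
SUBTRACTIVE_PREFIXES = ("no ", "default ")
# Hypothetical allowlist of "no ..." commands treated as safe by this policy.
ALLOWED_EXCEPTIONS = {"no shutdown"}


def validate_vlan_change(vlan_id: int, name: str, commands: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the change may proceed."""
    errors: List[str] = []
    if not VLAN_MIN <= vlan_id <= VLAN_MAX:
        errors.append(f"VLAN ID {vlan_id} outside {VLAN_MIN}-{VLAN_MAX}")
    if not VLAN_NAME_RE.fullmatch(name):
        errors.append(f"invalid VLAN name {name!r}")
    for command in commands:
        # Normalize case and whitespace so " no  shutdown" compares equal.
        normalized = " ".join(command.lower().split())
        if normalized in ALLOWED_EXCEPTIONS:
            continue
        if normalized.startswith(SUBTRACTIVE_PREFIXES):
            errors.append(f"subtractive command rejected: {command.strip()!r}")
    return errors


if __name__ == "__main__":
    print(validate_vlan_change(100, "CCTV_cams", ["vlan 100", " name CCTV_cams"]))
    print(validate_vlan_change(4095, "bad name!", ["no vlan 20"]))
```

A naive `no ` prefix check would also reject `no shutdown`, which is why the sketch carries a small allowlist; any real policy would need the team's review.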
Walk
● Introduce Netbox as Source of Truth
● Build your Inventory strategy
● Set up DNS and LDAP/RADIUS AAA
● Start simple and small when making changes to the network
● Severely limit your initial footprint to reduce risk to prod
● LAB EVERYTHING!!!
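A common inventory strategy with Netbox as the source of truth is to generate Ansible's dynamic inventory from it. The sketch below assumes device records have already been exported into dictionaries (field names are assumptions; no Netbox API calls are made) and turns namespaced tags such as `role:core`, the tagging style mentioned in the speaker notes, into inventory groups. The output follows the JSON shape Ansible expects from an inventory script called with `--list`: groups with `hosts`, plus `_meta.hostvars`.

```python
import json
import re
import sys
from typing import Dict, List

# Hypothetical device records, shaped roughly like what a source of truth
# could export; field names are assumptions.
SAMPLE_DEVICES = [
    {"name": "core-rtr1", "primary_ip": "10.0.0.1", "platform": "iosxr", "tags": ["role:core", "site:dc-1"]},
    {"name": "st42-sw1", "primary_ip": "10.42.0.2", "platform": "ios", "tags": ["role:access", "legacy"]},
]


def group_name(tag: str) -> str:
    """Turn a namespaced tag like 'site:dc-1' into a safe group name 'site_dc_1'."""
    namespace, _, value = tag.partition(":")
    return re.sub(r"[^a-z0-9_]", "_", f"{namespace}_{value}".lower())


def build_inventory(devices: List[dict]) -> Dict[str, dict]:
    """Build the JSON structure Ansible expects from an inventory script's --list."""
    inventory: Dict[str, dict] = {"_meta": {"hostvars": {}}}
    for device in devices:
        name = device["name"]
        inventory["_meta"]["hostvars"][name] = {
            "ansible_host": device["primary_ip"],
            "platform": device["platform"],
        }
        for tag in device.get("tags", []):
            if ":" not in tag:
                continue  # only namespaced tags become groups in this sketch
            inventory.setdefault(group_name(tag), {"hosts": []})["hosts"].append(name)
    return inventory


if __name__ == "__main__":
    # With _meta included, Ansible skips per-host --host calls; answer those with {}.
    if "--list" in sys.argv:
        print(json.dumps(build_inventory(SAMPLE_DEVICES), indent=2))
    else:
        print(json.dumps({}))
```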
Run
● Netbox implementation complete
● Monitoring adds new automation and device specific metrics
● Implement rollback: integrate with Oxidized to back up on each run and
restore if needed
● Automated ZTP with Ansible instead of console provisioning
● Introduce Jira and proper change/project management culture
● Auto documenting Jira issues with Ansible!
● Getting closer to no manual changes as playbooks evolve and become
more robust
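The rollback bullet describes taking an Oxidized backup on each run and restoring it if needed. The control flow can be sketched as below. `backup`, `apply`, `verify`, and `restore` are hypothetical callables (in practice they might trigger a config backup, run a playbook, and check monitoring); nothing here uses Oxidized's actual API.

```python
from typing import Callable, Dict


def change_with_rollback(
    device: str,
    change: str,
    backup: Callable[[str], str],
    apply: Callable[[str, str], None],
    verify: Callable[[str], bool],
    restore: Callable[[str, str], None],
) -> str:
    """Back up first, apply, verify, and restore the snapshot if anything fails."""
    snapshot = backup(device)  # e.g. a fresh config backup taken before the play
    try:
        apply(device, change)
        healthy = verify(device)
    except Exception as exc:  # a real run would log this properly before rolling back
        print(f"{device}: change failed: {exc}")
        healthy = False
    if healthy:
        return "applied"
    restore(device, snapshot)
    return "rolled back"


if __name__ == "__main__":
    # In-memory stand-in for device configs (hypothetical, for demonstration).
    configs: Dict[str, str] = {"sw1": "hostname sw1\n"}

    def backup(dev: str) -> str:
        return configs[dev]

    def apply(dev: str, change: str) -> None:
        configs[dev] += change

    def restore(dev: str, snapshot: str) -> None:
        configs[dev] = snapshot

    print(change_with_rollback("sw1", "ntp server 10.0.0.5\n", backup, apply,
                               lambda dev: "ntp server" in configs[dev], restore))
    print(change_with_rollback("sw1", "bogus line\n", backup, apply,
                               lambda dev: False, restore))
    print(configs["sw1"])
```

Taking the snapshot before the `try` means a failed backup stops the change outright, which matches the "backup on each run" intent: no backup, no change.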
How are we doing all of this?
● Python
● Ansible
● AWX
● Netbox
● StackStorm
● Zabbix
● Jira
● Slack
● Viewflow.io
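The speaker notes describe StackStorm as the hub that listens for events (new Jira tickets, Zabbix alerts, Slack commands, Viewflow.io API calls) and routes actions to several tools at once. StackStorm's own model of sensors, triggers, rules, and actions is much richer; this toy router only illustrates the fan-out idea and does not use StackStorm's API. Trigger names and actions are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Action = Callable[[dict], str]


class EventRouter:
    """Toy trigger -> actions router; trigger names and actions are hypothetical."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Action]] = defaultdict(list)

    def on(self, trigger: str, action: Action) -> None:
        """Register an action to run whenever `trigger` fires."""
        self._rules[trigger].append(action)

    def dispatch(self, trigger: str, payload: dict) -> List[str]:
        """Fan the event out to every registered action, in registration order."""
        return [action(payload) for action in self._rules.get(trigger, [])]


if __name__ == "__main__":
    router = EventRouter()
    router.on("zabbix.alert", lambda p: f"jira: open issue for {p['host']}")
    router.on("zabbix.alert", lambda p: f"slack: notify NOC about {p['host']}")
    router.on("slack.command", lambda p: f"awx: launch job {p['job']}")
    print(router.dispatch("zabbix.alert", {"host": "st42-sw1"}))
```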
So where’s the “highly dynamic” part?
Wi-Fi onboard the trains!
[Diagram: four train cars, in the order A C C A]
Some train operators don’t keep cars
together
How can we keep our sanity?
● Rigorous testing
● Get so good at it that you could write a whitepaper on it
● Innovate using existing protocols
● Have a backup strategy for when it all fails to provision
In the end...
● Don’t be afraid to start slow
● Don’t be afraid to start small
● Have a well thought out vision
● Advocate for education for yourself and your peers
● You will eventually break something.

Editor's Notes

  • #3: Programming for 21 years. Worked in the DC industry for 8 years (as of this week, actually). Started in systems, graduated to network, and now I’m officially “DevOps”. Using Ansible for 5 years; fell into it while bootstrapping servers at my previous employer (public cloud, IaaS, private cloud, DR, backups). Ended up using it to auto-build BGP peering configs, which then turned into config generation for BGP route-policies and customer provisioning (migrating Perl to Ansible, ugh).
  • #6: Intro to TW. How many of you have been to NYC? How many of you knew we have cellular/Wi-Fi in the stations? How many of you have actually used it? Thoughts? I’m taking your feedback to my superiors ;) 472 stations are covered: 409 underground with cellular/Wi-Fi, and the rest above ground with emergency services only.
  • #9: Who here has spotted red flags in this list so far?
  • #10: I want you to start raising your hands when you see red flags. Massive footprint of vendors, firmware versions, and platforms to cover. NYC alone has almost 20,000 devices. And if that wasn’t enough, network_cli just isn’t enough sometimes. That doesn’t mean netconf or httpapi is used instead… Yes, that means some things rely on telnet. And the telnet module doesn’t always work that well, so that means expect scripts executed by Ansible. Those of you who haven’t raised your hands yet: try working with telnet on unsupported devices with Ansible, and then let me know how you feel about your life. This also means there are LOADS of underlying groundwork required to cover our network.
  • #11: So how do you even begin automating a massive network, largely using unsupported telnet with Ansible and expect scripts? Describe our low-hanging fruit: systems, Proxmox/Netbox integration, CIMC, LDAP, DNS, DHCP, reporting.
  • #12: Source of truth (not Zabbix or any monitoring tool, and holy hell, not Excel sheets). Secrets management. CMDB. Central authentication: when I was hired, everything used local authentication… Everything. Self-service portal for less CLI-proficient NetOps/NOC teams. DEVELOPERS: I knew coming in that I would be the only one with programming experience and would be responsible for training the team or acquiring training for the team. And that’s OK!
  • #13: One teammate had already started dabbling with Ansible. Luckily, a full top-to-bottom network audit was underway when I was hired, so it was fairly easy for me to latch onto that and start pulling reports in support of the audit: things like “How many interfaces in this datacenter are used and unused?” or “What VLANs are present on every station’s switches, which interfaces carry which VLANs, and are those interfaces admin up or down?” The audit was completed over a month ahead of schedule. Corporate buy-in on conferences and training seems like a no-brainer, but every company is different, with a different budget. Coming in hot saying you need 4 people to immediately start attending several new conferences that nobody there has ever attended certainly won’t go over well.
  • #14: So let’s break this down using the Crawl, Walk, Run principle. My number one goal in our automation journey is team education. Lay the groundwork to either build a robust monitoring solution or enhance the one you have. We have one of the largest Zabbix databases that Zabbix itself knows of, currently at 13 TB. In fact, our monitoring is so important to our automation strategy that I tied it into our education plan: three teammates and I spent all of last week in Zabbix training. Two of us, myself included, were also there for the certification, and we took and passed the Zabbix Certified Professional exam on Friday. Luckily, our monitoring was already robust enough for us to get started. However, there’s always room for improvement, and we’re on that road in parallel.
  • #15: Introduce Netbox as the new SoT. This entails sanitizing and importing massive Excel sheets, physically verifying information, physically mapping our rack elevations, and TAGGING. Someone in the Network to Code Slack mentioned a tagging strategy called namespaced tagging, which I immediately loved, and I have started fitting it into our Netbox tagging. Use facts from the network to populate config contexts in Netbox. Set up internal DNS infrastructure to get away from IPs everywhere and to simplify our Ansible inventory. Introduce FreeIPA as the new central auth, migrate local users to LDAP/RADIUS users, and set up service accounts for automation and anything else that needs access (such as Oxidized). Start small with Jinja templating and config modules to deploy these new AAA and DNS changes. Start with the core IP network only; don’t even think about touching stations yet. LAB EVERYTHING! Luckily, we have massive GNS3 hosts that our engineering team uses religiously to model changes to the network before applying them, which has drastically reduced error rates compared to previous years without them. Now the automation tests you run (and you had better be running them) can be run against nearly identical GNS3 VMs.
  • #16: Netbox implementation is complete, and now you have an amazing dynamic inventory! Evolve your monitoring metrics to catch issues with automation changes before they become larger problems. Implement rollback options! This means integrating with Oxidized (our network backup solution of choice) to make a backup before the play runs. Drastically reduce field-tech provisioning work by implementing Zero Touch Provisioning in conjunction with Ansible: all they have to do is make sure the mgmt interface has DHCP running, and we do the rest. If your org doesn’t already have a good change management and project management culture, this is a must. We prefer Jira. You should’ve seen management drool when I told them I was going to implement auto-documenting change management via Ansible. NTC has a really great talk by Jason Edelman on this very topic with ServiceNow; I’d highly recommend checking it out on YouTube for an example of what this would look like. And of course, keep iterating on your automation so that fewer and fewer manual changes are made to your devices.
  • #17: You may be wondering why I haven’t mentioned StackStorm yet. StackStorm is a workflow management platform and is a central part of our entire automation pipeline. Most of our workflows pass through StackStorm in some capacity, which routes specific actions to different tools at the same time. In some cases, StackStorm listens for changes and acts on them, such as new Jira tickets, Zabbix alerts, Slack commands, or even Viewflow.io API calls from the self-service portal.
  • #20: So here’s a sample diagram of 4 train cars, not specifically NYC Subway related. When we consider putting Wi-Fi onboard the trains, we have to assume the cars won’t stay together, but the operators have types of cars that form a pattern. A cars will always be at the front and back of the train, and sometimes spread out in the middle as well. C cars are everything in between. How in the world do you keep train networks consistent when you never know who your neighbor is going to be? Well, unfortunately I can’t say exactly how, but this is where ZTP plays a heavy role in our automation strategy, even more so than in the station-level network. Essentially, a train could go into the yard at the end of its daily shift. That train could get dismantled and serviced in a variety of ways, again depending on the train operators involved. We have to assume that an A car and a C car will never see each other again the next day, when they’re placed into another train. By default, our devices reach out to ZTP on car power-up to get their base golden config. Once a device reboots into its new config, our onboard train automation sets up the dynamic peering with other cars (intentionally vague there), which ends up with a full train, completely networked and providing public Wi-Fi to its customers.
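The pattern described in the note above (A cars at both ends, C cars in between, A cars possibly in the middle) and the "who is my neighbor today" problem can be expressed as two small checks that would have to be recomputed whenever the yard reshuffles cars. This is a hypothetical sketch only: the talk deliberately does not disclose how the onboard peering actually works, and the car IDs are made up.

```python
from typing import Dict, List, Optional, Tuple

Car = Tuple[str, str]  # (car_id, car_type), where car_type is "A" or "C"


def is_valid_consist(cars: List[Car]) -> bool:
    """A cars at both ends; cars in between may be A or C."""
    if len(cars) < 2:
        return False
    types = [car_type for _, car_type in cars]
    return types[0] == "A" and types[-1] == "A" and all(t in ("A", "C") for t in types)


def neighbor_map(cars: List[Car]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each car ID to its (previous, next) neighbor in today's car order."""
    ids = [car_id for car_id, _ in cars]
    return {
        car_id: (ids[i - 1] if i > 0 else None, ids[i + 1] if i < len(ids) - 1 else None)
        for i, car_id in enumerate(ids)
    }


if __name__ == "__main__":
    train = [("1001", "A"), ("2001", "C"), ("2002", "C"), ("1002", "A")]
    print(is_valid_consist(train))
    print(neighbor_map(train))
```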