Copyright© 2017 GoDaddy Inc. All Rights Reserved.
Don’t Repeat Our Mistakes!
Lessons Learned from Running Go Daddy’s Private Cloud
Kris Lindgren
klindgren@godaddy.com
Mike Dorman
mike.dorman@sendgrid.com
OpenStack Queens Summit, November 2017, Sydney
OpenStack at Go Daddy
● 2013: POC cloud (Havana)
● 2014: First production apps (Icehouse)
● 2014: Nova cells v1 (Kilo)
● 2015: “OpenStack everywhere” (Liberty)
● 2017: Working toward containerized services
OpenStack at Go Daddy
● What we built:
○ Shared nothing regions
○ Ephemeral disk on local storage
○ Simple networking
○ No live migration
○ Multiple AZs
● Scale
○ 1000s of computes, >100,000 cores
○ 10,000s of VMs
Avoiding “Accidental Architecture”
Product Infrastructure & Scaling Management
Product - Need for Chargeback/Showback
Private Cloud = Free Compute
Free Compute = High Demand
High Demand = Overconsumption
Product - Have a Cohesive Vision
• Which OpenStack services/features
• User onboarding/offboarding
• Patching cadences/methodology
• Legacy integrations
• Adding capacity
• SLAs
• How do end users “consume” OpenStack?
• Procedure for changing the vision
• Helps with cloud paradigm shift
• Expect and tolerate failure
Product Issues - How to Avoid
• Manage expectations (for yourself and for users)
• Showback and controls around quota
• Education and evangelism
• Docs and sample code
• “Cloud ready” early adopters
• Ongoing guidance
1. Cloud
2. ??????
3. Profit! [X]
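The showback and quota bullets above can be sketched in a few lines. This is a minimal illustration only, not what was run at Go Daddy: the `Usage` record, the two rate constants, and the function names are all hypothetical, and real usage figures would have to come from whatever metering the cloud already has.

```python
from dataclasses import dataclass

# Hypothetical internal rates; real numbers would have to come from finance.
RATE_PER_VCPU_HOUR = 0.02
RATE_PER_GB_RAM_HOUR = 0.005


@dataclass
class Usage:
    """One project's consumption for a reporting period (assumed input shape)."""
    project: str
    vcpu_hours: float
    ram_gb_hours: float
    vcpu_quota: int
    vcpus_in_use: int


def showback_line(u: Usage) -> dict:
    """Translate raw usage into a notional cost and a quota utilisation figure."""
    cost = u.vcpu_hours * RATE_PER_VCPU_HOUR + u.ram_gb_hours * RATE_PER_GB_RAM_HOUR
    utilisation = u.vcpus_in_use / u.vcpu_quota if u.vcpu_quota else 0.0
    return {
        "project": u.project,
        "cost": round(cost, 2),
        "quota_used": round(utilisation, 2),
    }


def showback_report(usages):
    """Return one line per project, most expensive first."""
    lines = (showback_line(u) for u in usages)
    return sorted(lines, key=lambda r: r["cost"], reverse=True)
```

Even when no money changes hands, a report like this gives each team a number to react to, and the `quota_used` column shows which quotas are far larger than what is actually consumed.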
Scaling - Nova Cells (v1)
Justification
• Assumed we would grow fast
• Challenges with scaling Nova/RMQ
• Easier earlier than later
• Ongoing debt to manage patches
• Cells v2 was coming soon
http://www.dorm.org/blog/converting-to-openstack-nova-cells-without-destroying-the-world/
Scaling - Nova Cells (v1)
Retrospective
Good
• Helped us to scale
• Gained expertise with Nova
• Community street cred
Bad
• No scaling for Neutron
• Patches get more difficult
• Non-standard config
• Delays on v2
• Migration to v2 is unknown
20/20 Hindsight
• Scale/shard RMQ instead
• Aspirations about scale
• Porting patches is top blocker
Scaling - Collapsed Architecture
Justification
• Colocated API services and RMQ
• (Except Glance)
• Dedicated hardware overkill
• Local Python packages
• Made sense for POC
• Nova separated later with Cells v1
Scaling - Collapsed Architecture
Retrospective
Good
• Simple architecture
• Minimal hardware
• Easy network ACLs
• Up and running fast
Bad
• Large failure impacts
• Resource contention
• Single API endpoints
20/20 Hindsight
• OK for POC
• Ignored it too long
• Easy to scale out
• (Implementing now)
Infrastructure - Special Neutron Architecture
Justification
• Neutron L2 assumptions
• L3 folded Clos network
• L2 stops at leafs
• Uncomfortable with overlays
• Provider network per rack
• Routed floating IPs
• Overload AZ to pick a network
• Local patches for network scheduling
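The idea of one provider network per rack, with the AZ choice deciding which networks are eligible, can be sketched as follows. The actual local patches are not shown in this deck, so this is only an assumed policy (pick the eligible rack network with the most free addresses); `RackNetwork` and `pick_network` are hypothetical names, and the IP counts are assumed to come from some usage source such as the IP usages API mentioned in the retrospective.

```python
from dataclasses import dataclass


@dataclass
class RackNetwork:
    """A per-rack provider network (hypothetical record, not a Neutron object)."""
    name: str
    az: str
    total_ips: int
    used_ips: int

    @property
    def free_ips(self) -> int:
        return self.total_ips - self.used_ips


def pick_network(networks, az):
    """Pick the rack network in the requested AZ with the most free addresses."""
    candidates = [n for n in networks if n.az == az and n.free_ips > 0]
    if not candidates:
        raise LookupError(f"no network with free IPs in AZ {az!r}")
    return max(candidates, key=lambda n: n.free_ips)
```

The user only ever chooses an AZ; which rack-local L2 domain the VM lands in is decided for them, which is what keeps the model "easy on users" while L2 still stops at the leaf.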
Infrastructure - Special Neutron Architecture
Retrospective
Good
• Same for VMs and metal
• Simple infrastructure
• Easy on users
• Network IP usages API
• Segmented networks spec
Bad
• Snowflake setup
• L2 adjacency expectations
• Added features difficult (LBaaS)
• Migration to Neutron segmented networks?
20/20 Hindsight
• Works pretty well
• Patches are limited
• IP usages API extension
• Segmented networks in Neutron
• Many others with same problem
Management - Puppet Single Source of Truth
Justification
• Big Puppet shop
• Single source of config
• Good for server bootstrapping
• OpenStack-Puppet modules
• API providers
• Code pipeline already in place
• Ansible kicks off puppet apply
Management - Puppet Single Source of Truth
Retrospective
Good
• Single source of config (in theory)
• Efficient bootstrapping
• NOOP mode for sanity
Bad
• State in Puppet, Hiera, APIs
• Some managed manually
• Duplicate API objects
• Omnibus deployments
• NOOP report not always accurate!
• Orphaned/forgotten servers
• Orchestration difficult
20/20 Hindsight
• Many unintended problems
• Not really a single source
• Need for targeted deployments
• Other tools for orchestration
• Use for bootstrapping
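When state lives in several places at once, a cheap first step is to compare the inventories and list where they disagree. The sketch below assumes three plain lists of hostnames (nodes known to Puppet, entries in the Ansible hosts file, and hosts reported by the service APIs); how those lists are obtained is left out, and `reconcile` is a hypothetical helper, not part of any of those tools.

```python
def reconcile(puppet_nodes, ansible_hosts, api_hosts):
    """Compare three host inventories and report where they disagree."""
    puppet = set(puppet_nodes)
    ansible = set(ansible_hosts)
    api = set(api_hosts)
    everything = puppet | ansible | api
    return {
        # Hosts some source knows about but the Ansible hosts file does not:
        # candidates for "forgotten" servers that never get deployed to.
        "missing_from_ansible": sorted(everything - ansible),
        # Hosts that are not under config management at all.
        "missing_from_puppet": sorted(everything - puppet),
        # Hosts that exist in config but that the APIs have never seen.
        "unknown_to_api": sorted(everything - api),
    }
```

Run regularly, a report like this surfaces orphaned servers before a deployment silently skips them.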
Strategies for Avoiding Accidental Architecture
• Think of your future selves
  • Quantify tech debt interest
• Almost nothing will be temporary
  • Make a specific plan and timeline
• Carefully consider scale
  • Overestimating can be as bad as underestimating
• Automate first
  • At least make it capable
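"Quantify tech debt interest" can be made concrete with a back-of-the-envelope model. The sketch below assumes a recurring cost per release (for example, hours spent forward-porting local patches) that grows by a fixed fraction each release; the function name, its parameters, and any numbers fed into it are illustrative assumptions, not figures from this talk.

```python
def tech_debt_interest(hours_per_release, releases_per_year, growth_per_release, years):
    """Total hours paid over `years` if the per-release cost grows by a fixed fraction."""
    total = 0.0
    cost = float(hours_per_release)
    for _ in range(int(releases_per_year * years)):
        total += cost
        # Each release the same chore gets a bit harder (compounding "interest").
        cost *= 1.0 + growth_per_release
    return total
```

Comparing this total with the one-time cost of removing the debt gives a rough payback period, which is usually enough to decide whether a "temporary" workaround deserves a plan and a timeline.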
Strategies for Avoiding Accidental Architecture
• KISS!
http://stella.report
Strategies for Avoiding Accidental Architecture
• Spread the knowledge wealth
http://stella.report
“The problem, [...] is that we are attempting to build systems that are beyond our ability to intellectually manage.” *
* The Coming Software Apocalypse: https://www.theatlantic.com/technology/archive/2017/09/saving-the-world-from-code/540393/
Recap: How to Live with No Regrets
● Manage expectations
● Education and evangelism
● Helpful early adopters
● Ongoing guidance
● Remember your future self
● Account and plan for tech debt
● Sane scale expectations
● Automate, automate, automate
● Simplicity
● Knowledge sharing
Questions?
Other Ideas?
klindgren@godaddy.com
mike.dorman@sendgrid.com


Editor's Notes

  • #2: KRIS
  • #3: KRIS - Late 2013: Havana, POC/“dev pilot” cloud. It morphed into a production cloud by 2014. We have been on Liberty since 2015, blocked on containerization.
  • #4: KRIS - So what did we end up building? Totally separate clouds at multiple locations (no shared Keystone). VMs boot to local storage and may have a Cinder volume to store “persistent data”. A network-centric approach to networking: let the networking gear take care of the packets. No live migration, meaning we didn’t want to have pets; teams were advised to be able to rebuild servers. Anything with state (databases) should not go on OpenStack.
  • #5: MIKE - Decisions have long-lasting impacts and are tough to change later. We’ll talk about general categories of things: Infrastructure & Scaling, Management (config management), and Product. We’re going to work backwards, and Kris is going to kick us off with some issues around product.
  • #6: MIKE - If it’s free for everybody, it gets all used up and isn’t there when needed: the tragedy of the commons. (Find some good images for this.) We had a lot of trouble keeping up with capacity consumption, so we would run out of space. We had ridiculously high quotas, intending to report back to corporate finance (but that never happened).
  • #7: KRIS - I know there is a lot of text on this slide, but I am not going to go through everything here. DOCUMENT WHAT YOU ARE PROVIDING: SLAs, patching policy, how you want end users to use what you built, example deployments/architectures, and integrations with legacy applications. If you are running a cloud and you don’t have this documented, please, please, please do the work to get it documented and agreed upon. Product vision drives your technical requirements, and small changes to the vision/requirements can cause a fundamental shift in what you need to provide. After getting this documented, also document the process for changing the vision.
  • #8: MIKE - Be clear on what you’re providing and what you’re not (before you build it). Know where you are going! Even if you don’t plan to actually charge others real money for using your cloud, you need to show them what they’re using and translate that to value somehow. Definitely enforce some quota control. (Talk about how we opened up our quotas with intentions to report back to the finance department against budgets, which never happened.) Unless you can actually scale hardware super fast (you can’t), it can’t just be a free-for-all. Education and evangelism: good docs, getting-started guides, sample code. Give them something to copy and paste. Start with teams that are already “cloud ready” as early adopters. Provide ongoing architectural and constructive advice. Don’t be arrogant or treat people poorly who aren’t at your level yet. This should go without saying, but if we’re honest, we all have condescending attitudes toward some. Help when things go wrong. (Describe the SendGrid ProdOps team.) Now, moving on to the more technical architecture decisions.
  • #9: MIKE - We knew we would grow fast (see earlier graph). There were known challenges with scaling Nova/RMQ. It was easier to move to cells v1 early than to face a fire-drill scaling exercise later. We knew we would take on some ongoing debt to forward-port v1 patches for each new version. Cells v2 was coming “real soon now”. Details about how we did it: link to my YVR talk about moving to cells.
  • #10: MIKE - Good: helped us to scale and segment our infrastructure (failure boundaries); gained a lot of expertise with Nova; street cred in the community (LDT group, etc.). Bad: Neutron doesn’t scale the same way, which ended up being our main bottleneck (not Nova); forward-porting patches becomes more and more difficult over time (eternal thanks to Sam from NeCTAR); unknown how/if we can migrate online to cells v2; cells v2 is still coming “real soon now” (mostly there now).
  • #11: KRIS - Run all API/server services plus RMQ on one set of servers. Glance is separate to stay network-adjacent to the computes. Most Nova services moved later as part of cells v1. A symptom of starting small with a POC environment and then growing larger.
  • #12: KRIS - Good: less hardware to deal with; simpler architecture; easier network/firewall ACLs; it helped us get started quickly. Bad: any problem is very impactful and takes out a wide swath of services; resource contention (RMQ and Neutron fighting over RAM and OOM-killing each other); no admin vs. public endpoints, so it is more difficult to do maintenance without exposing errors to users.
  • #13: KRIS - Neutron assumes (or used to assume) that L2 is available everywhere. In our datacenter network, L2 stops at the TOR, so getting to a server in another rack goes through the gateway of the local switch, to the spine, and into the other rack. Persistent IPs can be routed to any VM within the network. Overlays were viewed as unnecessarily complex and difficult to troubleshoot. There is a provider network per rack (L2 domain), and we pick a network for you based on AZ selection. Local patches do the network scheduling.
  • #14: KRIS - Good: able to provide the same networking paradigm to VMs as to metal; simple infrastructure, where VMs just get an IP and they’re good to go; network IP usages API implemented and committed upstream; kicked off the segmented networks spec as a collaboration between LDT and Neutron. This remains the thing I’m most proud of accomplishing/helping with in OpenStack. Bad: our Neutron doesn’t work like everybody else’s; people love their L2 adjacency; unable to support more complex networking features out of the box (e.g. LBaaS); unsure how we will go about migrating to real Neutron segmented networks.
  • #15: MIKE - Big Puppet shop. Puppet is pretty good for server bootstrapping and config management. We wanted a one-stop shop for all config. OpenStack API providers manage users, groups, roles, AZs, networks, etc. Code review/pipeline was already in place. Config lives mostly in Puppet and Hiera repos, the state of OpenStack resources lives inside the APIs, and physical hosts live in a manually curated Ansible hosts file.
  • #16: MIKE - Good: single place for all config (in theory); helpful for new server bootstrapping and initial config; noop mode is helpful to see what will happen. Bad: config and current state were actually split across the Puppet and Hiera repos as well as the service APIs; difficulties with API providers led to duplicate objects (networks, AZs); difficult to do non-omnibus, targeted deployments (Puppet upgraded RabbitMQ, whoops!); roles and grants are still managed manually, ad hoc; the noop report is not always accurate; servers are sometimes forgotten because we forget to put them in the list; difficult to do more intelligent orchestration when the data is all over the place.
  • #17: MIKE - Think of your future self. Almost nothing will be temporary, unless you have a specific plan and timeline for moving away from it and you can trust yourself to follow through. Try to quantify the interest you will pay on the tech debt. Consider your expected scale (more than seat-of-your-pants). It is just as bad to overestimate and overbuild as to underestimate. Automate first (or at least make sure the capability is there).
  • #18: MIKE - Keep it simple. The perfect design is not when nothing else can be added, but when nothing else can be removed. As we were working on this, the Stella Report came out, which articulates pretty well a lot of the ideas we were thinking about, particularly around complexity and a term they coined, “dark debt”: the unknown unknowns that come from complexity (link to Velocity talk/Stella paper). The best thing you can do to minimize it is to keep things as simple as possible.
  • #19: MIKE - Do as much as you can to simplify, but it’s still complex. Spread the knowledge wealth. Try to keep everybody up to speed with what’s going on. Keep the “mental models” of the system accurate and up to date (above the line/below the line). Avoid individual/tribal knowledge.
  • #20: KRIS