Capacity Management and Provisioning
(Cloud's full, can't build here)
Matt Van Winkle, Manager Cloud Engineering @mvanwink 
Andy Hill, Systems Engineer @andyhky 
Joel Preas, Systems Engineer @joelintheory
Public Cloud Capacity at Rackspace
• Rackspace Public Cloud has deployed 100+ cells in ~2 years
• New cells used to require assembly by engineers and took 3-5 weeks after the bare OS install
• One year later, on-shift operators do it in ~1 week (as little as 1 day)
• Usually constrained by networking
Control Plane Sizing
• Data plane operations impact both the cell and the top-level control plane
– Image downloads/uploads
• How large should the Nova DB be?
– Breaking point of the ‘standard’ cell control plane buildout, particularly the database
Cell Sizing Considerations
• Efficient use of private IP address space
– Used for connections to services like Swift and dedicated environments
• Broadcast domains
• Aim for a minimal control plane to limit overhead and complexity
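The private address space constraint can be put into numbers with a short calculation. The sketch below uses Python's standard ipaddress module; the function name, the /22 prefix, and the reserved-address count are illustrative assumptions, not values from the talk.

```python
import ipaddress


def usable_private_addresses(private_cidr: str, reserved: int = 0) -> int:
    """Return how many private addresses a cell's subnet leaves for guests.

    `reserved` is a hypothetical count of addresses held back for gateways,
    control plane nodes, and similar infrastructure.
    """
    network = ipaddress.ip_network(private_cidr)
    # Drop the network and broadcast addresses, then the reserved ones.
    usable = network.num_addresses - 2 - reserved
    return max(usable, 0)


# Assumed example: one /22 broadcast domain per cell, 16 addresses reserved.
print(usable_private_addresses("10.0.0.0/22", reserved=16))  # prints 1006
```

A larger prefix raises the per-cell ceiling but also widens the broadcast domain, which is the trade-off the slide points at.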
Hypervisor Sizing Considerations
• Enough spare drive space for copy-on-write (COW) images
– XenServer (XS) VHD size can easily be 2x the space given to the guest during normal operation!
– Errors in cleaning up “snapshots” are exacerbated by tight disk overhead constraints
• Drive space for pre-cached images
– cache_images=some # nova
– use_cow_images=True # nova
– cache_in_nova=True # glance
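A rough disk budget shows how the 2x VHD growth and the image cache eat into guest density. This is a minimal sketch: the function name and the disk, guest, and cache sizes are assumptions for illustration, and the 2x factor is simply the worst case quoted on the slide.

```python
def max_guests_by_disk(total_disk_gb: float,
                       guest_disk_gb: float,
                       image_cache_gb: float,
                       vhd_overhead_factor: float = 2.0) -> int:
    """Estimate how many guests fit on one hypervisor's local disk.

    Space for pre-cached images is set aside first, then each guest is
    budgeted at its flavor disk size times the assumed VHD overhead factor.
    """
    usable_gb = total_disk_gb - image_cache_gb
    if usable_gb <= 0 or guest_disk_gb <= 0:
        return 0
    per_guest_gb = guest_disk_gb * vhd_overhead_factor
    return int(usable_gb // per_guest_gb)


# Assumed example: 2 TB of local disk, 40 GB guests, 100 GB image cache.
print(max_guests_by_disk(2000, 40, 100))  # prints 23
```

With no overhead factor the same host would hold 47 such guests, so the choice of factor roughly halves the disk-bound density in this example.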
Other Sizing Notes
• Need reserve space for emergencies (host evacuation)
• Reserve space is cell-bound because instances cannot move between cells
– https://review.openstack.org/#/c/125607/
– cells.host_reserve_percent
• VM overhead
– https://wiki.openstack.org/wiki/XenServer/Overhead
– https://review.openstack.org/#/c/60087/
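Because the reserve has to live in the same cell as the host being evacuated, one way to reason about a reserve percentage is to ask how much of the cell's RAM its busiest host is using. The sketch below is an illustration only: the Host type and required_reserve_percent are hypothetical names, it is not how nova interprets cells.host_reserve_percent, and it ignores fragmentation across hosts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Host:
    name: str
    total_ram_gb: int
    used_ram_gb: int


def required_reserve_percent(cell_hosts: Sequence[Host]) -> float:
    """Percent of the cell's total RAM needed to evacuate its busiest host.

    Simplification: treats cell RAM as one pool and does not check that
    individual instances fit on individual hosts.
    """
    total = sum(h.total_ram_gb for h in cell_hosts)
    if total == 0:
        return 0.0
    busiest_used = max(h.used_ram_gb for h in cell_hosts)
    return 100.0 * busiest_used / total


# Assumed example: four 100 GB hosts in one cell.
cell = [Host("hv1", 100, 80), Host("hv2", 100, 60),
        Host("hv3", 100, 50), Host("hv4", 100, 70)]
print(required_reserve_percent(cell))  # prints 20.0
```

Smaller cells need a proportionally larger reserve under this model, since one host is a bigger share of the cell.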
Problems 
• Load Balancers 
• Glance and Swift 
• Fraud / Non Payment 
• Routes 
• Road Testing
Load Balancers
• Alternate routes needed for high-bandwidth operations
– Generally Glance
• Load balancer can become a bottleneck
• Database queries returning lots of rows (cell sizing)
Swift and Glance Bandwidth
Problems:
• Creates a single bottleneck
• Imaging speeds are monitored; exceeding thresholds triggers investigation or scale-out
• Cache is not shared between glance-api nodes
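The imaging-speed check can be sketched as a simple throughput threshold. The function name, the record layout, and the 25 MB/s floor are assumptions for illustration; the talk does not state the actual thresholds or how the measurements are collected.

```python
from typing import Iterable, List, Tuple


def slow_transfers(transfers: Iterable[Tuple[str, float, float]],
                   min_mb_per_s: float) -> List[Tuple[str, float]]:
    """Return (image_id, rate) for transfers slower than the threshold.

    Each transfer is (image_id, size_mb, seconds). Records with a
    non-positive duration are skipped rather than guessed at.
    """
    flagged = []
    for image_id, size_mb, seconds in transfers:
        if seconds <= 0:
            continue
        rate = size_mb / seconds
        if rate < min_mb_per_s:
            flagged.append((image_id, rate))
    return flagged


# Assumed example: two 1000 MB image downloads, one healthy and one slow.
samples = [("img-a", 1000, 20), ("img-b", 1000, 100)]
print(slow_transfers(samples, min_mb_per_s=25))  # prints [('img-b', 10.0)]
```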
Swift and Glance Bandwidth
Monitoring / Solutions:
• Need to get downloads out of the control plane's path (compute direct to image store)
• Cache base images
– Pre-seed when possible
– Can cache images to the hypervisor ahead of time for fast cloning: https://wiki.openstack.org/wiki/FastCloningForXenServer
• Shared request IDs between Glance and Swift would be nice
• A shared cache might raise the hit rate and save bandwidth
What about when scaling out doesn’t work? Re-architecture.
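The point about a shared cache can be illustrated with a toy model that counts how often an image has to be fetched from the backing store. Everything here is assumed for the sake of the example: round-robin load balancing, unbounded caches, and the function name backend_fetches. It is not a model of Glance's actual cache.

```python
from itertools import cycle
from typing import Sequence


def backend_fetches(requests: Sequence[str],
                    node_count: int,
                    shared_cache: bool) -> int:
    """Count cache misses (fetches from the image store) for a request stream.

    Requests are spread round-robin over `node_count` API nodes. With
    `shared_cache` all nodes consult one cache; otherwise each has its own.
    """
    caches = [set()] if shared_cache else [set() for _ in range(node_count)]
    nodes = cycle(range(node_count))
    fetches = 0
    for image_id in requests:
        node = next(nodes)
        cache = caches[0] if shared_cache else caches[node]
        if image_id not in cache:
            fetches += 1          # miss: pull from the image store
            cache.add(image_id)
    return fetches


# Assumed example: 12 requests alternating between two base images, 3 nodes.
stream = ["base-linux", "base-windows"] * 6
print(backend_fetches(stream, node_count=3, shared_cache=False))  # prints 6
print(backend_fetches(stream, node_count=3, shared_cache=True))   # prints 2
```

In this model every node must warm its own copy of each image, so the number of store fetches grows with the node count unless the cache is shared.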
Fraud and Non-Payment
Fraud
• Mark instance as suspended
• Still takes capacity
• What to do?
• Account Actioneer
Non-Payment
• Similar to fraud but worse for capacity!
• Try to give the customer as much time as possible to return to the fold
• Same overall strategy as fraud, but instances are kept longer
Road Testing nodes before enabling
• New Cell
– Bypass URLs (cell-specific API nodes)
• Different nova.conf not using cells
– compute_api_class=nova.compute.api.API # before
• Cell tenant restrictions
• Existing Cell/Rekick - Not as easy :(
– How to ensure customer builds don’t land on a box that isn’t road tested?
Managing the Capacity Management
• Supply Chain/Resource Pipeline
• Impact from Product Development
• Gaps/Challenges from upstream
Capacity Pipeline
• Large customer requests
• Triggers
– % used
– # of largest slots per flavor
• IPv4 addresses
– Cells and scheduler are unaware of them :(
– Auditor + Resolver
• Control plane (runs on OpenStack too)
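The two triggers on this slide, percent used and remaining slots for the largest flavor, can be sketched as below. The function names, the RAM-only accounting, and the 80% / 4-slot thresholds are assumptions chosen for the example, not the values used in production.

```python
from typing import Sequence, Tuple

HostRam = Tuple[int, int]  # (total_ram_gb, used_ram_gb)


def percent_used(hosts: Sequence[HostRam]) -> float:
    """Share of the cell's RAM that is already allocated, in percent."""
    total = sum(t for t, _ in hosts)
    used = sum(u for _, u in hosts)
    return 100.0 * used / total if total else 0.0


def slots_for_flavor(hosts: Sequence[HostRam], flavor_ram_gb: int) -> int:
    """Count how many instances of a flavor still fit, host by host."""
    return sum(max(t - u, 0) // flavor_ram_gb for t, u in hosts)


def should_add_capacity(hosts: Sequence[HostRam],
                        largest_flavor_ram_gb: int,
                        max_percent_used: float = 80.0,
                        min_largest_slots: int = 4) -> bool:
    """Fire when either trigger is crossed."""
    return (percent_used(hosts) > max_percent_used
            or slots_for_flavor(hosts, largest_flavor_ram_gb) < min_largest_slots)


# Assumed example: three 100 GB hosts, largest flavor needs 30 GB.
cell = [(100, 90), (100, 40), (100, 70)]
print(slots_for_flavor(cell, 30))     # prints 3
print(should_add_capacity(cell, 30))  # prints True
```

The example cell is only about two-thirds used, yet it trips the slot trigger: free RAM is fragmented across hosts, which is why a per-flavor slot count is tracked alongside the percentage.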
Product Implications
• Keep up with code deploys (hotpatches)
• Adjusting provisioning playbooks for:
– new flavor types
– new configurations/applications (quantum->neutron, nova-conductor)
– control plane changes (10G glance)
– new hardware manufacturers (OCP)
• Non-production environments
Upstream Challenges
• Disabled flag for cells
– Blueprint: http://bit.do/CellDisableBP
– Bug: http://bit.do/CellDisableBug
• Build to “disabled” host
– Testing after a re-provision
– Testing when adding new capacity to an existing cell
• Scheduling based on IP capacity
– New scheduler service?
– Currently handled by an outside service, “Resolver”, similar to Entropy
• General “Cells as first class citizen” effort led by alaski
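Scheduling on IP capacity amounts to filtering out cells that have no addresses left before weighing the rest. The sketch below is a hypothetical stand-alone function, not nova scheduler code and not how Resolver works; CellCapacity, pick_cell, and the sample numbers are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CellCapacity:
    name: str
    free_ram_mb: int
    free_ipv4: int


def pick_cell(cells: Sequence[CellCapacity],
              ram_mb: int,
              ipv4_needed: int = 1) -> Optional[str]:
    """Pick a cell that has both the RAM and the IPv4 addresses for a build.

    Cells lacking either resource are filtered out; among the rest, the
    one with the most free RAM wins. Returns None if no cell qualifies.
    """
    candidates = [c for c in cells
                  if c.free_ram_mb >= ram_mb and c.free_ipv4 >= ipv4_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_ram_mb).name


# Assumed example: the emptiest cell by RAM has run out of IPv4 addresses.
cells = [CellCapacity("cell-1", 64000, 0),
         CellCapacity("cell-2", 16000, 12),
         CellCapacity("cell-3", 32000, 3)]
print(pick_cell(cells, ram_mb=8192))  # prints cell-3
```

A RAM-only choice would have sent the build to cell-1, where it would fail for lack of an address, which is the gap the slide describes.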
Questions? 
THANK YOU 
RACKSPACE® | 1 FANATICAL PLACE, CITY OF WINDCREST | SAN ANTONIO, TX 78218 
US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM 
© RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM
