© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
‘fsck’ for OpenStack
Wei Tian -- Cloud Performance Lead at PayPal
Zhenhua Feng -- Staff Software Engineer
10/27/2015
Detect Resource Leaking and Keep the Cloud Consistent
Agenda
• Some numbers about the PayPal cloud
• What makes our cloud inconsistent
• Our solutions to keep our cloud consistent
About PayPal Cloud
• Background
– Started in July 2012 with 1 engineer and 16 decommissioned servers
– Today, one of the world’s largest OpenStack private clouds
– Number of VMs: 82,000
– Number of physical servers: 8,064
– Number of racks: 84
– Total cores: 386,000
– Block storage: 2 petabytes
– Largest availability zone (AZ): 2,500+ hypervisors
• Business Goals
– Hosting ~100% of PayPal’s production traffic (except Databases and Messaging)
– Powers 100% of PaaS, Dev/QA and M&As
– First production workload on SDN in 2013
What Makes the Cloud Inconsistent?
Misconfiguration
• VPC
• Flavor
• Image properties
• Host aggregate metadata
• Default security group
• Networks
• Volumes
Resource Leaking
• VM sprawl
• Inconsistent cinder volume states
• Orphaned ports
• Inconsistent DNS entries
• Inconsistent states between neutron and NSX
• Inconsistent states caused by RPC timeout
• Inconsistent DB states between API and Compute cells
Misconfiguration
• In the PayPal cloud, an administrator does the initial resource allocation and configuration.
• The resources to set up include VPCs, flavors, images, networks, host aggregates, etc.
• The administrator uses the OpenStack CLI to create all of these resources and must make sure they match each other.
• As long as we are human, we are bound to make mistakes.
Scenario that may have misconfiguration (1)
Scenario that may have misconfiguration (2)
Resource Leaking
• When running OpenStack, the state of a resource (volume, instance, port, etc.) can sometimes become inconsistent across the cluster.
• Sometimes the state cannot be corrected through the REST API alone.
• You may need to manually edit the database or run a shell script on the hypervisor to correct the state.
• Note that it is important to find and fix the underlying issue; editing the database or running a shell script is just a quick hack.
• However, as a service operator, you also need to fix the issue right away to meet the SLA, before engineering fixes the code.
Resource Leaking (1)
• VM sprawl can cause major performance and capacity problems. Leaked resources include zombie VMs and orphaned disk files.
• The state of a volume or an instance can be inconsistent. A volume shows as attached in nova but not in cinder, or vice versa. Sometimes a volume deletion hangs, or a detach does not work.
• Orphaned neutron ports. Ports are not deleted when the VM is deleted, or ports exist without a device_id. Leaked ports in turn leak IP addresses and DNS entries.
• Inconsistent state between neutron and the NSX controller. A port is deleted from neutron but still exists in NSX.
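The two orphaned-port cases above can be sketched as a simple check. This is a minimal illustration, assuming the port and server records have already been fetched as plain dictionaries; the function name `find_orphaned_ports` and the record shape are hypothetical and are not taken from the tool described in the talk. Only compute-owned ports are tested for a missing VM, since router and DHCP ports carry a device_id that is not an instance.

```python
def find_orphaned_ports(ports, server_ids):
    """Return (port_id, reason) pairs for ports that look leaked.

    ports: iterable of dicts with 'id', 'device_id' and 'device_owner'
           (field names mirror Neutron port attributes).
    server_ids: set of instance UUIDs that currently exist in Nova.
    """
    orphaned = []
    for port in ports:
        device_id = port.get("device_id") or ""
        owner = port.get("device_owner") or ""
        if not device_id:
            # Case 1: the port has no device_id, so nothing claims it.
            orphaned.append((port["id"], "no device_id"))
        elif owner.startswith("compute:") and device_id not in server_ids:
            # Case 2: the port outlived the VM it was bound to.
            orphaned.append((port["id"], "VM deleted"))
    return orphaned
```

A production cleaner would probably also apply an age threshold before deleting anything, because a freshly created port may legitimately have no device_id yet.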
Resource Leaking (2)
• Inconsistent DNS entries. One IP with multiple DNS entries, multiple IPs with the same DNS entry, or a failure to create or delete a DNS entry.
• Inconsistent states caused by RPC timeout. The caller reports an RPC timeout, but the handler completed the job and only failed to reply.
• Inconsistent states between the API cell and compute cell DBs.
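The first two DNS symptoms above (one IP with several names, one name with several IPs) can be detected by indexing the records both ways. The sketch below assumes the records are available as (hostname, ip) pairs; the function name `find_dns_conflicts` and the result keys are illustrative assumptions.

```python
from collections import defaultdict


def find_dns_conflicts(records):
    """Group (hostname, ip) pairs and report the two conflict kinds."""
    names_by_ip = defaultdict(set)
    ips_by_name = defaultdict(set)
    for name, ip in records:
        names_by_ip[ip].add(name)
        ips_by_name[name].add(ip)
    return {
        # One IP that resolves back to more than one DNS entry.
        "ip_with_multiple_names": {
            ip: sorted(names) for ip, names in names_by_ip.items() if len(names) > 1
        },
        # One DNS entry that points at more than one IP.
        "name_with_multiple_ips": {
            name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1
        },
    }
```

Some setups intentionally map one name to several IPs (for example round-robin DNS), so a real checker would likely need an allow-list.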
Introducing CloudKeeper
CloudBuilder
• Resolves misconfiguration.
• Eliminates manual steps in setting up an OpenStack cloud.
• Automates the entire setup process to avoid human error.
• Declarative instead of imperative. All settings are described in a set of config files called the Blueprint.
• Like Puppet, CloudBuilder continuously pushes changes from the Blueprint to the OpenStack cloud and keeps them in sync.
CloudSweeper
• Resolves resource leaking.
• Has a task manager that triggers all plugin tools periodically.
• Logs the results of each cleaning tool and reports them to a dashboard for statistics and troubleshooting.
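The task manager and plugin arrangement described for CloudSweeper can be pictured roughly as follows. This is only a sketch under assumptions: the class name `TaskManager`, the plugin signature (a callable returning a count of cleaned items) and the result format are hypothetical, since the slides do not show the actual design.

```python
import logging
import time

logger = logging.getLogger("cloudsweeper_sketch")


class TaskManager:
    """Runs registered cleaner plugins and collects their results."""

    def __init__(self):
        self._plugins = {}

    def register(self, name, func):
        # func is a zero-argument callable returning how many items it cleaned.
        self._plugins[name] = func

    def run_once(self):
        results = {}
        for name, func in self._plugins.items():
            try:
                results[name] = {"ok": True, "cleaned": func()}
            except Exception as exc:
                # One failing plugin must not stop the others.
                logger.exception("plugin %s failed", name)
                results[name] = {"ok": False, "error": str(exc)}
        return results

    def run_forever(self, interval_seconds, report):
        # report is a callable that ships results to a dashboard or log sink.
        while True:
            report(self.run_once())
            time.sleep(interval_seconds)
```

Isolating each plugin in its own try block keeps a bug in one cleaner from hiding results from the rest, which matters when the results feed a dashboard.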
CloudBuilder -- Blueprint
Everything Data-Driven
We define the initial setup for the cloud in a set of JSON files. CloudBuilder creates all the resources based on the JSON files:
• VPC metadata
• Flavor class
• VPC networks
• VPC host-aggregate
• VPC images
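A declarative setup of this kind usually reduces to comparing the desired state in the JSON files with what currently exists. The sketch below illustrates that comparison for flavors; the JSON layout, the field names and the function `plan_changes` are assumptions made for illustration and do not reproduce the actual Blueprint format, which the slides show only as images.

```python
import json

# Hypothetical Blueprint fragment; the real file layout is not shown in the slides.
BLUEPRINT = json.loads("""
{
  "flavors": {
    "small": {"vcpus": 1, "ram_mb": 2048},
    "large": {"vcpus": 8, "ram_mb": 16384}
  }
}
""")


def plan_changes(desired, actual):
    """Compare desired and actual resources, both keyed by name."""
    desired_names = set(desired)
    actual_names = set(actual)
    return {
        "create": sorted(desired_names - actual_names),
        "update": sorted(
            name for name in desired_names & actual_names
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual_names - desired_names),
    }
```

Whether entries in the "delete" list are acted on automatically or only reported is a policy choice; the slides do not say which approach CloudBuilder takes.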
CloudBuilder – Blueprint – VPC metadata
CloudBuilder – Blueprint – VPC Resources
CloudBuilder – Add Hypervisor to Host Aggregate
A new hypervisor can be automatically added to the right host aggregates based on its characteristics.
The hypervisor asset information can be retrieved from the CMS (Configuration Management System).
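Matching a hypervisor to host aggregates by its characteristics could look like the sketch below. It assumes each aggregate declares the asset attributes it requires and that the CMS record is available as a dictionary; the attribute names and the function `matching_aggregates` are hypothetical.

```python
def matching_aggregates(hypervisor, aggregates):
    """Return names of aggregates whose required attributes all match.

    hypervisor: dict of asset attributes, e.g. as retrieved from the CMS.
    aggregates: dict mapping aggregate name -> dict of required attributes.
    """
    return sorted(
        name
        for name, required in aggregates.items()
        if all(hypervisor.get(key) == value for key, value in required.items())
    )
```

Note that an aggregate with no required attributes matches every hypervisor in this sketch, so a catch-all aggregate should be declared deliberately.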
CloudSweeper
CloudSweeper – Neutron Port Cleaner
CloudSweeper – Volume Cleaner
Symptom
• Mismatched volume state in nova and cinder
• Volume stuck in deleting state
• Missing connection_info in block_device_mapping table
Fix
• Find the real state of the volume from the hypervisor.
• Modify the nova and cinder DBs to reset the state.
• For a volume stuck in deleting state, re-run “nova volume-delete” after cleaning the state in the DB.
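The fix steps above treat the hypervisor as the source of truth. A minimal sketch of that decision logic follows; it only plans actions and does not touch any database. The function name `plan_volume_fix`, its inputs and the action strings are assumptions for illustration, not the actual cleaner.

```python
def plan_volume_fix(nova_attached, cinder_status, attached_on_hypervisor):
    """Return corrective actions, trusting the hypervisor as the real state.

    nova_attached: True if nova's records say the volume is attached.
    cinder_status: cinder's status string, e.g. 'available', 'in-use', 'deleting'.
    attached_on_hypervisor: True if the volume is really attached to a VM.
    """
    if cinder_status not in ("available", "in-use", "deleting"):
        # Other statuses (errors, in-flight operations) are out of scope here.
        return ["manual review"]
    if cinder_status == "deleting" and attached_on_hypervisor:
        # Never finish deleting a volume that a VM is still using.
        return ["manual review"]

    actions = []
    if nova_attached != attached_on_hypervisor:
        actions.append(
            "mark attached in nova" if attached_on_hypervisor else "mark detached in nova"
        )
    expected = "in-use" if attached_on_hypervisor else "available"
    if cinder_status != expected:
        actions.append("set cinder status to " + expected)
    if cinder_status == "deleting":
        # The stuck delete is retried only after the state has been reset.
        actions.append("re-run volume delete")
    return actions
```

Separating the plan from its execution makes it possible to log or dry-run the intended changes before any database is edited.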
Questions?
