OPENSTACK HA @PAYPAL

OpenStack Summit – Hong Kong - 2013
ABOUT PAYPAL
PayPal offers flexible and innovative payment solutions for consumers
and merchants of all sizes.

• 137,000,000 users
• $300,000 in payments processed each minute
• 193 markets / 26 currencies
• The World's Most Widely Used Digital Wallet

AGENDA
Why is HA important for PayPal?

Our Learning
Our Solution
What is not solved?
Q&A

WHY HA IS IMPORTANT?
“no perceived downtime” for cloud users

Enterprise Class
Auto Scaling & Flex up/down can never break
API Integrations always succeed
Everyone is expected to use the cloud

AVAILABILITY REQUIREMENTS
No SPOF “Under the Cloud”

Scale Across the Data Center(s)
Scale Across Racks & Containers

Respect natural availability zones within the data centers
No 'cloud' can impact any other 'cloud'

INFRASTRUCTURE RACK
Layer 2 versus Layer 3

[Diagram: Infrastructure / Controller Racks and Compute Racks, each labeled with 10g Active, 10g Passive, and 1g Mgmt; LB Active and LB Passive at the Access layer.]

Cattle & Puppies
INFRASTRUCTURE RACK

OpenStack services all run as VMs on KVM
Every infra component resides on 2+ nodes
Redundant physical racks
Redundant power/switches in each rack
Layer-3 connectivity between racks (no Layer 2)
Enterprise Grade Physical LB (floating VIP)
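
The "2+ nodes" and "redundant racks" rules above can be checked mechanically. The following is a minimal sketch, not PayPal's tooling: the placement data, the node and rack names, and the `find_spofs` helper are all hypothetical and exist only to illustrate the rule.

```python
from collections import defaultdict

# Hypothetical placement records: (component, node, rack).
# None of these names come from the talk.
PLACEMENT = [
    ("keystone", "infra-vm-01", "rack-a"),
    ("keystone", "infra-vm-11", "rack-b"),
    ("glance", "infra-vm-02", "rack-a"),
    ("glance", "infra-vm-12", "rack-b"),
    ("rabbitmq", "infra-vm-03", "rack-a"),
]


def find_spofs(placement):
    """Return components that run on fewer than 2 nodes or fewer than 2 racks."""
    nodes = defaultdict(set)
    racks = defaultdict(set)
    for component, node, rack in placement:
        nodes[component].add(node)
        racks[component].add(rack)
    return sorted(
        c for c in nodes if len(nodes[c]) < 2 or len(racks[c]) < 2
    )


# rabbitmq has a single instance in a single rack, so it is reported.
print(find_spofs(PLACEMENT))
```

A check like this could run before any change to the infrastructure racks, so that a component never silently drops to a single node or a single rack.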
COMPUTE
[Diagram: Access layer with LB Active / LB Passive pairs; racks numbered 1 to 3, each labeled with 10g Active, 10g Passive, and 1g Mgmt.]

Compute Node (shown four times): 96 Hyperscale, 16 Core, 256GB RAM, 1.1T Disk
COMPUTE

[Diagram: Active and Passive Top Of Rack switches (10g) and Management switches (1g); two Hyperscale nodes (Raid-10), each with two 10g links in bond0 and a 1g link.]
OPENSTACK SERVICES

[Diagram: service topology. The recoverable labels are listed below; the extraction does not preserve which port belongs to which connection.]

Components: Browser; F5 Load Balancer; Openstack Controller (x3); Quantum Server (x2); swift storage node (x3); Nicira NVP Controller (x3); NVP Service Node (x3); NVP Gateway (x3); Compute Node Hypervisor; MYSQL DB (x2, mysql 5); Mongo DB (x2); Puppet DB; Puppet VIP; DNS Master; UDNS (DNSaaS) (x2); LBaaS (x2); Remedy API

Services: httpd (dashboard); keystone (keystone-admin, keystone-api); nova (nova-api, novametadata-api, novavolume-api); glance (glance-admin, glance-reg); quantum (quantum-api); swift (swift-proxy, swift-object, swift-container, swift-account); mq; puppet; OpenVswitch (ovs-vswitchd, ovsdb-server); openflow; mgmt port

Ports (TCP unless noted): 22, 53, 80, 161 (TCP and UDP), 443, 3115, 5000, 6000, 6001, 6002, 6632, 6633, 8080, 8140, 8773, 8774, 8776, 9191, 9292, 9696, 10053, 35357, 61613, xxxx
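
The ports on the services slide lend themselves to a simple reachability check. The sketch below is illustrative only. The pairing of ports to services is an assumption based on common OpenStack defaults of that era, because the slide layout does not survive extraction. `SERVICE_PORTS` and `is_listening` are hypothetical names, not part of the talk.

```python
import socket

# Assumed port-to-service pairing (common defaults, not confirmed by the slide).
SERVICE_PORTS = {
    "keystone-api": 5000,
    "keystone-admin": 35357,
    "nova-api": 8774,
    "novavolume-api": 8776,
    "glance-api": 9292,
    "glance-reg": 9191,
    "quantum-api": 9696,
    "swift-proxy": 8080,
    "swift-object": 6000,
    "swift-container": 6001,
    "swift-account": 6002,
    "puppet": 8140,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_services(host, ports=SERVICE_PORTS):
    """Return the sorted names of services that do not accept connections on host."""
    return sorted(
        name for name, port in ports.items() if not is_listening(host, port)
    )
```

In practice `host` would be the load balancer VIP for each service rather than an individual node, so that the check exercises the same path that clients use.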
OPENSTACK CONSIDERATIONS
LB VIP for every service (unless it can't)
Connect to LB VIP, not individual nodes
Script to close Server Connections
Pacemaker only works inside a single Layer-2 domain (not across a large enterprise)

Auto Restart using Monit
MySQL

Swift Cluster
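
The rule "connect to the LB VIP, not individual nodes" can be enforced with a small configuration check. This is a minimal sketch under stated assumptions: the VIP hostnames, the endpoint URLs, and the `non_vip_endpoints` helper are all hypothetical, since the talk does not give real names.

```python
from urllib.parse import urlparse

# Hypothetical VIP hostnames; real names are not given in the talk.
VIP_HOSTS = {"keystone-vip.example.internal", "glance-vip.example.internal"}

# Hypothetical endpoint configuration for one client service.
ENDPOINTS = {
    "identity": "http://keystone-vip.example.internal:5000/v2.0",
    "image": "http://glance-node-02.example.internal:9292/v1",
}


def non_vip_endpoints(endpoints, vip_hosts):
    """Return the names of endpoints whose host is not a known LB VIP."""
    return sorted(
        name
        for name, url in endpoints.items()
        if urlparse(url).hostname not in vip_hosts
    )


# The image endpoint points at an individual node, so it is flagged.
print(non_vip_endpoints(ENDPOINTS, VIP_HOSTS))
```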

CONTINUED…
HEAT with Corosync/Pacemaker/keepalived (for now)

Keystone / Nova / Glance / Swift Proxy
RabbitMQ Cluster
Cinder Volume Service

CINDER SERVICES WORKFLOW
[Diagram: User request (create volume), Cinder API, AMQP, Cinder Scheduler, Cinder Volume, Storage Backend1, Storage Backend2; arrows numbered 1 to 6.]

The figure shows a typical interaction between Cinder components to serve an end-user request (creating a new volume in this example).
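
The create-volume workflow can be sketched in a few lines. This is a toy model, not Cinder code. It assumes the usual ordering: the API publishes a request to the queue, the scheduler picks a backend, and the volume service creates the volume on it. In-process `queue.Queue` objects stand in for the AMQP broker, and the function names and the "fewest volumes" scheduling rule are illustrative assumptions.

```python
import queue

scheduler_q = queue.Queue()  # stands in for the AMQP queue read by the scheduler
volume_q = queue.Queue()     # stands in for the AMQP queue read by the volume service
BACKENDS = {"backend1": [], "backend2": []}  # volumes created per storage backend


def api_create_volume(name, size_gb):
    """Steps 1-2: accept the user request and publish it for the scheduler."""
    scheduler_q.put({"name": name, "size_gb": size_gb})
    return "creating"


def scheduler_step():
    """Steps 3-4: consume one request, pick a backend, hand off to the volume service."""
    request = scheduler_q.get()
    # Illustrative rule: choose the backend that currently holds the fewest volumes.
    backend = min(BACKENDS, key=lambda b: len(BACKENDS[b]))
    volume_q.put({**request, "backend": backend})


def volume_step():
    """Steps 5-6: consume one request and create the volume on the chosen backend."""
    request = volume_q.get()
    BACKENDS[request["backend"]].append(request["name"])
    return request["backend"]


api_create_volume("vol-a", 10)
scheduler_step()
print(volume_step())
```

Because each stage only talks to the queue, any stage can be replaced or duplicated without the others knowing, which is what makes the HA layout on the next slide possible.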
CINDER SERVICES WITH HA
[Diagram: User request (create volume), Load Balancer, Cinder API A / Cinder API B, AMQP Cluster, Cinder Scheduler A / Cinder Scheduler B, Cinder Volume A / Cinder Volume B, Storage Backend1, Storage Backend2; arrows numbered 1 to 6.]

How HA is implemented for Cinder components:

• API (stateless) – Load Balancer (A/A or A/P)
• Scheduler (stateless) – Pacemaker, queue itself (A/A or A/P)
• Volume – Pacemaker, queue itself (A/A or A/P)
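
The "queue itself" option can be illustrated with competing consumers: two workers read from one shared queue, and if one dies the other keeps draining it. This is a simplified, single-threaded sketch. The `Worker` class and `drain` function are hypothetical, and a real deployment would rely on the AMQP broker's delivery semantics rather than this loop.

```python
import queue


class Worker:
    """A consumer (for example, a scheduler instance) reading a shared queue."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.handled = []

    def poll(self, q):
        """Take one message if this worker is alive and one is waiting."""
        if not self.alive:
            return False
        try:
            message = q.get_nowait()
        except queue.Empty:
            return False
        self.handled.append(message)
        return True


def drain(q, workers):
    """Let workers take turns polling the shared queue until it is empty."""
    while not q.empty():
        progressed = False
        for worker in workers:
            progressed = worker.poll(q) or progressed
        if not progressed:
            raise RuntimeError("no live consumer for queued messages")


shared = queue.Queue()
a, b = Worker("scheduler-a"), Worker("scheduler-b")
for i in range(4):
    shared.put(i)
drain(shared, [a, b])   # active/active: both workers share the load

a.alive = False         # simulate losing scheduler A
for i in range(4, 6):
    shared.put(i)
drain(shared, [a, b])   # scheduler B picks up everything that remains
```

No failover step is needed in this model: the surviving consumer simply continues to read from the queue. That is the property the slide relies on for the stateless scheduler.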
UNRESOLVED
VIP-friendly Cinder Volume service

Seamless Upgrade Flip
Failed DB TX Reconciliation
Consistent API Response Time

cloud@paypal.com


Confidential and Proprietary
THANK YOU
HTTP://GITHUB.COM/PAYPAL/AURORA
SCOTT CARLSON - @RELAXED137
RAJ GEDA
ZHITENG HUANG IRC:WINSTON-D


High Availability OpenStack at PayPal - OpenStack Summit Fall Hong Kong 2013


Editor's Notes

  • #3: So a little bit about PayPal before we start: let's quickly run through some key details on what PayPal is and what we do. We're a payments company. You can think of PayPal as a digital wallet: one convenient, secure spot to keep all your ways to pay. And PayPal is not just on the internet for you to send money to a friend or buy something on eBay. Along with numerous merchants that let you pay with PayPal online, we are also in-store, in places like Home Depot and GNC. With this brick-and-mortar presence, you can leave your wallet at home, punch in your phone number and PIN code, and still buy something. With payment innovations like that, we continue to grow, as these numbers show: 137m active users, 300,000 dollars' worth of payments per minute. This tells you that scale is important to us, and we scale on a global basis to meet the needs of our customers worldwide, especially here in Asia. We're talking about nearly 200 markets and 26 currencies. We literally are the world's most widely used digital wallet.
  • #4: Shift from Enterprise design model to cloud-based design. Elastically scale and self-heal infrastructure to accommodate unpredictable usage patterns of customers and internet commerce. Separate rapidly iterating customer experiences from core services. Reduce overall cost per transaction within the environment.
  • #7: Infrastructure Rack only for Cloud Management Gear. Compute racks scale until IP addresses run out. Neutron network(s) … NVP Gateway Limit …
  • #8: Infrastructure Rack only for Cloud Management Gear. Compute racks scale until IP addresses run out. Neutron network(s) … NVP Gateway Limit …
  • #9: Two Entry Points for Infrastructure: PayPal Product Developers; Cloud Operators to manage Cloud. Centrally Orchestrated using Heat. Local Storage HP 4X600 GB (Mirror. Cisco 4948 & Arista 7050. Nicira NVP. F5 10.2.2 LB.
  • #12: http://www.palominodb.com/blog/2012/12/10/benchmarking-ndb-vs-galera MariaDB. Bottleneck on LB during image transfer. Heat has active/standby support, no active/active cluster.
  • #13: http://www.palominodb.com/blog/2012/12/10/benchmarking-ndb-vs-galera MariaDB. Bottleneck on LB during image transfer. Heat has active/standby support, no active/active cluster. Cinder Volume Service doesn't play well with load balancer and VIP.
  • #16: Talk about Cinder HA issues. VM create issues due to failed RabbitMQ message delivery. Issues in upgrading without downtime for major version rollouts. No auto cleanup for stale DB rows. The API response is not consistent due to DB locks and DB connection threads.