Operating a Highly Available
Cloud Service
November 14, 2013

Depankar Neogi
Chief Architect
QuickBase, Intuit Inc.

Presented at Boston Cloud Services Meetup
http://www.meetup.com/Boston-cloud-services/events/141118632/
Agenda

• Intuit and QuickBase
• Building and Running Highly Available Cloud Services
  – People & Process
  – Technology

The single most important thing to keep in mind when designing for High Availability is to anticipate failure.
Improving 60M Lives

• #1 Financial Management Software
• Facilitate $40B Tax Refunds
• #1 for Innovation in Computer Software Industry
• 20% of GDP & Pay 1 in 12
• Apps for >50% of Fortune 500
What is QuickBase?

• Easily customized to meet unique business needs
• Excel to QuickBase in less than 5 minutes
• Brand NEW modern UI enables Ease of Use
• An Enterprise platform to empower your team to build applications
• Requirements, processes and teams evolving constantly
• More than 4,500 companies use QuickBase
• 500,000+ current users

One platform solves jobs across the enterprise: Project Management, IT helpdesk, CRM, Field service, Human resources, etc.
QuickBase – Customized applications matching your unique requirements

• Roles Based UI
• Dashboards & Reports
• Data Storage & Backup
• Secure Access Control
• Relational Data Tables
• Business logic & workflow
• Open extensible APIs
• Common Infrastructure Services
Modern, Easy, Productive, Dynamic, Fast

30 million requests per day
80 K unique visitors per day
100,000 active apps at any time
25 milliseconds median processing time
Supports Dynamic DML, DDL, CRUD
Cloud based Database with a beautiful UX
New QuickBase DIY Data Access

1. QuickBase UI extended with new DIY data sharing
2. New Data Sharing Service
3. Connections to Popular Industry Data

[Diagram components: Liberators, Data Mapping, WSQL Transforms, Virtual tables, Liberator Cache, Library, Warehouse, Scheduler, Repository, ANY API]

Intuit-class infrastructure (security, billing, HADR, hosting)
AVAILABILITY

PSTN Systems Availability SLA

Availability               Downtime/yr   Downtime/month   Downtime/week
99.9999 % (“six nines”)    31.5 secs     2.59 secs        0.605 secs
99.999 % (“five nines”)    5.26 mins     25.9 secs        6.05 secs
Web Services Availability SLA

Availability   Downtime/yr   Downtime/month   Downtime/week
99.95 %        4.38 hrs      21.56 mins       5.04 mins
99.9 %         8.76 hrs      43.8 mins        10.1 mins

http://www.google.com/apps/intl/en/terms/sla.html
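
The downtime budgets in the SLA tables follow directly from the availability percentage. The sketch below shows the arithmetic; the period lengths (365-day year, 30-day month, 7-day week) and the function name are assumptions for illustration, so the monthly figures may differ slightly from the slides, which do not state the month length they used.

```python
from __future__ import annotations

# Assumed period lengths in seconds (365-day year, 30-day month, 7-day week).
SECONDS_PER_PERIOD = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_percent: float) -> dict[str, float]:
    """Return the downtime budget in seconds for each period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return {
        period: seconds * unavailable_fraction
        for period, seconds in SECONDS_PER_PERIOD.items()
    }


if __name__ == "__main__":
    # Print the budget for the targets listed on the SLA slides.
    for target in (99.9, 99.95, 99.999, 99.9999):
        budget = allowed_downtime_seconds(target)
        print(target, {period: round(value, 3) for period, value in budget.items()})
```

Each extra "nine" divides the budget by ten, which is why the step from a web-service SLA to a PSTN-class SLA is so demanding.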
Operating High Availability Service

PEOPLE & PROCESSES

People & Process: Monitoring Business Metrics
• It’s critical to detect a problem before your customers have to tell you or you have to ask them.
• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.
• Five evolutionary questions that monitoring should answer:
  1. Is there a problem?
  2. Where is the problem?
  3. What is the problem?
  4. Why is there a problem?
  5. Will there be a problem?
• External versus Internal Monitoring
http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
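
One way to compare a live business metric against its historical curve is a simple deviation check. This is a minimal sketch, not the method used at QuickBase: the function name, the three-sigma threshold, and the idea of feeding it values from the same time slot on earlier days are all assumptions.

```python
from __future__ import annotations

from statistics import mean, stdev


def is_anomalous(current: float, history: list[float], max_sigma: float = 3.0) -> bool:
    """Flag a value that deviates from the historical mean by more than max_sigma standard deviations.

    `history` is assumed to hold the same metric at the same time slot on
    previous days or weeks (for example, requests per minute at 10:00).
    """
    if len(history) < 2:
        return False  # not enough history to judge
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > max_sigma * spread
```

A check like this answers only the first of the five questions ("Is there a problem?"); locating and explaining the problem still needs system-level monitoring.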
People & Process: Invest in Good Tools

A good tool will help you find the needle in a haystack - fast

• 95 K requests in a 12-hour window
• Peak request rate: 4.3 req/sec (1,286 requests per 5-minute window)
• Processing time: 61 milliseconds per request
People & Process: Incident Management Process
• Incident Management Team (IMT)
• Incident Management Response Plan
• Activating the IMT, notifications
• Having the right break-out rooms
• Classification of the incident
• Communication of the incident
• Time keeper
• Management versus Technical Process
• Tracking:
  – SLA
  – RPO (recovery point objective)
  – RTO (recovery time objective)
• Incident closure, recovery
• Evaluation process
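
Tracking RPO and RTO during an incident comes down to two time differences: how much data was at risk, and how long the service was down. The sketch below illustrates that bookkeeping; the `Incident` fields and the `recovery_report` function are hypothetical names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_replica_at: datetime  # newest data known to have survived the outage
    outage_started_at: datetime
    service_restored_at: datetime


def recovery_report(incident: Incident, rpo: timedelta, rto: timedelta) -> dict:
    """Compare an incident's actual data-loss window and recovery time with the objectives."""
    data_loss_window = incident.outage_started_at - incident.last_good_replica_at
    time_to_recover = incident.service_restored_at - incident.outage_started_at
    return {
        "data_loss_window": data_loss_window,
        "time_to_recover": time_to_recover,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": time_to_recover <= rto,
    }
```

A time keeper on the incident team can record the three timestamps as they happen, so the evaluation process starts from facts rather than recollection.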
People & Process: Runbook and messaging
• Runbook
  – Detailed process for managing the incident
  – Contact Information
  – Managing data center cutover, recovery steps, testing, managing replication
• Messaging book
  – Who is responsible for communication
  – Who creates and approves the message
  – How you communicate
  – At what cadence
  – What you tell your customers
• Social Media Strategy
  – If you are not transparent, your customers will let you know
  – Social Media coordinator – own the channels
People & Process: Service Page

Provide Customers ability to find out the health of the system
and be notified of any service related issues
People & Process: Service Page

Transparency is Key. If you let the customers know what you know,
they will respect you and may remain loyal to your business.
People & Process: Business Fault Isolation
• What if your data center went down
• And the production server is down because the data center is down
• And your email server was in the same data center
• And your marketing server was in the same data center
• And your service page was on a server in the same data center
• How do you communicate with all your customers?

Business Fault Isolation protects your business from a SPOF (single point of failure).
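
The scenario above can be checked mechanically from an inventory of where each business system is hosted. This is a rough sketch under stated assumptions: the service names, the site names, and both function names are made up for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def shared_fate_groups(placement: dict[str, str]) -> dict[str, list[str]]:
    """Group services by hosting site, keeping only sites that host more than one service."""
    by_site = defaultdict(list)
    for service, site in placement.items():
        by_site[site].append(service)
    return {site: sorted(services) for site, services in by_site.items() if len(services) > 1}


def communication_survives(
    placement: dict[str, str],
    production: str = "production",
    channels: tuple[str, ...] = ("email", "service_page"),
) -> bool:
    """True if at least one communication channel is hosted away from production."""
    production_site = placement[production]
    return any(placement[c] != production_site for c in channels if c in placement)
```

For example, with production, email, marketing, and the service page all mapped to one site, `communication_survives` returns `False`; moving only the service page to a separate host makes it return `True`.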
People & Process: Review Process
• SaaS or Operations Review Process should have a fixed
cadence and be led by a company leader
• Review Team should include leaders from:
– Finance
– Compliance & Risk
– CTO
– Operations
– Product

• Dashboard with KPI
• Review Fire drills
• Change Control Process
– Preferably change one thing at a time

Operating High Availability Service

TECHNOLOGIES

The Three Pillars of High Availability
The goal of High Availability and Disaster Recovery (HA/DR) is
to provide Business Continuance through:

Lack of Service Outage = Happy Customers = Greater Business Value

HA/DR directly enhances a customer’s experience through
greater offering availability
High Availability Architecture Principles
• Design for Failure
– Avoid Single Points of Failure
– Graceful Degradation and Soft Dependencies
– Asynchronous Design
– Keep State Confined to Where it is Needed

• Design for Operability
– Design to be Monitored
– Design for Hot Deployment and Rollback
– Automate Where Possible

• Keep Everything “In Production”
• Scale Out (Not Up)
• Keep it Fresh…and Mature
Architecture Patterns for High Availability

1) Active/Passive
2) Active/Active – Swimlanes
3) Active/Active – Single Write Master
4) Store and Forward
Active / Passive

[Diagram: the Primary Data Center (Active Data) feeds the Secondary Data Center (Passive Back Up) through Near Real-time Replication.]
Swimlane Principle
A “Swimlane” is a set of predefined systems and software infrastructure tuned to support a predefined workload.
• Only a portion of an offering’s total users are hosted on any given swimlane

Within a Swimlane:
– Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes
– Offering transactions occur within a Swimlane
– Only access to Shared Services goes outside the Swimlane
– Standard Fault Detection and Fault Recovery methods are used
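
Hosting only a portion of users on each swimlane requires a stable mapping from a customer to a lane. The sketch below uses a hash of the customer identifier; this is an assumption for illustration (the talk does not say how assignment is done), and `swimlane_for` is a hypothetical name.

```python
import hashlib


def swimlane_for(customer_id: str, swimlane_count: int) -> int:
    """Map a customer to a swimlane index using a stable hash of the customer id."""
    if swimlane_count <= 0:
        raise ValueError("swimlane_count must be positive")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a lane index.
    return int.from_bytes(digest[:8], "big") % swimlane_count
```

A modulo mapping like this moves most customers when the number of swimlanes changes, so a production system would more likely keep an explicit customer-to-swimlane directory. The point here is only that every request for a given customer lands in the same lane.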
High Availability with Swimlanes
Application Partitioning via Swimlanes

[Diagram: Internet, DNS, F5 GTM, and F5 LTM in front of two data centers (DC 1, DC 2) and two fault domains (Fault Domain 1, Fault Domain 2). Swimlanes 1 to 4 and their counterparts 1’ to 4’ each contain their own WS, AS, and Storage.]
WS: web server; AS: app server
Swimlanes Support Application Needs
• Scalability
  – Replicated swimlanes add capacity with linear scalability
• Fault Isolation
  – Complete failure only impacts a subset of users due to application partitioning and data sharding
• High Availability
  – Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart
• Disaster Recovery
  – Disaster recovery is achieved through swimlane failover, either in the same or a remote data center
• Automation
  – The identical nature of a swimlane allows for a high degree of operational automation
Active / Active – Swim Lanes

[Diagram: a Global Load Balancer spreads customers across Data Center 1 and Data Center 2, with four swimlanes each serving 25% of customers. Each database node pairs one active database with the passive copy of another, kept current by Replication:]

Active   Passive copy on the same node
DB1      DB3
DB3      DB1
DB2      DB4
DB4      DB2
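
The active/passive pairing in this layout means each database can be served from a second site when its active site fails. The sketch below illustrates that lookup. Which data center hosts which copy is not recoverable from the slide, so the `PLACEMENT` table and the site names `dc1` and `dc2` are assumptions.

```python
from __future__ import annotations

from typing import Optional

# Assumed placement: each database is active in one data center and
# passive (replicated) in the other.
PLACEMENT = {
    "DB1": {"active": "dc1", "passive": "dc2"},
    "DB2": {"active": "dc1", "passive": "dc2"},
    "DB3": {"active": "dc2", "passive": "dc1"},
    "DB4": {"active": "dc2", "passive": "dc1"},
}


def serving_site(db: str, failed_sites: set[str]) -> Optional[str]:
    """Return the site that should serve `db`, preferring the active copy."""
    placement = PLACEMENT[db]
    for role in ("active", "passive"):
        if placement[role] not in failed_sites:
            return placement[role]
    return None  # both copies are unavailable
```

With this placement, losing one data center leaves every database reachable, but the surviving site carries all four swimlanes, so each site needs capacity headroom for the full load.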
Active / Active – Single Write Master

[Diagram: four data centers (DC1, DC2, DC3, DC4), each with a Read Cache; arrows labeled Writes, Updates, and Cache Updates.]
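
In a single-write-master layout, all writes go to one place and each site serves reads from a local cache that is updated afterwards. The toy model below makes the resulting read lag visible. It is a sketch only: the class, its method names, and the per-site pending queues are assumptions, not a description of the system in the diagram.

```python
from __future__ import annotations

from collections import deque


class SingleWriteMaster:
    """Toy model: one authoritative store, one read cache per site."""

    def __init__(self, sites: list[str]) -> None:
        self.master: dict[str, object] = {}
        self.read_caches: dict[str, dict] = {site: {} for site in sites}
        self.pending: dict[str, deque] = {site: deque() for site in sites}

    def write(self, key: str, value: object) -> None:
        """Apply the write on the master and queue a cache update for every site."""
        self.master[key] = value
        for queue in self.pending.values():
            queue.append((key, value))

    def propagate(self, site: str) -> None:
        """Deliver queued cache updates to one site (stands in for async replication)."""
        queue = self.pending[site]
        while queue:
            key, value = queue.popleft()
            self.read_caches[site][key] = value

    def read(self, site: str, key: str):
        """Serve a read from the site's local cache; may be stale until propagation."""
        return self.read_caches[site].get(key)
```

The trade-off shown here is that reads stay local and fast, while a reader at a remote site may briefly see old data after a write.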
Design for Failure: Resiliency Patterns
Throttling versus Circuit Breaker
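
Throttling limits how fast requests are admitted, while a circuit breaker stops calls to a dependency that is already failing. As an illustration of the throttling side, here is a minimal token-bucket sketch; the token bucket is one common throttling technique chosen for this example, not necessarily the one the speaker had in mind, and the class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Admit up to `capacity` requests in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.updated_at = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A rejected request can be answered immediately with a "try again later" response, which degrades service gracefully instead of letting queues build up.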

Circuit Breaker Pattern

Circuit Breaker State Diagram (Caller C, Dependency D)

• Closed
  – On call / pass through
  – Call succeeds / reset count
  – Call fails / count failure
  – Threshold reached / trip breaker
• Open
  – On call / fail
  – On timeout / attempt reset
• Half Open
  – On call / pass through
  – On succeed / reset
  – On fail / trip breaker

Transitions: Closed to Open (Trip breaker), Open to Half Open (Attempt Reset), Half Open to Open (Trip breaker), Half Open to Closed (Reset)

http://techblog.netflix.com/2012_02_01_archive.html
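
The state diagram translates fairly directly into code. The following is a minimal single-threaded sketch of the three states, not the Netflix implementation referenced on the slide; the class name, the default threshold and timeout, and the injectable clock are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call rejected")  # Open: on call / fail
            self.state = "half_open"  # Open: on timeout / attempt reset
        try:
            result = func(*args, **kwargs)  # Closed or Half Open: pass through
        except Exception:
            self._record_failure()
            raise
        self._reset()  # call succeeds / reset
        return result

    def _record_failure(self) -> None:
        self.failure_count += 1
        # A failed trial call in Half Open, or reaching the threshold, trips the breaker.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _reset(self) -> None:
        self.state = "closed"
        self.failure_count = 0
```

A caller wraps each dependency call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treats `CircuitOpenError` as the cue to return a fallback response. A multi-threaded service would also need locking around the state changes.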
Circuit Breaker Pattern: Example
http://techblog.netflix.com/2012_02_01_archive.html

Circuit Breaker Pattern: Example
Example of how threads, network timeouts and retries combine
http://techblog.netflix.com/2012_02_01_archive.html
Examples of Tools for Building HA Systems
• Highly Available DNS – Akamai, Dyn, AWS Route53
• Load Balancing – F5 LTM, F5 GTM, AWS ELB
• Data Replication – Golden Gate
• Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti
• Application Performance – DynaTrace, NewRelic
• Deployment – Perforce, Maven, Nexus, Hudson, Puppet
• Distributed Databases – NuoDB, VoltDB, several NoSQL types
• Distributed Storage – GlusterFS, Atmos, OpenStack
• HA Devices – Veritas Cluster Server
• OS Virtualization – AWS, VMware, Xen, Parallels
• Network Virtualization – AWS, VMware NSX, PLUMgrid
• Caching – Memcached, Akamai, CloudFront
• Failure Injection Testing – Netflix Chaos Monkey
• DDoS Protection – Arbor, Riverbed
Trust Not the Execution Environment
“Everything Fails, All the Time.” – Werner Vogels, CTO of
Amazon.com

Summary: Operating HA Service

• Monitoring Business Metrics
• Incident Management Process
• Runbooks
• Social Media & Messaging
• Service Page
• Business Fault Isolation
• SLA, RPO, RTO
• Failover Drills
• Review Process
• Change one thing at a time

Principles:
– Design for Failure
– Design for Operability
– Keep Everything “In Production”
– Scale Out (stateless)
– Keep it Fresh

Patterns:
– Active/Active
– Swimlanes
– Active/Passive
– Store-Forward

Design:
– Throttling
– Circuit Breaker
– Caching
– Rollback
– Healthchecks

Tools
Thank You!
