System Availability Talk

Michael Richardson
Twitter: @Mr_SPB
© 2011 Energized Work - www.energizedwork.com
Availability and Recoverability

So what is High Availability?
• Five 9s?
• No Single point of failure?
• Multiple Data Centres?
• Fault Tolerance?
• Load Balancing?
• Uptime?

The 9’s of Availability
Availability             Downtime per Year
One nine (90%)           36.5 days
Two nines (99%)          3.65 days
Three nines (99.9%)      8.76 hours
Four nines (99.99%)      52.56 minutes
Five nines (99.999%)     5.26 minutes
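The figures in the table follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, which matches the table


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours / 24:.3f} days = {hours:.3f} hours = {hours * 60:.2f} minutes")
```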

Problem with the 9’s
• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity
(99.9% × 99.9% × 99.9% ≈ 99.7%)
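The multiplicity point can be checked in a few lines. This is a sketch that assumes every component in the chain is required and that failures are independent; `serial_availability` is an illustrative name:

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    return prod(availabilities)


chain = [0.999, 0.999, 0.999]  # e.g. three tiers, each at three nines
print(f"end-to-end availability: {serial_availability(chain):.2%}")
```

Each extra component in series lowers the end-to-end figure, even when every component individually meets its target.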

SLA availability numbers
only aim to provide a level of
confidence in a website’s
service.

No Single Point of
Failure (SPOF)

two of everything?

Start with this
Index.html
Users

End with this
WEB1
switch 1 switch 2
WEB2 APP1 APP2 DB1 DB2
Firewall 1 Firewall 2
Users

Problems with
eliminating SPOF
• It’s expensive ££
• Where do you draw the line?
• Are failures independent?
• Can you guarantee No SPOF?
• Increased complexity
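Why the independence question matters can be shown with a rough calculation. This sketch assumes redundant copies fail independently, which is exactly the assumption being questioned; the function names and the 99% / 99.9% figures are hypothetical:

```python
def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of redundant copies, ASSUMING their failures are independent."""
    return 1 - (1 - single) ** copies


def with_shared_dependency(redundant: float, shared: float) -> float:
    """Anything shared by every copy (power, a switch, a deploy) is still in series."""
    return redundant * shared


pair = redundant_availability(0.99, copies=2)
print(f"two independent 99% servers:       {pair:.4%}")
print(f"same pair behind one 99.9% switch: {with_shared_dependency(pair, 0.999):.4%}")
```

Under these assumptions the shared component caps the result: the pair can never be more available than whatever both copies depend on.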

Problem: Data Centres Fail

Solution: Get a 2nd
Data Centre

Hot/Hot Multisite
• Full range of services available in
multiple locations.
• Easy to automate failover of sites
• Data Consistency is hard.
• Capacity Planning concerns

Hot/Warm Multisite
• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or
asynchronously?

Hot/Cold Multisite
• Easy to setup
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?

DR Multisite
• Fingers crossed you never need it.
• How can/should you test it?
• Cloud?

Problems with Multiple sites
• ££ - it’s expensive
• Managing more systems
• Managing consistency of Data
• Managing Capacity
• Is it still fail proof?
• Unless you test it, it’s just a plan

We now have a
Complex System

Complex Systems
• More redundancy and automation leads
to more complexity.
• More complexity often adds more
points of failure.

Author: Dr. Richard Cook
“How Complex Systems Fail”
• Catastrophe is always just around the
corner.
• Human Operators have dual roles.
• Change introduces new forms of failure

Failure and Recovery

Questions for the Customer
• What is the cost of downtime?
• What are the RTO and RPO?

RTO = Recovery Time Objective
RPO = Recovery Point Objective

Aggressive RTO & RPO targets are
expensive and have a
performance impact.

RTO / RPO example
Problem
• Simple DB
• Business can tolerate up to 15 minutes
downtime
• 10 minute window of data loss.

RTO / RPO example
Possible solution
1. Continuously replicate data to 2nd
host
2. Continue with nightly backups and also
copy DB transaction logs from the primary
host to another system.
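One way to compare options against the example's targets (15 minutes RTO, 10 minutes RPO) is a simple check like the one below. This is only a sketch: the `RecoveryPlan` type and every per-plan number are hypothetical placeholders, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    name: str
    worst_case_data_loss_min: float  # longest gap between copies of data leaving the primary
    estimated_restore_min: float     # time to bring the service back after a failure


def meets_objectives(plan: RecoveryPlan, rto_min: float, rpo_min: float) -> bool:
    """True if the plan stays within both the RTO and the RPO."""
    return (plan.estimated_restore_min <= rto_min
            and plan.worst_case_data_loss_min <= rpo_min)


RTO_MIN, RPO_MIN = 15, 10  # targets from the example problem

# All figures below are made up for illustration.
plans = [
    RecoveryPlan("nightly backup only", 24 * 60, 120),
    RecoveryPlan("nightly backup + logs shipped every 5 min", 5, 12),
    RecoveryPlan("continuous replication to 2nd host", 0.1, 2),
]

for plan in plans:
    verdict = "meets" if meets_objectives(plan, RTO_MIN, RPO_MIN) else "misses"
    print(f"{plan.name}: {verdict} RTO/RPO")
```

The restore estimates only mean something if the recovery has actually been rehearsed.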

So what’s more important?
Increasing Availability
Or
Reducing Recovery Time

MTBF
Or
MTTR
What about MTTD??
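The trade-off can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). In this sketch the numbers are hypothetical, and treating detection time (MTTD) as extra downtime on top of MTTR is an assumption about how the two are measured:

```python
def availability(mtbf_hours: float, mttr_hours: float, mttd_hours: float = 0.0) -> float:
    """Steady-state availability: uptime / (uptime + downtime) per failure cycle.

    Assumption: mttr_hours excludes detection time, so mttd_hours is added to it.
    """
    downtime = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime)


baseline = availability(mtbf_hours=1000, mttr_hours=2)
double_mtbf = availability(mtbf_hours=2000, mttr_hours=2)
halve_mttr = availability(mtbf_hours=1000, mttr_hours=1)
slow_detection = availability(mtbf_hours=1000, mttr_hours=1, mttd_hours=1)

print(f"baseline:            {baseline:.4%}")
print(f"double MTBF:         {double_mtbf:.4%}")
print(f"halve MTTR:          {halve_mttr:.4%}")
print(f"halve MTTR, 1h MTTD: {slow_detection:.4%}")
```

Under this model, halving recovery time buys the same availability as doubling time between failures, and slow detection can cancel out a fast repair.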

Answer?
It Depends

Failure is inevitable

Ask anyone

Thank you
The End
Twitter - @Mr_SPB
