Michael Richardson
Twitter: @Mr_SPB
© 2011 Energized Work - www.energizedwork.com
Availability and Recoverability
So what is High Availability?
• Five 9s?
• No single point of failure?
• Multiple Data Centres?
• Fault Tolerance?
• Load Balancing?
• Uptime?
The 9’s of Availability
Availability              Downtime per Year
One nine (90%)            36.5 days
Two nines (99%)           3.65 days
Three nines (99.9%)       8.76 hours
Four nines (99.99%)       52.56 minutes
Five nines (99.999%)      5.26 minutes
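
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
# Sketch: yearly downtime implied by an availability percentage.
# Assumption: a 365-day year, which is what the table's figures correspond to.
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year permitted at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for percent in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.2f} minutes")
```

Each extra nine divides the permitted downtime by ten, which is why the jump from three to five nines moves the budget from hours to minutes.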
Problem with the 9’s
• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity: availabilities of chained components multiply
(99.9% * 99.9% * 99.9% ≈ 99.7%)
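
The multiplication above can be sketched as follows, assuming a request has to pass through every component in the chain and that the components fail independently (both are assumptions for illustration):

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Inputs are fractions between 0 and 1; independence is assumed.
    """
    return prod(availabilities)


# Three components at 99.9% each, as in the slide.
chain = [0.999, 0.999, 0.999]
print(f"{serial_availability(chain):.4%}")  # 99.7003%
```

The combined figure is always lower than the weakest component, so quoting a single component's nines says little about the service as a whole.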
SLA availability numbers just aim to provide a level of confidence in a website’s service.
No Single Point of Failure (SPOF)
Two of everything?
Start with this
[Diagram: Users and a single Index.html]
End with this
[Diagram: Users, Firewall 1 and Firewall 2, switch 1 and switch 2, WEB1 and WEB2, APP1 and APP2, DB1 and DB2]
Problems with eliminating SPOF
• It’s expensive ££
• Where do you draw the line?
• Are failures independent?
• Can you guarantee no SPOF?
• Increased complexity
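
Whether failures are independent matters a great deal for the "two of everything" approach. A rough sketch, with illustrative numbers that are not from the talk: a redundant pair looks excellent on paper if failures are independent, but a single shared dependency puts a ceiling on the result.

```python
def redundant_availability(a: float, copies: int = 2) -> float:
    """Availability when any one of `copies` identical components suffices.

    Assumption: failures are independent, which real systems rarely guarantee.
    """
    return 1 - (1 - a) ** copies


# Hypothetical figures: two web servers at 99% each.
pair = redundant_availability(0.99)

# Both servers depend on one shared component (say, a 99.9% switch),
# which sits in series with the pair and caps the overall figure.
with_shared_dependency = pair * 0.999

print(f"independent pair:        {pair:.4%}")
print(f"with shared dependency:  {with_shared_dependency:.4%}")
```

The pair alone reaches four nines, yet the shared component drags the total back below three nines, which is one way the "where do you draw the line?" question shows up in practice.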
Problem: Data Centres Fail
Solution: Get a 2nd Data Centre
Hot/Hot Multisite
• Full range of services available in multiple locations.
• Easy to automate failover of sites
• Data consistency is hard.
• Capacity planning concerns
Hot/Warm Multisite
• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or asynchronously?
Hot/Cold Multisite
• Easy to set up
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?
DR Multisite
• Fingers crossed you never need it.
• How can/should you test it?
• Cloud?
Problems with Multiple sites
• ££ - it’s expensive
• Managing more systems
• Managing consistency of Data
• Managing Capacity
• Is it still fail proof?
• Unless you test it, it’s just a plan
We now have a Complex System
Complex Systems
• More redundancy and automation leads to more complexity.
• More complexity often adds more points of failure.
“How Complex Systems Fail”
Author: Dr. Richard Cook
• Catastrophe is always just around the corner.
• Human operators have dual roles.
• Change introduces new forms of failure.
Failure and Recovery
Questions for the Customer
• What is the cost of downtime?
• What are the RTO and RPO?
RTO = Recovery Time Objective
RPO = Recovery Point Objective
Aggressive RTO & RPO targets are expensive and have a performance impact.
RTO / RPO example
Problem
• Simple DB
• Business can tolerate up to 15 minutes of downtime
• 10 minute window of data loss
Possible solution
1. Continuously replicate data to a 2nd host
2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system.
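
One way to reason about the example is to compare each candidate approach against the two objectives. The sketch below is only illustrative: the 15 minute RTO and 10 minute RPO come from the problem statement, while the strategy names, data-loss windows and recovery times are hypothetical figures chosen for the example.

```python
from dataclasses import dataclass

RTO_MINUTES = 15  # from the example: tolerable downtime
RPO_MINUTES = 10  # from the example: tolerable window of data loss


@dataclass
class Strategy:
    name: str
    worst_case_data_loss_minutes: float  # hypothetical estimate
    estimated_recovery_minutes: float    # hypothetical estimate

    def meets_objectives(self, rto: float = RTO_MINUTES, rpo: float = RPO_MINUTES) -> bool:
        """True when both the recovery time and the data-loss window fit the targets."""
        return (self.estimated_recovery_minutes <= rto
                and self.worst_case_data_loss_minutes <= rpo)


# All numbers below are assumptions for illustration only.
strategies = [
    Strategy("nightly backup only", 24 * 60, 120),
    Strategy("continuous replication to 2nd host", 1, 5),
    Strategy("nightly backup + transaction logs copied every 5 min", 5, 60),
]

for s in strategies:
    verdict = "meets" if s.meets_objectives() else "misses"
    print(f"{s.name}: {verdict} RTO/RPO")
```

Under these assumed figures only replication satisfies both targets on its own, while log shipping satisfies the RPO but not the RTO, which is consistent with the slide proposing the two measures together rather than as alternatives.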
So what’s more important?
Increasing Availability or Reducing Recovery Time
MTBF (mean time between failures) or MTTR (mean time to recovery)?
What about MTTD (mean time to detect)?
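
A common way to relate these measures is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to recovery. The sketch below uses that formula and, as an assumption for illustration, treats detection time (MTTD) as extra outage time added on top of the repair itself. The numbers are made up.

```python
def availability(mtbf_hours: float, mttr_hours: float, mttd_hours: float = 0.0) -> float:
    """Steady-state availability: uptime / (uptime + outage).

    Assumption: an outage lasts from failure until recovery, so detection
    time is added to repair time.
    """
    outage_hours = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + outage_hours)


baseline = availability(mtbf_hours=1000, mttr_hours=1)
doubled_mtbf = availability(mtbf_hours=2000, mttr_hours=1)
halved_mttr = availability(mtbf_hours=1000, mttr_hours=0.5)
slow_detection = availability(mtbf_hours=1000, mttr_hours=1, mttd_hours=1)

print(f"baseline:        {baseline:.5%}")
print(f"doubled MTBF:    {doubled_mtbf:.5%}")
print(f"halved MTTR:     {halved_mttr:.5%}")
print(f"1h to detect:    {slow_detection:.5%}")
```

Doubling MTBF and halving MTTR give the same availability under this formula, so the choice comes down to which is cheaper to achieve, and slow detection can undo the benefit of a fast repair.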
Answer? It depends.
Failure is inevitable
Ask anyone
Thank you
The End
Twitter - @Mr_SPB

System Availability Talk

Editor's Notes

  • #2: 28/10/10 © Energized Work Limited 2010 Agile Evangelists - LEANING
  • #4: Ask any business how much downtime is acceptable and you will get a consistent answer.
  • #6: Found more in marketing literature than in technical literature.
  • #7: An SLA is just an instrument that makes business people comfortable (just like insurance).
  • #12: 1 & 2: Diminishing returns. Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. Cascading failures.
  • #15: Read & write anywhere. Global Server Load Balancing with DNS.
  • #16: Read-intensive apps are well suited to this: reads are Hot/Hot.
  • #17: Cold site is so untrusted that perhaps spending hours restoring the primary DC is a better and safer bet.
  • #18: Cold site is so untrusted that perhaps spending hours restoring the primary DC is a better and safer bet.
  • #19: Talk about capacity planning. Hot/Hot: config switches. Most companies don’t thoroughly test DC failover. When failure occurs many companies will often focus on restoring the failure in the primary DC rather than attempt a failover. So why bother having a 2nd DC anyway? If you plan on having multiple DCs or DR then test your procedures when you’re not in an emergency situation. Game Day events.
  • #22: Mention John Allspaw’s QCon talk. 2. Dual roles of humans: defenders against failure and producers of failure. 3. Introducing a technology change to prevent low-consequence but high-frequency failures may introduce low-frequency but high-consequence failures, and new pathways to large-scale, catastrophic failures. The focus of humans is on the beneficial characteristics of the change. New failures may be difficult to foresee. Give config management example: Knife, resolv.conf. 3. Also covers maintenance and why many find it difficult. Build and forget mentality.
  • #24: Cost of downtime: easy or difficult to measure? Can downtime actually be equated to lost revenue? Give online shopping example.
  • #25: RTO and RPO are often in competition. Give an example of replication lag between 2 sites. Zero RPO example: if replication lags between systems and you have an aggressive RPO, you may be better off taking a few hours of outage and focusing on restoring your primary site. Zero RTO example: if replication lags between DCs, you may decide to fail over immediately and take the data loss for some in-flight transactions. Aggressive RTO & RPO is expensive and has a performance impact.
  • #27: Typical nightly backups aren’t going to cut it. Common practice is to back up systems nightly. Is your business happy to lose up to 24 hours of data? Probably not.
  • #28: 1. Covers you for any catastrophic hardware failure. 2nd host has independent storage infrastructure. Data corruption would however result in 2 copies of crap. 2. Covers you for data corruption. Playing back transaction logs will also allow you to identify the place where corruption occurred.
  • #30: What about MTTD?
  • #31: My experience tells me most companies focus on availability. How many companies take nightly tape backups but have never bothered trying to restore or test them? If you think you can build a completely fail-proof system you are kidding yourself. How many companies have game days?