IT Infrastructure Architecture
Availability Concepts
(chapter 4)
Infrastructure Building Blocks
and Concepts
Introduction
• Everyone expects their infrastructure to be available all the time
• A 100% guaranteed availability of an infrastructure is impossible
Calculating availability
• Availability can neither be calculated nor guaranteed upfront
– It can only be reported on afterwards, when a system has run for some years
• Over the years, much knowledge and experience has been gained on how to design highly available systems
– Failover
– Redundancy
– Structured programming
– Avoiding Single Points of Failure (SPOFs)
– Implementing systems management
Calculating availability
• The availability of a system is usually expressed as a percentage of uptime in a given time period
– Usually one year or one month
• Example of the downtime corresponding to a given availability percentage:
Availability %          | Downtime per year | Downtime per month | Downtime per week
99.8%                   | 17.5 hours        | 86.2 minutes       | 20.2 minutes
99.9% ("three nines")   | 8.8 hours         | 43.2 minutes       | 10.1 minutes
99.99% ("four nines")   | 52.6 minutes      | 4.3 minutes        | 1.0 minutes
99.999% ("five nines")  | 5.3 minutes       | 25.9 seconds       | 6.1 seconds
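The downtime figures in the table follow directly from the availability percentage and the length of the period. A minimal Python sketch (not part of the original slides; the exact minutes per month depend on the assumed month length) that reproduces the table:

```python
# Allowed downtime for a given availability percentage, per year, month, and week.
HOURS_PER_YEAR = 365 * 24            # 8,760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12
HOURS_PER_WEEK = 7 * 24

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime allowed in a period (in hours) at the given availability percentage."""
    return (1 - availability_pct / 100) * period_hours

for pct in (99.8, 99.9, 99.99, 99.999):
    print(f"{pct}%: "
          f"{downtime_hours(pct, HOURS_PER_YEAR):.1f} h/year, "
          f"{downtime_hours(pct, HOURS_PER_MONTH) * 60:.1f} min/month, "
          f"{downtime_hours(pct, HOURS_PER_WEEK) * 60:.1f} min/week")
```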
Calculating availability
• Typical requirements used in service level
agreements today are 99.8% or 99.9% availability
per month for a full IT system
• The availability of the infrastructure must be
much higher
– Typically in the range of 99.99% or higher
• 99.999% uptime is also known as carrier grade
availability
– For one component
– Higher availability levels for a complete system are
very uncommon, as they are almost impossible to
reach
Calculating availability
• It is good practice to agree on the maximum
frequency of unavailability
Unavailability (minutes) | Number of events (per year)
0 – 5                    | <= 35
5 – 10                   | <= 10
10 – 20                  | <= 5
20 – 30                  | <= 2
> 30                     | <= 1
MTBF and MTTR
• Mean Time Between Failures (MTBF)
– The average time that passes between failures
• Mean Time To Repair (MTTR)
– The time it takes to recover from a failure
MTBF and MTTR
• Some components have higher MTBF than others
• Some typical MTBF values:
Component               | MTBF (hours)
Hard disk               | 750,000
Power supply            | 100,000
Fan                     | 100,000
Ethernet network switch | 350,000
RAM                     | 1,000,000
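To put these numbers in perspective: an MTBF of 750,000 hours is roughly 85 years of continuous operation, yet a large fleet of such components still sees regular failures. A small illustration, assuming a constant failure rate of 1 / MTBF (the fleet size is an assumption of mine, not from the slides):

```python
# Rough intuition: expected number of failures per year for a fleet of components,
# assuming a constant failure rate of 1 / MTBF.
HOURS_PER_YEAR = 365 * 24

def expected_failures_per_year(mtbf_hours: float, component_count: int = 1) -> float:
    return component_count * HOURS_PER_YEAR / mtbf_hours

print(expected_failures_per_year(750_000))        # one disk: ~0.01 failures per year
print(expected_failures_per_year(750_000, 1000))  # a fleet of 1,000 disks: ~11.7 per year
```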
MTTR
• MTTR can be kept low by:
– Having a service contract with the supplier
– Having spare parts on-site
– Automated redundancy and failover
MTTR
• Steps to complete repairs:
– Notification of the fault (time before seeing an alarm
message)
– Processing the alarm
– Finding the root cause of the error
– Looking up repair information
– Getting spare components from storage
– Having a technician come to the datacenter with the spare component
– Physically repairing the fault
– Restarting and testing the component
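The MTTR is essentially the sum of the durations of these steps. A hypothetical breakdown (the step durations below are illustrative only, not from the slides) that adds up to the 8-hour MTTR used in the calculation examples that follow:

```python
# Hypothetical breakdown of an 8-hour MTTR over the repair steps listed above.
repair_steps_hours = {
    "notification of the fault": 0.5,
    "processing the alarm": 0.5,
    "finding the root cause": 2.0,
    "looking up repair information": 0.5,
    "getting spare components from storage": 0.5,
    "technician travels to the datacenter": 2.0,
    "physically repairing the fault": 1.0,
    "restarting and testing the component": 1.0,
}

mttr_hours = sum(repair_steps_hours.values())
print(f"MTTR = {mttr_hours} hours")  # 8.0 hours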
Calculation examples
Availability = MTBF / (MTBF + MTTR) × 100%

Component                          | MTBF (h)  | MTTR (h) | Availability | Availability in %
Power supply                       | 100,000   | 8        | 0.9999200    | 99.99200
Fan                                | 100,000   | 8        | 0.9999200    | 99.99200
System board                       | 300,000   | 8        | 0.9999733    | 99.99733
Memory                             | 1,000,000 | 8        | 0.9999920    | 99.99920
CPU                                | 500,000   | 8        | 0.9999840    | 99.99840
Network Interface Controller (NIC) | 250,000   | 8        | 0.9999680    | 99.99680
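A minimal sketch of this formula in Python, reproducing the table above (the function and dictionary names are mine):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

components_mtbf = {
    "Power supply": 100_000,
    "Fan": 100_000,
    "System board": 300_000,
    "Memory": 1_000_000,
    "CPU": 500_000,
    "Network Interface Controller (NIC)": 250_000,
}

for name, mtbf in components_mtbf.items():
    a = availability(mtbf, mttr_hours=8)
    print(f"{name}: {a:.7f} ({a * 100:.5f}%)")
```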
Calculation examples
• Serial components: one defect leads to downtime
• Example: the above system’s availability is:
0.9999200 × 0.9999200 × 0.9999733 × 0.9999920 × 0.9999840 × 0.9999680 = 0.99976 = 99.976%
(each component’s availability is at least 99.99%)
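The same serial calculation as a short sketch (values taken from the table above):

```python
import math

# Serial composition: the system is up only if every component is up,
# so the individual availabilities are multiplied.
serial = [0.9999200, 0.9999200, 0.9999733, 0.9999920, 0.9999840, 0.9999680]

total = math.prod(serial)
print(f"{total:.5f} = {total * 100:.3f}%")  # ~0.99976 = ~99.976%
```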
Calculation examples
• Parallel components: one defect causes no downtime!
• But beware of SPOFs!
• Calculate availability as:
A = 1 − (1 − A1)^n
where A1 is the availability of a single component and n is the number of parallel components
• Example with two parallel components, each with 99% availability:
Total availability = 1 − (1 − 0.99)^2 = 0.9999 = 99.99%
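And the parallel formula as a sketch, using the example values A1 = 0.99 and n = 2:

```python
def parallel_availability(single_availability: float, n: int) -> float:
    """Availability of n identical parallel components: A = 1 - (1 - A1)^n."""
    return 1 - (1 - single_availability) ** n

print(parallel_availability(0.99, 2))  # 0.9999 -> 99.99%
```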
Sources of unavailability - human errors
• 80% of outages impacting mission-critical services are caused by people and process issues
• Examples:
• Performing a test in the production environment
• Switching off the wrong component for repair
• Swapping out a working disk in a RAID set instead of the defective one
• Restoring the wrong backup tape to production
• Accidentally removing files
– Mail folders, configuration files
• Accidentally removing database entries
– DROP TABLE x instead of DROP TABLE y
Sources of unavailability - software bugs
• Because of the complexity of most software it
is nearly impossible (and very costly) to create
bug-free software
• Application software bugs can stop an entire
system
• Operating systems are software too
– Operating systems containing bugs can lead to
corrupted file systems, network failures, or other
sources of unavailability
Sources of unavailability - planned maintenance
• Sometimes needed to perform systems management
tasks:
– Upgrading hardware or software
– Implementing software changes
– Migrating data
– Creating backups
• Should only be performed on parts of the
infrastructure where other parts keep serving clients
• During planned maintenance the system is more
vulnerable to downtime than under normal
circumstances
– A temporary SPOF could be introduced
– Systems managers could make mistakes
Sources of unavailability - physical defects
• Everything breaks down eventually
• Mechanical parts are most likely to break first
• Examples:
– Fans for cooling equipment usually break because
of dust in the bearings
– Disk drives contain moving parts
– Tapes are very vulnerable to defects as the tape is
spun on and off the reels all the time
– Tape drives contain very sensitive pieces of
mechanics that can break easily
Sources of unavailability - bathtub curve
• A component failure is most likely when the component is brand new or near the end of its life (hence the "bathtub"-shaped failure curve)
• When a component still works after the first month, it is likely that it will continue working without failure until the end of its life
Sources of unavailability - environmental issues
• Environmental issues can cause downtime:
– Failing facilities
• Power
• Cooling
– Disasters
• Fire
• Earthquakes
• Flooding
Sources of unavailability - complexity of the infrastructure
• Adding more components to an overall system
design can undermine high availability
– Even if the extra components are implemented to
achieve high availability
• Complex systems
– Have more potential points of failure
– Are more difficult to implement correctly
– Are harder to manage
• Sometimes it is better to just have an extra spare
system in the closet than to use complex
redundant systems
Redundancy
• Redundancy is the duplication of critical
components in a single system, to avoid a
single point of failure (SPOF)
• Examples:
– A single component having two power supplies; if
one fails, the other takes over
– Dual networking interfaces
– Redundant cabling
Failover
• Failover is the (semi)automatic switch-over to
a standby system or component
• Examples:
– Windows Server failover clustering
– VMware High Availability
– Oracle Real Application Clusters (RAC) database
Fallback
• Fallback is the manual switchover to an
identical standby computer system in a
different location
• Typically used for disaster recovery
• Three basic forms of fallback solutions:
– Hot site
– Cold site
– Warm site
Fallback – hot site
• A hot site is
– A fully configured fallback datacenter
– Fully equipped with power and cooling
– Applications are installed on the servers
– Data is kept up-to-date to fully mirror the production
system
• Requires constant maintenance of the hardware,
software, data, and applications to be sure the
site accurately mirrors the state of the production
site at all times
Fallback - cold site
• Is ready for equipment to be brought in during
an emergency, but no computer hardware is
available at the site
• Applications will need to be installed and
current data fully restored from backups
• If an organization has very little budget for a
fallback site, a cold site may be better than
nothing
Fallback - warm site
• A computer facility readily available with
power, cooling, and computers, but the
applications may not be installed or
configured
• A mix between a hot site and cold site
• Applications and data must be restored from
backup media and tested
– This typically takes a day
Business Continuity
• An IT disaster is defined as an irreparable
problem in a datacenter, making the datacenter
unusable
• Natural disasters:
– Floods
– Hurricanes
– Tornadoes
– Earthquakes
• Manmade disasters:
– Hazardous material spills
– Infrastructure failure
– Bio-terrorism
Business Continuity
• In case of a disaster, the infrastructure could
become unavailable, in some cases for a longer
period of time
• Business Continuity Management includes:
– IT
– Managing business processes
– Availability of people and workplaces in disaster situations
• Disaster recovery planning (DRP) contains a set of
measures to take in case of a disaster, when
(parts of) the IT infrastructure must be
accommodated in an alternative location
RTO and RPO
• RTO and RPO are objectives in case of a disaster
• Recovery Time Objective (RTO)
– The maximum amount of time within which a business process must be restored after a disaster, in order to avoid unacceptable consequences (like bankruptcy)
RTO and RPO
• Recovery Point Objective (RPO)
– The point in time to which data must be recovered; it defines the maximum "acceptable loss" of data in a disaster situation
– For example, an RPO of 24 hours means that at most one day of data may be lost, so backups must be made at least daily
• RTO and RPO are individual objectives
– They are not related