Maciej Lasyk, High Availability Explained
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-05
1/14
High Availability Explained
“Anything that can go wrong, will go wrong”
Murphy's law
An electrical explosion and fire Saturday at a Houston data
center operated by The Planet has taken the entire facility offline.
The company claimed power to the facility was interrupted when a
transformer exploded. Officials report that three walls were blown
down, causing a fire.
Three walls of the electrical equipment room on the first floor
blew several feet from their original position, and the underground
cabling that powers the first floor of H1 was destroyed.
High Availability is in the eye of the beholder
CEO: we don't lose sales
Sales: we can extend our offer based on the HA level
Account managers: we don't upset our customers (that often)
Developers: we can be proud – our services are working ;)
System engineers: we can sleep well (and fsck, we love to!)
Technical support: no calls? Back to WoW then.. ;)
So how many 9's?
Monthly: 1 hour of outage means 100% - 0.13888% ~= 99.86112% availability (30-day month)
Yearly: 1 hour of outage means 100% - 0.01142% ~= 99.98858% availability (365-day year)
Availability              Downtime (year)   Downtime (month)
90% (“one nine”)          36.5 days         72 hours
95%                       18.25 days        36 hours
97%                       10.95 days        21.6 hours
98%                       7.30 days         14.4 hours
99% (“two nines”)         3.65 days         7.2 hours
99.5%                     1.83 days         3.6 hours
99.8%                     17.52 hours       86.4 minutes
99.9% (“three nines”)     8.76 hours        43.2 minutes
99.99% (“four nines”)     52.56 minutes     4.32 minutes
99.999% (“five nines”)    5.26 minutes      25.9 seconds
https://jazz.net/wiki/bin/view/Deployment/HighAvailability
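The arithmetic behind the "nines" can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes a 365-day year and a 30-day month.

```python
# Downtime budget for an availability target, and the reverse calculation.
# Assumptions (illustrative only): 365-day year, 30-day month.
from datetime import timedelta

YEAR = timedelta(days=365)
MONTH = timedelta(days=30)


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime fits into `period` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


def availability_after_outage(outage: timedelta, period: timedelta) -> float:
    """Return the availability (in percent) left after `outage` within `period`."""
    return 100 * (1 - outage / period)


if __name__ == "__main__":
    for target in (90, 99, 99.9, 99.99, 99.999):
        print(f"{target}%: {allowed_downtime(target, YEAR)} per year, "
              f"{allowed_downtime(target, MONTH)} per month")
    one_hour = timedelta(hours=1)
    print(f"1h outage in a month: {availability_after_outage(one_hour, MONTH):.5f}%")
    print(f"1h outage in a year:  {availability_after_outage(one_hour, YEAR):.5f}%")
```

Changing `YEAR` or `MONTH` (for example to an average month of 30.44 days) shifts the results slightly, which is why published tables do not always agree to the last digit.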
HA terminology
RPO: Recovery Point Objective; how much data can we afford to lose?
RTO: Recovery Time Objective; how long may recovery take?
MTBF: Mean Time Between Failures; the average time between failures
(failure density function -> reliability function)
https://en.wikipedia.org/wiki/Mean_time_between_failures
SLA: Service Level Agreement;
formal definitions (customer <-> provider)
OLA: Operational Level Agreement; definitions within the organization;
they help us keep the SLAs we provide
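MTBF becomes easier to relate to the "nines" once it is paired with a repair time. The sketch below uses the standard steady-state formula availability = MTBF / (MTBF + MTTR). Note that MTTR (mean time to repair) is not mentioned on the slides and is introduced here as an assumption, as are the function name and the sample numbers.

```python
# Steady-state availability from MTBF and MTTR.
# MTTR is an assumption added for this sketch; the slides only define MTBF.


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up, given mean time between
    failures and mean time to repair (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical component: fails every 1000 hours on average, 1 hour to repair.
    print(f"{steady_state_availability(1000, 1) * 100:.4f}%")
    # Same failure rate, but repair (or failover) takes only 36 seconds.
    print(f"{steady_state_availability(1000, 0.01) * 100:.4f}%")
```

The second call hints at why failover matters: with the same failure rate, shortening the recovery time is what adds the extra nines.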
SLAs..
So what is written in SLAs?
Availability              Downtime (year)   Downtime (month)
90%                       36.5 days         72 hours
95%                       18.25 days        36 hours
97%                       10.95 days        21.6 hours
98%                       7.30 days         14.4 hours
99%                       3.65 days         7.2 hours
99.5% (EC2, EBS)          1.83 days         3.6 hours
99.8%                     17.52 hours       86.4 minutes
99.9% (SoftLayer, IBM)    8.76 hours        43.2 minutes
99.99% (OVH ded. cloud)   52.56 minutes     4.32 minutes
99.999%                   5.26 minutes      25.9 seconds
http://aws.amazon.com/ec2/sla/
http://www.softlayer.com/about/service-level-agreement
https://www.ovh.com/us/dedicated-cloud/security-and-sla.xml
The availability stated in an SLA is only a goal of the service provider
Usually, when it's not met, the provider pays back (part of) the fees
Hetzner?
“We guarantee an annual average of 99% network availability”
“For indirect damages and loss of profits, we are liable only in
cases of intentional or gross negligence. In this case we are
liable only for the contract-typical predictable damage, a
maximum of 100% of the annually fee.”
http://www.hetzner.de/en/hosting/legal/agb
Leaseweb?
(yup - megaupload)
- no %s
- best effort
- $$$ for faster response times
http://www.leaseweb.com/en/support/all-about/sla
How deep is this hole?
app layer (core, db, cache)
data storage
operating system
hardware
networking
location
So we would like to achieve 99.9999%, which is about 30s of downtime per year
Even a Proof of Concept is very hard to provide: that is about 2.5s of downtime per month for the whole stack, so roughly 0.4s per layer!
AOL has 99.999%!
http://highscalability.com/blog/2014/2/17/how-the-aolcom-architecture-evolved-to-99999-availability-8.html
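How a whole-stack target splits across the six layers can be sketched as follows. The model is an assumption made for illustration: the layers are treated as independent and as a serial chain (the service is up only when every layer is up), so the availabilities multiply. The function names are hypothetical.

```python
# Per-layer availability budget for a serial stack of independent layers.
# Assumption: the service is up only if ALL layers are up at the same time.
import math
from typing import Iterable

LAYERS = ["app layer", "data storage", "operating system",
          "hardware", "networking", "location"]
SECONDS_PER_YEAR = 365 * 24 * 3600


def stack_availability(per_layer: Iterable[float]) -> float:
    """Availability of the whole stack: the product of the layer availabilities."""
    return math.prod(per_layer)


def per_layer_availability(target: float, layers: int) -> float:
    """Availability each layer needs so that `layers` equal layers reach `target`."""
    return target ** (1 / layers)


if __name__ == "__main__":
    target = 0.999999  # "six nines" for the whole stack
    needed = per_layer_availability(target, len(LAYERS))
    per_layer_downtime = (1 - needed) * SECONDS_PER_YEAR
    print(f"each layer needs {needed * 100:.7f}% availability")
    print(f"that is about {per_layer_downtime:.2f}s of downtime per layer per year")
    # Six layers at "only" five nines each do not add up to five nines overall.
    print(f"6 x 99.999% -> {stack_availability([0.99999] * 6) * 100:.4f}%")
```

Redundancy inside a layer changes the picture, since parallel components multiply their failure probabilities instead; the serial model only shows how quickly the budget shrinks when every layer is a single point of failure.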
Load-balancing and failover
LB:
http://www.netdigix.com/linux-loadbalancing.php
Failover:
http://www.simplefailover.com/
LB – 4th layer or 7th?
4th layer:
- high performance
- just do the LB work!
- reliable
- scalable
7th layer:
- low cost
- good for quickfixes / patches
- not that scalable
- low performance
- complex codebase
- custom code for protocols
- cookies? what about memcache..
Disaster Recovery
http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments
Hot site: active synchronization, could already be serving traffic. Cost can be high
Warm site: periodic synchronization, DR tests needed. Low costs
Cold site: nothing here – just an echo and some space to spin up services; nightmare
So maybe use the “hot” site as production behind a global LB? Win-win scenario
Planning for failure
Everything starts here - DNS:
- keep TTLs low (300s). Can't get below 60min? That's bad!
- check the SLA of your DNS servers (dnsmadeeasy.com history)
- what do you know about DNSes?
- zero downtime here is a must!
- this can be achieved with complicated network abracadabra
- remember what 99.9999% means?
- round robin is a load-balancer, but without failover!
- GSLB (Global Server Load Balancing) – killed by OS/browser/server caching
- GlobalIP (SoftLayer etc) – workaround for GSLB via routing
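The point about round robin lacking failover can be shown with a small simulation. This is a sketch and not a DNS implementation: the addresses come from the 192.0.2.0/24 documentation range, and `is_healthy` is a hypothetical health-check callback.

```python
# Round robin with and without failover, simulated in memory.
# Addresses and the health check are hypothetical, for illustration only.
from itertools import cycle, islice
from typing import Callable, Iterator, List


def round_robin(addresses: List[str]) -> Iterator[str]:
    """Plain round robin: hand out addresses in turn, dead or alive."""
    return cycle(addresses)


def round_robin_with_failover(addresses: List[str],
                              is_healthy: Callable[[str], bool]) -> Iterator[str]:
    """Rotation that skips every address failing the health check."""
    while True:
        served = False
        for address in addresses:
            if is_healthy(address):
                served = True
                yield address
        if not served:
            raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    backends = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    down = {"192.0.2.11"}  # pretend this server has just died

    # Plain round robin keeps sending every third client to the dead server.
    print(list(islice(round_robin(backends), 6)))
    # With a health check in the loop, the dead server is skipped.
    print(list(islice(round_robin_with_failover(backends, lambda a: a not in down), 6)))
```

With DNS, the second behaviour needs something outside the protocol to pull dead records from the zone, and even then clients keep cached answers until the TTL expires, which is the reason for the low-TTL advice.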
E-mail servers:
- it's as simple as MX records (delivering)
- it's almost as simple as a complicated system of SMTP servers (sending)
- it's not that simple with IMAP locking over a DFS (reading)
5 gmail-smtp-in.l.google.com.
10 alt1.gmail-smtp-in.l.google.com.
20 alt2.gmail-smtp-in.l.google.com.
30 alt3.gmail-smtp-in.l.google.com.
40 alt4.gmail-smtp-in.l.google.com.
When MXing – watch the spam!
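The MX list above gives failover for incoming mail because a sending server tries the hosts in order of preference, lowest number first. A minimal sketch of that ordering, with the records hard-coded from the slide and a hypothetical `try_host` callback standing in for a real SMTP delivery attempt:

```python
# MX-based failover for mail delivery: lowest preference value is tried first.
# `try_host` is a hypothetical stand-in for an actual SMTP delivery attempt.
from typing import Callable, List, Optional, Tuple

MX_RECORDS: List[Tuple[int, str]] = [
    (5, "gmail-smtp-in.l.google.com."),
    (10, "alt1.gmail-smtp-in.l.google.com."),
    (20, "alt2.gmail-smtp-in.l.google.com."),
    (30, "alt3.gmail-smtp-in.l.google.com."),
    (40, "alt4.gmail-smtp-in.l.google.com."),
]


def delivery_order(records: List[Tuple[int, str]]) -> List[str]:
    """Return the hosts sorted by MX preference (lowest value first)."""
    return [host for _, host in sorted(records)]


def deliver(records: List[Tuple[int, str]],
            try_host: Callable[[str], bool]) -> Optional[str]:
    """Try each host in preference order; return the first that accepts the mail."""
    for host in delivery_order(records):
        if try_host(host):
            return host
    return None  # every host failed; a real MTA would queue and retry later


if __name__ == "__main__":
    # Pretend the primary and alt1 are unreachable.
    unreachable = {"gmail-smtp-in.l.google.com.", "alt1.gmail-smtp-in.l.google.com."}
    print(deliver(MX_RECORDS, lambda host: host not in unreachable))
```

The sketch sorts ties alphabetically for simplicity; real mail servers typically pick randomly among hosts with equal preference.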
WEB servers:
- it's as simple as some frontend load-balancer
- did you really stick user sessions to a particular server? Memcache!
- LB balancing algorithm
- how many LBs?
- what if the LB goes down?
DB servers:
- it's.. not that simple
- replication (master – master? The app should be aware..)
- replication ring? Complicated; it works, but in case of failure...
- let's talk about MySQL:
  - NoSPOF solution: MySQL Cluster
  - MySQL Galera Cluster – synchronous, active-active multi-master
  - master – master – simply works
  - MySQL Fabric – HA + sharding; use with large farms
  - Failover? Matsunobu Yoshinori mysql-master-ha
  - MySQL Utilities (http://www.clusterdb.com/mysql/mysql-utilities-webinar-qa-replay-now-available/)
Matsunobu Yoshinori mysql-master-ha
https://code.google.com/p/mysql-master-ha/
“I have heard a couple of cases where AWS users use MHA
instead of RDS because RDS takes much longer downtime
on slave promotion (more than 5 minutes usually, because
standby database is not running).
I'm surprised AWS users care about a few minutes of downtime..”
Caching servers:
- this is a cache, for God's sake – why would we use HA here?
- just use a proper architecture like... redundancy.
Load-balancers:
- remember about failing over IP addresses!
Storage – DFSes:
- GlusterFS – we'll see it in action in a minute
- NFS? Could be – over some SAN / NAS (high-cost solution)
- CephFS – just like GlusterFS – it's great and does the work
- DRBD – lower level, works on the block-device layer – slow...
GlusterFS:
- low cost (could be..)
- distributed volumes
- replicated volumes
- striped volumes
- and...
- distributed-striped volumes
- distributed-replicated volumes
- distributed-striped-replicated volumes
- sounds good? :)
GlusterFS: replicated volumes vs Geo-replication
- replicated:
  - mirrors data
  - provides HA
  - synchronous replication
- Geo-replication:
  - mirrors data across geo-distributed clusters
  - ensures data is backed up for DR
  - asynchronous replication (periodic checks)
HA for virtualization solutions?
- it's really complicated, like...
- but it could be done simpler...
- it could be done simpler with containers!
- containers are very light
- deploy time from bare metal is tiny
- management is very easy
- resource throttling via cgroups
Tools
The most important tool would be the conclusion from the picture below:
- DNS: round robin, GSLB, low TTLs, GlobalIP
- Load-balancers (L7, stateless services): HAProxy, Pound, Nginx
- Failover (stateful services):
- IP: Keepalived + sysctl
- Managing: Pacemaker (manager) + Corosync (messaging)
- (almost) All-In-One: Linux Virtual Server
Turn on HA thinking!
Main goal of HA? Improve user experience!
- keep the app fully functional
- keep the app resistant and tolerant to faults
- provide a method for a successful audit
- sleep well (anyone awake?) ;)
Maciej Lasyk
11. Sesja Linuksowa
2014-04-05, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
High Availability Explained
Thank you :)

High Availability (HA) Explained - second edition

  • 1. Maciej Lasyk, High Availability Explained Maciej Lasyk 11. Sesja Linuksowa Wrocław, 2014-04-05 1/14 High Availability Explained
  • 2. Maciej Lasyk, High Availability Explained “Anything that can go wrong, will go wrong” Murphy's law 2/14
  • 3. Maciej Lasyk, High Availability Explained “Anything that can go wrong, will go wrong” Murphy's law 2/14
  • 4. Maciej Lasyk, High Availability Explained An electrical explosion and fire Saturday at a Houston data center operated by The Planet has taken the entire facility offline. The company claimed power to the facility was interrupted when a transformer exploded. Official reports that three walls were blown down causing a fire. “Anything that can go wrong, will go wrong” Murphy's law 2/14
  • 5. Maciej Lasyk, High Availability Explained Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed. An electrical explosion and fire Saturday at a Houston data center operated by The Planet has taken the entire facility offline. The company claimed power to the facility was interrupted when a transformer exploded. Official reports that three walls were blown down causing a fire. “Anything that can go wrong, will go wrong” Murphy's law 2/14
  • 6. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder 3/14
  • 7. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales 3/14
  • 8. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level 3/14
  • 9. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) 3/14
  • 10. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) 3/14
  • 11. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) System engineers: we can sleep well (and fsck, we love to!) 3/14
  • 12. Maciej Lasyk, High Availability Explained High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) System engineers: we can sleep well (and fsck, we love to!) Technical support: no calls? Back to WoW then.. ;) 3/14
  • 13. Maciej Lasyk, High Availability Explained So how many 9's? 4/14
  • 14. Maciej Lasyk, High Availability Explained So how many 9's? 4/14
  • 15. Maciej Lasyk, High Availability Explained So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability 4/14
  • 16. Maciej Lasyk, High Availability Explained So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability Yearly: 1 hour of outage means 100% - 0.01142 ~= 99.98858 of availability 4/14
  • 17. Maciej Lasyk, High Availability Explained So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability Yearly: 1 hour of outage means 100% - 0.01142 ~= 99.98858 of availability Availability Downtime (year) Downtime (month) 90% (“one nine”) 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% (“two nines”) 3.65 days 7.2 hours 99.5% 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% (“three nines”) 4.38 hours 21.56 minutes 99.99 (“four nines”) 52.56 minutes 4.32 minutes 99.999 (“five nines”) 5.26 minutes 25.9 seconds 4/14
  • 18. Maciej Lasyk, High Availability Explained So how many 9's? https://jazz.net/wiki/bin/view/Deployment/HighAvailability 4/14
  • 19. Maciej Lasyk, High Availability Explained HA terminology RPO: Recovery Point Objective; how much data can we loose? 5/14
  • 20. Maciej Lasyk, High Availability Explained HA terminology RPO: Recovery Point Objective; how much data can we loose? RTO: Recovery Time Objective; how long does it take to recover? 5/14
  • 21. Maciej Lasyk, High Availability Explained HA terminology RPO: Recovery Point Objective; how much data can we loose? RTO: Recovery Time Objective; how long does it take to recover? MTBF: Mean-Times-Between-Failures; time between failures (density fnc -> reliability fnc) https://en.wikipedia.org/wiki/Mean_time_between_failures 5/14
  • 22. Maciej Lasyk, High Availability Explained HA terminology SLA: Service Level Agreement; formal definitions (customer <-> provider) 5/14
  • 23. Maciej Lasyk, High Availability Explained HA terminology SLA: Service Level Agreement; formal definitions (customer <-> provider) OLA: Operational Level Agreement; definitions within organization; help us keeping provided SLAs 5/14
  • 24. Maciej Lasyk, High Availability Explained SLAs.. So what is written in SLAs? Availability Downtime (year) Downtime (month) 90% 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% 3.65 days 7.2 hours 99.5% 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% 4.38 hours 21.56 minutes 99.99 52.56 minutes 4.32 minutes 99.999 5.26 minutes 25.9 seconds 5/14
  • 25. Maciej Lasyk, High Availability Explained SLAs.. So what is written in SLAs? Availability Downtime (year) Downtime (month) 90% 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% 3.65 days 7.2 hours 99.5% (EC2, EBS) 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% (SoftLayer, IBM) 4.38 hours 21.56 minutes 99.99 (OVH ded. cloud) 52.56 minutes 4.32 minutes 99.999 5.26 minutes 25.9 seconds http://aws.amazon.com/ec2/sla/ http://www.softlayer.com/about/service-level-agreement https://www.ovh.com/us/dedicated-cloud/security-and-sla.xml 5/14
  • 26. Maciej Lasyk, High Availability Explained SLAs.. Availability mentioned in SLAs are only goals of service provider Usually when it's not met than company pays off the fees 5/14
  • 27. Maciej Lasyk, High Availability Explained SLAs.. 5/14 Hetzner? “We guarantee an annual average of 99% network availability” “For indirect damages and loss of profits, we are liable only in cases of intentional or gross negligence. In this case we are liable only for the contract-typical predictable damage, a maximum of 100% of the annually fee.” http://www.hetzner.de/en/hosting/legal/agb
  • 28. Maciej Lasyk, High Availability Explained SLAs.. 5/14 Leaseweb? (yup - megaupload) - no %s - best effort - $$$ for faster response times http://www.leaseweb.com/en/support/all-about/sla
  • 29. Maciej Lasyk, High Availability Explained How deep is this hole? app layer (core, db, cache) data storage operating system hardware networking location So we would like to achieve 99,9999% which is about 30s of downtime per year 6/14
  • 30. Maciej Lasyk, High Availability Explained app layer (core, db, cache) data storage operating system hardware networking location Even Proof of Concept is very hard to provide: 2.5s of downtime per layer monthly! 6/14 How deep is this hole?
  • 31. Maciej Lasyk, High Availability Explained app layer (core, db, cache) data storage operating system hardware networking location AOL has 99,999%! http://highscalability.com/blog/2014/2/17/how-the-aolcom-architecture-evolved-to-99999-availability-8.html 6/14 How deep is this hole?
  • 32. Load-balancing and failover. LB: http://www.netdigix.com/linux-loadbalancing.php
  • 33. Load-balancing and failover. Failover: http://www.simplefailover.com/
  • 34. LB: 4th layer or 7th?
        4th layer:
        - high performance
        - just do the LB work!
        - reliable
        - scalable
        7th layer:
        - low cost
        - good for quickfixes / patches
        - not that scalable
        - low performance
        - complex codebase
        - custom code for protocols
        - cookies? what about memcache..
  • 35. Disaster Recovery
  • 37. Disaster Recovery
        http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments
        - Hot site: active synchronization, could be serving services. Cost can be high
        - Warm site: periodic synchronization, DR tests needed. Low costs
        - Cold site: nothing here, just an echo and some place to spin up services; nightmare
  • 38. Disaster Recovery
        http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments
        - Hot site: active synchronization, could be serving services. Cost can be high
        So maybe use the “hot” site as production, behind a global LB? Win-win scenario
  • 39. Planning for failure
  • 40. Planning for failure
        Everything starts here: DNS
        - keep TTLs low (300 s). Can't go below 60 min? That's bad!
        - check the SLA of your DNS servers (dnsmadeeasy.com history)
        - what do you know about your DNS servers?
        - zero downtime here is a must!
        - this can be achieved with complicated network abracadabra
        - remember what 99.9999% means?
        - round robin is a load balancer, but without failover!
        - GSLB (Global Server Load Balancing): killed by OS / browser / server caching
        - GlobalIP (SoftLayer etc.): workaround for GSLB via routing
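The round-robin point can be illustrated with a small sketch. It is only a model: the addresses come from the 192.0.2.0/24 documentation range and the health check is a hypothetical callback, not a real DNS or monitoring API.

```python
from itertools import cycle
from typing import Callable, Iterator, Optional, Sequence


def round_robin(addresses: Sequence[str]) -> Iterator[str]:
    """Plain rotation, like DNS round robin: dead hosts are still handed out."""
    return cycle(addresses)


def pick_with_failover(addresses: Sequence[str],
                       is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first address that passes the health check, or None."""
    for address in addresses:
        if is_alive(address):
            return address
    return None


if __name__ == "__main__":
    servers = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # hypothetical A records
    dead = {"192.0.2.1"}

    rotation = round_robin(servers)
    print("round robin hands out:", [next(rotation) for _ in range(4)])
    print("with a health check:", pick_with_failover(servers, lambda a: a not in dead))
```

Rotation alone keeps sending a share of clients to the dead address until someone edits the zone and the cached records expire, which is why low TTLs and a real failover mechanism matter.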
  • 41. Planning for failure
        E-mail servers:
        - it's as simple as MX records (delivering)
        - it's almost as simple as a complicated system of SMTP servers (sending)
        - it's not that simple with IMAP locking over a DFS (reading)
        MX records:
          5 gmail-smtp-in.l.google.com.
         10 alt1.gmail-smtp-in.l.google.com.
         20 alt2.gmail-smtp-in.l.google.com.
         30 alt3.gmail-smtp-in.l.google.com.
         40 alt4.gmail-smtp-in.l.google.com.
        When MXing, watch the spam!
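How a sending server uses such a record set can be sketched as follows: try the hosts in order of preference, lowest number first, and fall back to the next one when a host is unreachable. The `can_connect` callback is a hypothetical stand-in for a real SMTP connection attempt, and the sketch does not randomize hosts that share a preference value.

```python
from typing import Callable, List, Optional, Tuple

# (preference, host) pairs as listed on the slide.
GMAIL_MX: List[Tuple[int, str]] = [
    (20, "alt2.gmail-smtp-in.l.google.com."),
    (5, "gmail-smtp-in.l.google.com."),
    (40, "alt4.gmail-smtp-in.l.google.com."),
    (10, "alt1.gmail-smtp-in.l.google.com."),
    (30, "alt3.gmail-smtp-in.l.google.com."),
]


def delivery_order(mx_records: List[Tuple[int, str]]) -> List[str]:
    """Return MX hosts sorted by preference, lowest value first."""
    return [host for _, host in sorted(mx_records, key=lambda record: record[0])]


def first_reachable(mx_records: List[Tuple[int, str]],
                    can_connect: Callable[[str], bool]) -> Optional[str]:
    """Walk the hosts in preference order and return the first one that answers."""
    for host in delivery_order(mx_records):
        if can_connect(host):
            return host
    return None


if __name__ == "__main__":
    down = {"gmail-smtp-in.l.google.com."}  # pretend the primary is unreachable
    print(first_reachable(GMAIL_MX, lambda host: host not in down))
```

The failover lives in the sender, so the receiving side gets redundancy just by publishing several MX hosts. The backup hosts need the same spam filtering as the primary, as the slide warns.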
  • 42. Planning for failure
        WEB servers:
        - it's as simple as a frontend load balancer
        - did you really stick user sessions to a particular server? Memcache!
        - LB balancing algorithm
        - how many LBs?
        - what if the LB goes down?
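The session remark can be shown with a toy model. Here a plain dictionary stands in for a shared store such as memcached, and the backend names are made up; the point is only that when sessions live outside the web servers, any backend can serve any request.

```python
from itertools import cycle
from typing import Dict, List, Tuple


class Backend:
    """A web server that keeps no session state of its own."""

    def __init__(self, name: str, session_store: Dict[str, List[str]]) -> None:
        self.name = name
        self.sessions = session_store  # shared store, e.g. memcached in real life

    def add_to_cart(self, session_id: str, item: str) -> Tuple[str, List[str]]:
        """Append an item to the session's cart and report who served the request."""
        cart = self.sessions.setdefault(session_id, [])
        cart.append(item)
        return self.name, list(cart)


def demo() -> List[Tuple[str, List[str]]]:
    """Send two requests of one session through a round-robin balancer."""
    shared_store: Dict[str, List[str]] = {}
    balancer = cycle([Backend("web1", shared_store), Backend("web2", shared_store)])
    return [next(balancer).add_to_cart("session-42", item) for item in ("book", "pen")]


if __name__ == "__main__":
    for served_by, cart in demo():
        print(served_by, cart)
```

With sticky sessions the second request would have to reach the same server as the first, and losing that server would lose the cart. With the shared store the balancer is free to pick any healthy backend.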
  • 43. Planning for failure
        DB servers:
        - it's.. not that simple
        - replication (master-master? The app should be aware..)
        - replication ring? Complicated, works, but in case of failure...
        - let's talk about MySQL:
          - no-SPOF solution: MySQL Cluster
          - MySQL Galera cluster: synchronous, active-active multi-master
          - master-master: simply works
          - MySQL Fabric: HA + sharding; use with large farms
          - Failover? Matsunobu Yoshinori's mysql-master-ha
          - MySQL utilities (http://www.clusterdb.com/mysql/mysql-utilities-webinar-qa-replay-now-available/)
  • 44. Planning for failure
        DB servers: Matsunobu Yoshinori's mysql-master-ha
        https://code.google.com/p/mysql-master-ha/
        “I have heard a couple of cases where AWS users use MHA instead of RDS because RDS takes much longer downtime on slave promotion (more than 5 minutes usually, because standby database is not running). I'm surprised AWS users care about a few minutes of downtime..”
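Any master failover has to decide which replica to promote. One common rule is to pick the replica that has applied the most of the old master's log, so the least data is lost. The sketch below shows only that selection step; it is not code from mysql-master-ha, and the host names and log positions are hypothetical.

```python
from typing import Dict, Optional


def choose_new_master(replica_positions: Dict[str, int]) -> Optional[str]:
    """Return the replica with the highest applied log position, or None if there is none."""
    if not replica_positions:
        return None
    return max(replica_positions, key=lambda name: replica_positions[name])


if __name__ == "__main__":
    # Hypothetical positions reported by each replica after the master died.
    positions = {"db2": 1200, "db3": 1350, "db4": 900}
    print("promote:", choose_new_master(positions))
```

A real tool also has to bring the other replicas up to the same position and repoint them at the new master, which is where most of the downtime goes.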
  • 47. Planning for failure
        Caching servers:
        - this is cache for God's sake, why would we use HA here?
        - just use proper architecture like... redundancy.
        Load balancers:
        - remember about failing over IP addresses!
        Storage (DFSes):
        - GlusterFS: we'll see it in action in a minute
        - NFS? Could be, over some SAN / NAS (high cost solution)
        - CephFS: just like GlusterFS, it's great and does the work
        - DRBD: lower level, does the work on the block-device layer; slow...
  • 48. Planning for failure
        GlusterFS:
        - low cost (could be..)
        - distributed volumes
        - replicated volumes
        - striped volumes
        - and...
        - distributed-striped volumes
        - distributed-replicated volumes
        - distributed-striped-replicated volumes
        - sounds good? :)
  • 49. Planning for failure
        GlusterFS: replicated volumes vs Geo-replication
        - replicated:
          - mirrors data
          - provides HA
          - synchronous replication
        - Geo-replication:
          - mirrors data across geo-distributed clusters
          - ensures backing up data for DR
          - asynchronous replication (periodic checks)
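The difference between the two modes can be modelled in a few lines. This is a toy model, not GlusterFS code: dictionaries stand in for bricks and sites, and the class names are made up for illustration.

```python
from typing import Dict, List, Tuple


class SyncReplicatedVolume:
    """Every write is applied to all replicas before it is acknowledged."""

    def __init__(self, replica_count: int) -> None:
        self.replicas: List[Dict[str, bytes]] = [{} for _ in range(replica_count)]

    def write(self, path: str, data: bytes) -> None:
        for replica in self.replicas:
            replica[path] = data


class GeoReplicatedVolume:
    """Writes are acknowledged locally and shipped to the remote site later."""

    def __init__(self) -> None:
        self.local: Dict[str, bytes] = {}
        self.remote: Dict[str, bytes] = {}
        self.pending: List[Tuple[str, bytes]] = []

    def write(self, path: str, data: bytes) -> None:
        self.local[path] = data
        self.pending.append((path, data))

    def sync(self) -> int:
        """Periodic job: push queued changes to the remote site, return how many."""
        shipped = len(self.pending)
        for path, data in self.pending:
            self.remote[path] = data
        self.pending.clear()
        return shipped


if __name__ == "__main__":
    geo = GeoReplicatedVolume()
    geo.write("/data/report.txt", b"v1")
    print("on remote before sync:", "/data/report.txt" in geo.remote)
    geo.sync()
    print("on remote after sync:", "/data/report.txt" in geo.remote)
```

In the synchronous model a surviving replica always holds the latest write, which is what makes it usable for HA. In the asynchronous model anything written since the last sync is missing on the remote site, which is acceptable for DR but not for failover without data loss.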
  • 50. Planning for failure
        HA for virtualization solutions?
        - it's really complicated, like...
  • 52. Planning for failure
        HA for virtualization solutions?
        - but it could be done simpler...
  • 53. Planning for failure
        HA for virtualization solutions?
        - it could be done simpler with containers!
        - containers are very light
        - deploy time from bare metal is tiny
        - management is very easy
        - resource throttling via cgroups
  • 54. Tools
        The most important tool would be the conclusion from the picture below:
  • 61. Tools
        - DNS: round robin, GSLB, low TTLs, GlobalIP
        - Load balancers (L7, stateless services): HAProxy, Pound, Nginx
        - Failover (stateful services):
          - IP: Keepalived + sysctl
          - Managing: Pacemaker (manager) + Corosync (messaging)
        - (almost) All-In-One: Linux Virtual Server
  • 62. Turn on HA thinking!
        Main goal of HA? Improve user experience!
        - keep the app fully functional
        - keep the app resistant and tolerant to faults
        - provide a method for a successful audit
        - sleep well (anyone awake?) ;)
  • 63. High Availability Explained. Thank you :)
        Maciej Lasyk, 11. Sesja Linuksowa, Wrocław, 2014-04-05
        http://maciek.lasyk.info/sysop maciek@lasyk.info @docent-net