High	Availability
Breda	Development	Meetup
Bas Peters - June 8, 2016
Breda Development Meetup 2016-06-08 - High Availability
Uptime
Availability target    Max downtime per year
90%                    36.5 days
99%                    3.65 days
99.5%                  1.83 days
99.9%                  8.76 hours
99.99%                 52.56 minutes
99.999%                5.26 minutes
99.9999%               31.5 seconds
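The figures in this table follow from one formula: downtime budget = (1 - availability) x one year. A minimal sketch, assuming a 365-day year; the function name is only illustrative:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def max_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget in seconds for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


# Print the budget for each target in the table, in hours.
for target in (90, 99, 99.5, 99.9, 99.99, 99.999, 99.9999):
    hours = max_downtime_seconds(target) / 3600
    print(f"{target}%: {hours:.4f} hours per year")
```

Dividing the result by 60, 3600 or 86400 gives minutes, hours or days.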
HA is Redundancy
✓ RAID: Disk crash? Another disk still works!
✓ Virtualization: Physical host crashes? VM available on other physical host!
✓ Clustering: Server crashes? Another server still works!
✓ Power: Power outage? Redundant power supply!
✓ Network: Switch or NIC crashes? 2nd network route available!
✓ Geographical: Datacenter offline? Another DC available to perform work!
Traditional setup
[Diagram: end user → router → server]
Traditional setup - enhanced
[Diagram: end user → router → application server → database server]
Adding redundancy
[Diagram: end user → router → loadbalancer → application server 1 and application server 2 → database server]
Enhanced redundancy
[Diagram: end user → router and router (backup) → loadbalancer and loadbalancer (backup) → application server 1 and application server 2 → database server]
Database redundancy
[Diagram: end user → router and router (backup) → loadbalancer and loadbalancer (backup) → application server 1 and application server 2 → database server 1 and database server 2]
Datacenter redundancy
[Diagram: end user → router, router (backup), loadbalancer, loadbalancer (backup), application server 1, application server 2, database server 1 and database server 2, placed across datacenter 1 and datacenter 2]
States and sessions
o Multiple requests can be served by different backend servers
o Store the session in a database or NoSQL cache
o The loadbalancer can “stick” a user to a single backend server…
o ... but not in all cases!
[Diagram: requests 1, 2 and 3 distributed over app 1, app 2, app 3 and app 4]
Local storage
o Avoid	storing	meaningful	persistent	user	content	on	a	local	server
o Application-level caching is useful as long as losing the cache is not destructive
o Synchronization	of	contents	between	backend	servers	is	a	pain
o Use	database	for	storage	where	possible
…	There	are	possibilities	to	share	storage	amongst	backend	servers
Shared storage - NAS
o Network	Attached	Storage
o A	NAS	handles	the	complete	filesystem
o Relies	on	protocols	like:
NFS: Network	Filesystem
SMB/CIFS: Windows	File	Sharing
o Simple	to	implement
o Redundancy	is	very	hard	to	achieve,	often	single	point	of	failure
o Performance	is	mediocre	and	bottlenecks	can	occur
Shared storage - SAN
o Storage	Area	Network
o A	SAN	handles	only	the	“block	level”	part	of	the	filesystem
o Relies	on	protocols	like:
iSCSI: IP	based	SCSI
Fibre Channel: Optical	fiber	transport	protocol
AoE: ATA	over	Ethernet
o Hard	to	implement,	expensive
o Redundancy	can	be	achieved	to	avoid	single	point	of	failure
o Performance and scalability are (reasonably) good
Shared storage – Cluster Filesystem
o Filesystem	shared	on	multiple	servers	using	special	software	/	drivers
o Windows	implementation:
DFS: Windows	Distributed	File	System		
o Linux	implementations:
HDFS: Hadoop	Distributed	Filesystem
Ceph: Object	Storage	Platform
GlusterFS: Red	Hat	Cluster	Filesystem
o Relatively	easy	to	implement
o Redundancy	can	easily	be	achieved
o Performance and scalability are (reasonably) good
Database High Availability
o High availability for an RDBMS (relational database management system) is often the most difficult part of a highly available setup
o Hardware resources and data need to be redundant
o Remember that it isn’t just data, it is constantly changing data
o High availability means the operation continues uninterrupted, not that service resumes after restoring a new/backup server
Database HA - Replication
o Asynchronous	by	default
o One	master,	many	slaves
o No	write	scale-out	possible
o Difficult	to	recover	from	a	failover	situation
o Prone	to	inconsistency	when	not	used	properly
Database HA - Sharding
o Distribute data over multiple database back-ends using keyed distribution
o Multi	master	setup	possible
o Excellent	scalability
o Redundancy	needs	to	be	obtained	through	a	complementary	methodology
o Requires	more	complex	application	logic
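Keyed distribution can be as simple as hashing the key and taking it modulo the number of back-ends. A minimal sketch; the shard names and the choice of SHA-256 are assumptions for illustration, not something the slides prescribe:

```python
import hashlib

# Hypothetical back-end names, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key to one back-end using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and spread it over the shards.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


print(shard_for("user:42"))
```

The same key always lands on the same back-end, which is the "more complex application logic" the application has to carry. Note that plain modulo hashing moves most keys when the number of shards changes.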
Database HA – Clustering I
o Synchronous	by	default
o Multi	master	setup	possible
o Write	scale-out	possible
o Near-automatic	fault	recovery
o Requires replication conflict resolution at the code level
Database HA – Clustering II
Clustering for Microsoft SQL Server (from 2012)
o Always	On	Availability	Groups
o Each	node	requires	WSFC	(Windows	Server	Failover	Clustering)
o Asynchronous	and	synchronous	commit	mode	supported
o Up to 8 “warm” availability replicas can be set up
o These	replicas	can	be	used	for	read	transactions	and	backups
o Availability	group	listener	to	automatically	redirect	clients	to	the	best	available	server
o Not	a	“real”	cluster,	no	master-master	replication	possible
Database HA – Clustering III
Clustering	for	MySQL	(MariaDB)
o Galera (wsrep) plugin to enable clustering (included in MariaDB 10.1 by default)
o Galera replication itself is (virtually) synchronous; asynchronous replication is only available through standard MySQL/MariaDB replication
o Multi-master	synchronous	replication
o Read	and	write	scalability
o Automatic	membership	control,	node	joining	and	dropping
o No	listener	functionality	that	redirects	clients	to	available	nodes
Clustering – Quorum I
“A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group”
- Wikipedia
Clustering – Quorum II
o Node Majority: Each node that is available and in communication can vote. The cluster functions only with a majority of the votes.
o When a network partition occurs, the nodes in the minority part will go into lockdown to avoid a “split brain” situation
o When a network partition resolves, the minority part will rejoin the active cluster after a state transfer to retrieve the data that was changed in the meantime
o A cluster should contain an odd number of nodes to prevent a total lockdown during a node failure or network partition
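The node-majority rule comes down to a strict "more than half" comparison. A minimal sketch, with a hypothetical function name, that also shows why an even split locks down both halves:

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True when the reachable nodes hold a strict majority of the votes."""
    return reachable_nodes > cluster_size / 2


# 3-node cluster: losing one node keeps the majority, losing two does not.
print(has_quorum(2, 3))  # 2 of 3 votes
print(has_quorum(1, 3))  # 1 of 3 votes

# 6-node cluster split in two equal halves: neither side has a majority.
print(has_quorum(3, 6))
```

With an odd number of nodes a clean two-way split always leaves one side with the majority.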
Clustering – Scenario 1
o Node	A	is	gracefully	stopped
o Other	nodes	receive	“leave”	message	
and	quorum	is	reduced	by	1
o Cluster	is	online
o Node	B	and	C	continue	to	serve	
requests	because	they	have	the	
majority	of	votes	(2	of	2)
Clustering – Scenario 2
o Node	A	and	B	are	gracefully	stopped
o Node C receives “leave” messages from A and B and quorum is reduced by 2
o Cluster	is	online
o Node	C	continues	to	serve	clients	since	
it	has	the	majority	of	votes	in	the	
quorum	(1	of	1)
Clustering – Scenario 3
o All	nodes	are	gracefully	stopped
o Cluster	is	offline
o There	is	a	potential	problem	in	starting	
the	cluster	again.	The	most	recent	(last	
stopped)	node	should	be	used	to	
bootstrap	the	cluster	or	there	is	
potential	data	loss
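Choosing the bootstrap node means finding the node that committed the most recent transaction. A minimal sketch, assuming each node can report a last committed sequence number; the `NodeState` type and its field names are hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    name: str
    # Hypothetical field: number of the last transaction the node
    # committed before it stopped.
    last_committed_seqno: int


def pick_bootstrap_node(nodes: List[NodeState]) -> NodeState:
    """Return the node with the most recent data."""
    return max(nodes, key=lambda node: node.last_committed_seqno)


nodes = [
    NodeState("A", 1040),  # stopped first
    NodeState("B", 1055),
    NodeState("C", 1071),  # stopped last, so it has seen the most writes
]
print(pick_bootstrap_node(nodes).name)
```

Bootstrapping from A or B in this example would discard the transactions that only C has seen.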
Clustering – Scenario 4
o Node	A	disappears	from	the	cluster	due	to	
unforeseen	circumstances
o Node	B	and	C	will	try	to	reconnect	to	A	but	will	
eventually	remove	A	from	the	cluster,	
maintaining	the	quorum	(3)
o Cluster	is	online
o Node	B	and	C	continue	to	serve	requests	
because	they	have	the	majority	of	votes
(2	of	3)
Clustering – Scenario 5
o Node	A	and	B	disappear	from	the	cluster	
due	to	unforeseen	circumstances
o Node	C	will	try	to	reconnect	to	A	and	B	
but	will	eventually	remove	both	from	the	
cluster,	maintaining	the	quorum	(3)
o Cluster	is	offline
o The	cluster	is	offline	because	Node	C	
cannot	acquire	a	majority	of	the	votes
(1	of	3)	and	will	remain	in	lockdown
Clustering – Scenario 6
o All	nodes	disappear	from	the	cluster	
due	to	unforeseen	circumstances
o Cluster	is	offline (obviously)
o This	is	a	potential	problem	as	the	Node	
with	the	most	recent	data	should	be	
used	to	bootstrap	the	cluster	again	to	
avoid	data	loss
Clustering – Scenario 7
o A network split causes Node A, B and C to lose connectivity with Node D, E and F
o Cluster is offline
o Node A, B and C have no majority (3 of 6) and Node D, E and F also have no majority (3 of 6).
All nodes go into lockdown
Clustering – Multiple Datacenters I
DC	1 DC	2
node	1
node	2
node	3
Clustering – Multiple Datacenters II
DC	1 DC	2
node	1
node	2
node	3
node	4
Clustering – Multiple Datacenters III
DC	1 DC	2
node	1 node	2
DC	3
node	3
Clustering – Multiple Datacenters IV
DC	1 DC	2
node	1
node	2
node	3
node	4
DC	3
node	5 node	6
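Whether a layout survives the loss of a complete datacenter can be checked with the same majority rule. A minimal sketch; the node placement per datacenter is an assumption, since the diagrams do not survive in this transcript:

```python
from typing import Dict


def quorum_after_datacenter_loss(layout: Dict[str, int]) -> Dict[str, bool]:
    """For each datacenter, report whether the remaining nodes keep a
    strict majority of the original cluster when that datacenter is lost."""
    total_nodes = sum(layout.values())
    return {
        dc: (total_nodes - nodes_in_dc) > total_nodes / 2
        for dc, nodes_in_dc in layout.items()
    }


# Assumed placements, number of nodes per datacenter.
print(quorum_after_datacenter_loss({"DC 1": 2, "DC 2": 1}))
print(quorum_after_datacenter_loss({"DC 1": 2, "DC 2": 2}))
print(quorum_after_datacenter_loss({"DC 1": 1, "DC 2": 1, "DC 3": 1}))
print(quorum_after_datacenter_loss({"DC 1": 2, "DC 2": 2, "DC 3": 2}))
```

With two datacenters there is always at least one datacenter whose loss takes the cluster offline; with three equally sized datacenters any single one can fail.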
Health Endpoint Monitoring
o Monitor applications for availability in an HA pool
o Monitor	middle-tier	services	for	availability
o Automatic	removal	of	misbehaving	endpoints	from	the	pool
o Endpoints	that	are	healthy	again	after	a	service	interruption	are	
automatically	re-added
Application Health Check
loadbalancer
Application	Node
Storage	available
Code	can	be	executed
Database	reachable
Service	A	running
Service	B	running
status	request
200	(OK)
Response	time:	50ms
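The status request can be answered by running every check and collapsing the results into one HTTP status code. A minimal sketch of that logic without the web server around it; the check names and the use of 503 for an unhealthy node are assumptions:

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run all checks; return 200 only when every check passes, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that crashes counts as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical checks; real ones would touch storage, the database, etc.
status, details = evaluate_health({
    "storage_available": lambda: True,
    "database_reachable": lambda: True,
    "service_a_running": lambda: False,
})
print(status, details)
```

The loadbalancer only needs the status code; the per-check details help when someone has to find out why a node left the pool.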
Database Health Check
loadbalancer
Database	Node
Database	running
Simple	query	can	be	
executed
Local	database	node	is
healthy	cluster	node
status	request
200	(OK)
Response	time:	50ms
Monitoring Strategy
[Diagram: Loadbalancer monitoring appserver 1, appserver 2 and appserver 3; DB loadbalancers monitoring db node 1, db node 2 and db node 3]
Design Patterns for HA environments
o Safeguard	performance
o Increase fault tolerance
o Improve	consistency
Queue based load leveling pattern I
o Temporal	decoupling
o Load	leveling
o Load	balancing
o Loose	coupling
[Diagram: tasks → message queue → service; requests received at variable rate, messages processed at a more consistent rate]
Queue based load leveling pattern II
When	to	use?
o Any	type	of	application	or	service	that	is	subject	to	overloading
When	not	to	use?
o Not	suitable	if	a	response	with	minimal	latency	is	expected	from	the	
application	or	service
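The pattern can be sketched with an in-process queue: producers enqueue a burst and return immediately, while a single worker drains the queue at its own pace. A real setup would use a message broker; the function and its names are only illustrative:

```python
import queue
import threading


def run_load_leveling(tasks, handler):
    """Push a burst of tasks onto a queue and let one worker process them in order."""
    task_queue = queue.Queue()
    results = []

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work
                task_queue.task_done()
                break
            results.append(handler(task))
            task_queue.task_done()

    thread = threading.Thread(target=worker)
    thread.start()
    for task in tasks:
        task_queue.put(task)  # the producer does not wait for processing
    task_queue.put(None)
    task_queue.join()  # wait until the worker has drained the queue
    thread.join()
    return results


print(run_load_leveling([1, 2, 3], lambda n: n * 2))
```

The caller gets no immediate answer per task, which is why the pattern does not fit requests that need a low-latency response.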
Throttling pattern I
o Reject	or	delay	requests	to	the	application	when	a	certain	number	of	
requests	in	a	certain	amount	of	time	is	reached
o Disable	or	degrade	functionality	of	selected	nonessential	services	so	that	
essential	services	can	run	unimpeded	with	sufficient	resources
Throttling pattern II
When	to	use?
o To	ensure	that	a	system	continues	to	meet	service	level	agreements
o To	prevent	a	single	tenant	from	monopolizing	the	resources	provided	by	an	
application
o To	handle	bursts	in	activity
o To	help	cost-optimize	a	system	by	limiting	the	maximum	resource	levels	
needed	to	keep	it	functioning
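Rejecting requests above a limit per time window, per tenant, can be sketched as a fixed-window counter. This is one possible throttling strategy, not the only one; the class name and the injectable clock are assumptions made to keep the example testable:

```python
import time


class FixedWindowThrottle:
    """Allow at most max_requests per tenant in each window of window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max_requests = max_requests
        self._window_seconds = window_seconds
        self._clock = clock
        self._windows = {}  # tenant -> (window start, request count)

    def allow(self, tenant: str) -> bool:
        now = self._clock()
        window_start, count = self._windows.get(tenant, (now, 0))
        if now - window_start >= self._window_seconds:
            window_start, count = now, 0  # a new window begins
        if count >= self._max_requests:
            self._windows[tenant] = (window_start, count)
            return False  # over the limit: reject (or delay) the request
        self._windows[tenant] = (window_start, count + 1)
        return True


throttle = FixedWindowThrottle(max_requests=2, window_seconds=10)
print([throttle.allow("tenant-a") for _ in range(3)])
```

Because counters are kept per tenant, one busy tenant cannot use up the budget of another.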
Retry pattern
o Enable	the	application	to	handle	anticipated,	temporary	failures
o Transparently	retrying	an	operation	that	has	previously	failed	in	the	
expectation	that	the	cause	of	the	failure	is	transient
o Especially	useful	in	micro-service	and	cloud	architectures
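A retry wrapper with exponential backoff is a common way to apply this pattern. A minimal sketch; `TransientError`, the attempt count and the delays are assumptions, and a real application would decide for itself which exceptions count as transient:

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are expected to pass."""


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation; on a TransientError wait and try again, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky():
    # Fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("try again")
    return "ok"


print(retry(flaky))
```

Only operations that are safe to repeat should be wrapped this way, and non-transient errors should not be retried.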
Deployments
Highly available environments bring additional challenges to software deployments:
o How to perform atomic releases?
o How to roll back a faulty release quickly?
o How to release new software without any downtime?
Basic deployment
loadbalancer
application	server	1
application	server	2
database	cluster
1. replace application code on appserver 1
2. replace application code on appserver 2
3. apply database changes
DONE!
Enhanced deployment
loadbalancer
application	server	1
application	server	2
database	cluster
1. remove appserver 1 from the pool
2. replace application code on appserver 1
3. enable appserver 1 in the pool and disable appserver 2
4. replace application code on appserver 2
5. enable appserver 2 in the pool
6. apply database changes
DONE!
A/B Deployments I
loadbalancer
www.live.nl: appserver 1 - A, appserver 2 - A
www.shadow.nl: appserver 1 - B, appserver 2 - B
application server 1: webserver A (/deploy/A), webserver B (/deploy/B)
application server 2: webserver A (/deploy/A), webserver B (/deploy/B)
A/B Deployments II
loadbalancer
request	for:	
www.live.nl
“www.live.nl is	being	
served	by	pool	A”
application	server
Webserver	A	code	resides	at
/deploy/A
request	for:	
www.shadow.nl
“www.shadow.nl is	being	
served	by	pool	B”
Webserver	B	code	resides	at
/deploy/B
A/B Deployments III
loadbalancer
www.live.nl
www.shadow.nl
POOL A → B
POOL B → A
By swapping Pool A with Pool B in the loadbalancer, the entire backend is switched instantaneously.
This enables seamless deployment without downtime.
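The swap itself is a single change to the loadbalancer's host-to-pool mapping. A minimal sketch of that idea; the `LoadBalancer` class is hypothetical and stands in for whatever configuration the real loadbalancer offers:

```python
class LoadBalancer:
    """Toy model of the host-to-pool mapping held by the loadbalancer."""

    def __init__(self):
        self.routes = {"www.live.nl": "A", "www.shadow.nl": "B"}

    def pool_for(self, host: str) -> str:
        return self.routes[host]

    def swap(self) -> None:
        """Exchange the pools behind the live and shadow hostnames in one step."""
        routes = self.routes
        routes["www.live.nl"], routes["www.shadow.nl"] = (
            routes["www.shadow.nl"],
            routes["www.live.nl"],
        )


lb = LoadBalancer()
lb.swap()
print(lb.pool_for("www.live.nl"), lb.pool_for("www.shadow.nl"))
```

Calling `swap()` a second time restores the previous mapping, which is also how a faulty release can be rolled back quickly.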
Deployment best practices
o Never introduce backward-incompatible changes to the database
o Thoroughly test the shadow-live environment, as it is the closest to the real live deployment
o Maintain strict release versioning, based on semantic versioning
o Releasing at the end of the day or on a Friday is not recommended
Questions?
WWW.CMTELECOM.COM
THANKS	FOR	LISTENING!
