Choosing the right high availability strategy

MariaDB High
Availability
MariaDB Corp
Roadshow 2018

High
Availability
Defined
In information technology,
high availability refers to a
system or component that is
continuously operational for a
desirably long length of time.
Availability – Wikipedia
up time / total time

Approach to HA
3.7 days / year
Backup /
Restore
1
< 99.9%
52.6 min / year
Replication /
Automatic failover
3
~ 99.99%
8.8hs / year
Simple
replication /
manual
failover
2
~ 99.9%
5.3 min / year
Galera
Cluster
~ 99.999%
4 5
Other
Strategies for High Availability

An average of 80 percent of mission-critical application service
downtime is directly caused by people or process failures. The
other 20 percent is caused by technology failure, environmental
failure or a disaster
Gartner Research

High Availability Background
• High Availability isn’t always equal to long Uptime
– A system is “up” but it might not be accessible
– A system that is “down” just once, but for a long time, is NOT highly available
• High Availability rather means
– Long Mean Time Between Failures (MTBF)
– Short Mean Time To Recover (MTTR)
• High availability is:
– a system design protocol and associated implementation that ensures a certain degree of
operational continuity during a given reference period.

High Availability Components
High availability is a system design protocol and associated implementation that
ensures a certain degree of operational continuity during a reference period.
For stateful services, we
need to make sure that
data is made redundant.
It is not a replacement
for backups!
Data Redundancy
Some mechanism to
redirect traffic from the
failed server or
Datacenter to a working
one
Failover or Switchover
Solution
Availability of the
services needs to be
monitored, to take
action when there is a
failure or even to
prevent them
Monitoring and
Management

General Terms
• Single Point of Failure (SPOF)
– An element is a SPOF when its failure results in a full stop of the service as no other element
can take over (storage, WAN connection, replication channel)
– It is important to evaluate the costs for eliminating the SPOF, the likelihood that it may fail,
and the time required to bring it into service again
• Downtime
– the period of time a service is down. Planned and unplanned. Planned downtime is part of the
overall availability
• Shared vs. Local Storage
– Shared storage systems like SANs can provide built-in high availability, though this comes with
equally high costs
– Not really suitable for Disaster Recovery scenario on multiple Data Center
– Local storage comes with low cost but we need to implement ways for replicating /mirroring
data

General Terms
• Switchover
– When a manual process is used to switch from one system to a redundant or standby system in
case of a failure
• Failover
– Automatic switchover, without human intervention
• Failback
– A (often-underestimated) task to handle the recovery of a failed system and how to fail-back to
the original system after recovery

Data
Redundancy
HA for MariaDB

Replication Scheme
All nodes are masters
and applications can read
and write from/to any
node
Synchronous Replication
The Master does not
confirm transactions to
the client application until
at least one slave has
copied the change to its
relay log, and flushed it to
disk
Semi-Syncronous
Replication
The Master does not
wait for Slave, the
master writes events to
its binary log and
slaves request them
when they are ready
Asynchronous
Replication

HA Begins with Data Replication
• Replication enables data from one MariaDB server (the master) to be
replicated to one or more MariaDB servers (the slaves)
• MariaDB Replication:
– Very easy to setup:
• On master: Define a replication user
• On slave: CHANGE MASTER TO … <options>
– Used to scale out read workloads
– Provide a first level of high availability and geographic redundancy
– Allows to offload backups and analytic jobs.

Asynchronous Replication
• MariaDB Replication is asynchronous by default.
• Slave determines how much to read and from which point in the binary log
• Slave can be behind master in reading and applying changes.
– Single threaded vs parallel replication
• If the master crashes, transactions might not have been transmitted to any
slave
• Asynchronous replication is great for read scaling as adding more replicas
does not impact replication latency

Asynchronous Replication-Switch Over
1. The master server is down
2. The slave(s) server(s) is(are) updated to the last position in the relay log
3. Determine which slave server is the most suitable to promote to master
4. Point reminding slaves to the promoted server
5. Point applications to new master server
6. All steps are manual
Master and Slaves
ReadOnly Slaves
Master and Slaves
ReadOnly Slaves

Async Replication Topologies
Master and Slaves
ReadOnly Slaves
Master with Relay Slave Circular Replication

MaxScale Use Case
Asynchronous
Replication Failover
New in MaxScale v2.2
Each application server
uses only 1 connection
MaxScale identifies the “master” and
“slaves” nodes
If the “master” node fails,
a new one can be selected and promoted
MariaDB Replication + R/W split routing
Max
Scale
Master and Slaves
ReadOnly Slaves

MariaDB GTID Implementation
• Always ON since MariaDB v10.0
– Compatible w/ non-GTID replication: binary log file and position.
• Allows for better control of the replication chain.
– Slave position is recorded crash safe in the same transaction as the last successful DML statement
– Doesn’t require knowing the last binary log file name and position.
– Replication will start from the last recorded GTID
• Allows multi-master replication
– A single slave can have multiple incoming Replication Streams
• MaxScale will select active master automatically
• GTID Components:
– Domain ID: Allows to identify the logical origin of the transactions.
– Server ID: Identifies the server where the transaction originated.
– Transaction Sequence: Monotonically increasing number identifying the transaction.

Semi-synchronous Replication
• MariaDB supports semi-synchronous replication:
– The master does not confirm transactions to the client application until at least one slave has
copied the change to its relay log, and flushed it to disk.
– Eliminates data loss by securing a copy of all transactions in at least one slave.
– When a commit returns successfully, it is known that the data exists in at least two places (on the
master and at least one slave).
– Semi- synchronous has a performance impact due to the additional round trip.
• Adds the network latency to the transaction processing time

MariaDB Enhanced Semi-synchronous Replication
• One or more slaves can be defined as working semi-synchronously.
• For these slaves, the master waits until the I/O thread on one or more of the semi-synch slaves
has flushed the transaction to disk.
• This ensures that all committed transactions are at least stored in the relay log of the slave.
• If no semi-synch slave can acknowledge the transaction, the master will
downgrade to asynchronous replication after waiting for a timeout period.
Once a semi-synch slave comes back online, the master will reset back to semi-
synch replication.
• Status variable: Rpl_semi_sync_master_status

Semi-synchronous Replication – Switch Over
• The steps for a failover are the same as when using the standard replication
• A slave should be chosen among those (if many) that are be semi- synched with the master
Master and Slaves
Semi-Sync
Slave
Async Slaves
Master and Slaves
Async Slaves

Semi-Sync Replication Topologies
• Semi- synchronous replication is used between master
and backup master
• Semi- sync replication has a performance impact, but the
risk for data loss is minimized.
• This topology works well when performing master
failover
– The backup master acts as a warm-standby server
– it has the highest probability of having up-to-date data if
compared to other slaves.
Semi_sync
Asynchronous
ReadOnly/
Backup Master
ReadOnly

MariaDB Multi-Source Replication
• It enables a slave to receive transactions from
multiple sources simultaneously.
• It can be used to backup multiple servers to a
single server, to merge table shards, and
consolidate data from multiple servers to a single
server.
• GTID helps to track transactions coming from
different servers / applications.
• Note: There is not conflict resolution. Last DML
to reach the slave ‘wins’
Master 2Master 1 Master 3
Slave

Combining MariaDB Replication Features
• Replication features can be combined to form more
resilient configurations
• Example:
– Implement semi-sync circular replication to increase data
resilience
– Use GTID to avoid duplicate transactions
– Use read-only slaves for read scale out
– Use MaxScale:
• Transactions will go to active master
• Reads will be offloaded to slaves
• Fast failover
– Writes go to a single master at any given time
Semi_sync
Asynchronous
Backup Master
ReadOnly

Synchronous Replication (Galera)
• Galera Replication is a synchronous multi-master
replication plug-in that enables a true master-
master setup for InnoDB.
• Every component of the cluster (node) is a share
nothing server
• All nodes are masters and applications can read and
write from any node
– NOTE: No conflict resolution
• A minimal Galera cluster consists of 3 nodes:
– A proper cluster needs to reach a quorum (i.e. the
majority of the nodes of the cluster)
• Transactions are synchronously committed on all
nodes.
MariaDB
MariaDB
MariaDB

• PROS
– A high availability solution with synchronous
replication, failover and resynchronization
– No loss of data
– All servers have up-to-date data (no slave lag)
– Read scalability, every node has latest data available
MariaDB
MariaDB
MariaDB

• CONS
– It only supports InnoDB
– The transaction rollback rate and hence the
transaction latency, can increase with the number of
the cluster nodes
– The cluster performs as its least performing node
• an overloaded master affects the performance of
the Galera cluster
– Network latency affects transaction throughput
MariaDB
MariaDB
MariaDB

MDBE
Cluster Failover
Clustered nodes cooperate
to remain in sync
With multiple master nodes,
reads and updates* both scale*
Synchronous replication with
optimistic locking delivers high
availability with little overhead
Fast failover because all
nodes remains synchronizedMariaDB
MariaDB
MariaDB
Load Balancing
and Failover
Application /
App Server

MaxScale Use Case
MDBE Cluster
Synchronous Replication
Each application server
uses only 1 connection
MaxScale selects one node
as “master” and the other
nodes as “slaves”
If the “master” node fails,
a new one can be elected
immediately
Galera Cluster + R/W split routing
Max
Scale

MariaDB HA: MaxScale
• Re-route traffic between
master and slave(s)
• Failover / slave promotion
- NEW in v2.2
• Switchover on command - NEW
in v2.2
• Implemented for Booking.com
• Part of MaxScale release
• All slaves are in sync,
easy to promote any slave
Read / Write Splitter
Detects Active Master
Binary Log
Server

Thank you
Ulrich Moser
ulrich.moser@mariadb.com

Choosing the right high availability strategy

More Related Content

What's hot (20)

Similar to Choosing the right high availability strategy (20)

More from MariaDB plc (20)

Recently uploaded (20)

Choosing the right high availability strategy