Percona XtraDB Cluster:
Failure Scenarios and their Recovery
Krunal Bauskar (PXC Lead, Percona)
Alkin Tezuysal (Sr. Technical Manager, Percona)
Who are we?
Krunal Bauskar
● Database enthusiast.
● Practicing databases (MySQL) for over a
decade now.
● Wide interest in data handling and
management.
● Worked on real big-data systems that powered applications at Yahoo, Oracle, and Teradata.
Alkin Tezuysal (@ask_dba)
● Open Source Database Evangelist
● Global Database Operations Expert
● Cloud Infrastructure Architect (AWS)
● Inspiring Technical and Strategic Leader
● Creative Team Builder
● Speaker, Mentor, and Coach
● Outdoor Enthusiast
Agenda
● Quick sniff at PXC
● Failure scenarios and their recovery
● PXC Genie: you wish, we implement
● Q & A
Quick Sniff at PXC
What is PXC?
● Auto-node provisioning
● Multi-master
● Performance tuned
● Enhanced security
● Flexible topology
● Network protection (geo-distributed)
Failure Scenarios and their recovery
Scenario: New node fails to connect to cluster
Joiner log
● The DONOR log has no trace of the JOINER trying to JOIN.
● The administrator reviews the configuration settings (IP addresses, etc.) and finds them sane and valid.
● Still the JOINER fails to connect.
● The culprit: SELinux/AppArmor.
● Don't confuse this error with SST, since the node has not yet been offered cluster membership; SST comes post-membership.
● Solution-1:
○ Set the SELinux/AppArmor mode to PERMISSIVE or DISABLED.
● Solution-2:
○ Configure the policy to allow access in ENFORCING mode. Related blogs:
■ "Lock Down: Enforcing SELinux with Percona XtraDB Cluster". It probes which permissions are needed and adds rules accordingly.
■ "Lock Down: Enforcing AppArmor with Percona XtraDB Cluster"
■ With this approach you can continue to run SELinux in enforcing mode. (You can also refer to the SELinux configuration on the Codership site.)
Takeaway: PXC can operate with SELinux/AppArmor.
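For quick triage, Solution-1 maps to standard SELinux tooling (RHEL-family commands shown; on Debian/Ubuntu the AppArmor equivalents such as aa-complain apply instead):

```shell
# Check the current SELinux mode and relax it until the next reboot:
getenforce          # prints Enforcing / Permissive / Disabled
setenforce 0        # switch to Permissive

# Persistent change: edit /etc/selinux/config and set
#   SELINUX=permissive
```

Prefer Solution-2 (a proper policy) for production; the commands above only confirm whether SELinux is the blocker.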
Scenario: Catching up with the cluster (SST, IST)
● SST: complete copy-over of the data directory.
○ SST has multiple external components: the SST script, XtraBackup, the network, etc. Some of these are outside the control of the PXC process.
● IST: transfer of only the missing write-sets (the node is already a member of the cluster).
○ Intrinsic to the PXC process space.
#1 Joiner log
● SST failed on the DONOR: wsrep_sst_auth was not set on the DONOR.
● wsrep_sst_auth should be set on the DONOR (users often set it on the JOINER only, and things still fail). Post SST, the JOINER will copy over the said user from the DONOR.
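A sketch of the relevant my.cnf entry on the DONOR (user name and password below are placeholders; in practice set it on every node, since any node can be asked to donate):

```ini
[mysqld]
wsrep_sst_auth = "sstuser:s3cretPass"
```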
#2 Donor log
Possible causes:
● The specified wsrep_sst_auth user doesn't exist.
● The credentials are wrong.
● The user has insufficient privileges.
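All three causes are fixed by creating the SST user with the privileges an XtraBackup-based SST needs (names are placeholders; check the PXC documentation for your exact version):

```sql
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```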
#3 Joiner log
● Trying to get an old-version JOINER to join from a new-version DONOR (not supported). The opposite is naturally allowed.
#4 Joiner log / Donor log
WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
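A sketch of the settings the warning asks for (IP address below is a placeholder for this node's address):

```ini
[mysqld]
wsrep_node_address        = 192.168.70.63
wsrep_sst_receive_address = 192.168.70.63:4444
```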
#5 Faulty SSL configuration.
Takeaways:
● PXC recommends the same configuration on all nodes of the cluster.
● Old DONOR with new JOINER is OK.
● XtraBackup is an external tool and has its own set of controllable configuration (passed through the PXC my.cnf).
● The SST user should be present on the DONOR.
● Look at both the DONOR and the JOINER log.
● wsrep_sst_receive_address/wsrep_node_address may be needed.
● Advanced encryption options must match: keyring on the DONOR with no keyring on the JOINER is not allowed.
● Ensure a stable network link between DONOR and JOINER.
● Check network rules (firewall, etc.). SST uses port 4444; IST uses 4568.
● Errors are often local to XtraBackup. Check the XtraBackup log file, which can give a hint of the error.
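The firewall point above can be sketched as follows (firewalld example; adapt to your firewall tooling):

```shell
firewall-cmd --permanent --add-port=3306/tcp   # MySQL client traffic
firewall-cmd --permanent --add-port=4567/tcp   # Galera group communication
firewall-cmd --permanent --add-port=4444/tcp   # SST
firewall-cmd --permanent --add-port=4568/tcp   # IST
firewall-cmd --reload
```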
Scenario: Cluster doesn't come up on restart
● All your nodes are located in the same data center (DC).
● The DC hits a power failure and all nodes are restarted.
● On restart, the recovery flow is executed to recover the wsrep coordinates.
● The cluster still fails to come up.
● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up.
● The other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.
● Galera/PXC expects the user to identify the node that has the latest data and then use that node to bootstrap. safe_to_bootstrap was added as a safety check for exactly this.
Recovery steps:
1. Identify the node that has the latest data (look at the wsrep-recovery coordinates).
2. Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory.
3. Bootstrap that node.
4. Restart the other, non-primary nodes (if they fail to auto-join).
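The steps above, sketched as commands (paths and service names vary by PXC version and distribution):

```shell
# 1. On each node, recover the wsrep coordinates and compare seqnos:
mysqld_safe --wsrep-recover        # logs "Recovered position: <uuid>:<seqno>"

# 2. On the node with the highest seqno, allow bootstrapping:
#    edit /var/lib/mysql/grastate.dat and set
#      safe_to_bootstrap: 1

# 3. Bootstrap that node (PXC systemd unit shown):
systemctl start mysql@bootstrap.service

# 4. Start the remaining nodes normally:
systemctl start mysql
```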
"I have the exact same setup but I never face this issue. My cluster auto-restores on power failure. Am I losing data or doing something wrong?"
● No: that is because you bootstrapped your node using wsrep_cluster_address=<node-ips> with pc.recovery=true (the default).
● The error is observed if you bootstrapped with wsrep_cluster_address="gcomm://", or with wsrep_cluster_address="<node-ips>" but pc.recovery=false.
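The auto-restoring configuration, sketched (node IPs are placeholders):

```ini
[mysqld]
# Full node list plus pc.recovery=true (the default) lets the cluster
# restore its primary component automatically after a full outage.
wsrep_cluster_address  = gcomm://192.168.70.61,192.168.70.62,192.168.70.63
wsrep_provider_options = "pc.recovery=true"
```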
Takeaway: PXC can auto-restart on DC failure, depending on the configuration options used.
Scenario: Data inconsistency
● There are 2 kinds of inconsistencies:
○ Physical inconsistency: hardware issues.
○ Logical inconsistency: data issues.
● Logical inconsistency is caused by cluster-local operations like locks, RSU, wsrep_on=off, etc.
● PXC has zero tolerance for inconsistency, so it immediately isolates a node on detecting inconsistency.
Case 1: inconsistency is detected on a single node.
● The rest of the cluster stays healthy and running; the inconsistent node is ISOLATED (SHUTDOWN).
Case 2: inconsistency is detected on multiple nodes.
● The nodes that detected the inconsistency shut down, with their state marked as UNSAFE; the remaining node drops to non-primary.
● The cluster is now effectively split into a majority group and a minority group.
Case 2a: the minority group has the GOOD DATA.
● If there are multiple nodes in the minority group, identify the node that has the latest data.
● Set pc.bootstrap=1 on the selected node. A single-node cluster is formed.
● Boot the other (majority) nodes; they will join through SST.
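The bootstrap step above maps to a single statement on the selected node:

```sql
-- Promote this node to a new primary component
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=1';
```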
CLUSTER RESTORED
Case 2b: the majority group has the GOOD DATA.
1. The nodes in the majority group are already SHUT DOWN. Initiate shutdown of the nodes from the minority group.
2. Fix grastate.dat for the nodes from the majority group (the consistency shutdown sequence has marked STATE=UNSAFE). A valid uuid can be copied over from a minority-group node.
3. Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join.
4. Remove grastate.dat on the minority-group nodes and restart them to join the newly formed cluster.
CLUSTER RESTORED
Scenario: Another aspect of data inconsistency
● One of the nodes from the minority group has transactions up to X; the other nodes have transactions only up to X - 1.
● Transaction X caused the inconsistency, so it never made it to those nodes.
● When that node tries to rejoin, membership is rejected: the incoming node has one more transaction than the cluster state.
● Now suppose the 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
● This time the node gets membership and even joins through IST. How?
● The node has transactions up to X and the cluster says it has transactions up to X + 3. Joining doesn't evaluate the data; it all depends on the seqno.
● The user failed to remove grastate.dat, and that caused all this confusion.
● The result: transactions with the same seqno but different updates on different nodes.
● The cluster is "restored", only to carry even more inconsistency (which may be detected in the future).
Takeaways:
● Avoid running node-local operations.
● If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover (don't fear SST, it is for your own good).
Scenario: Delayed purging
● Gcache is the staging area that holds replicated transactions.
● A transaction is replicated and staged; once all nodes have finished applying it, it can be removed from gcache.
● Each node, at a configured interval, notifies the other nodes about its transaction-committed status.
● Purging is controlled by 2 conditions:
○ gcache.keep_pages_size and gcache.keep_pages_count
○ static limits on the number of keys (1K), transactions (128), and bytes (128M).
● Accordingly, each node evaluates the cluster-level lowest watermark and initiates a gcache purge.
● Each node updates its local view and evaluates the cluster purge watermark. For example, with N1_purged_upto: x+1, N2_purged_upto: x+1, and N3_purged_upto: x, every node computes cluster-purge-watermark = x.
● Accordingly, all nodes purge their local gcache up to x.
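The watermark rule above can be sketched in a few lines (illustrative only, not PXC code):

```python
def cluster_purge_watermark(purged_upto):
    """The cluster can only purge up to the lowest per-node watermark:
    any node that lags holds the purge back for everyone."""
    return min(purged_upto.values())

marks = {"N1": 101, "N2": 101, "N3": 100}
print(cluster_purge_watermark(marks))  # -> 100: N3 holds the purge at x
```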
Gcache pages are created and purged. A typical log:
New COMMIT CUT 2360 after 2360 from 1
purging index up to 2360
releasing seqno from gcache 2360
Got commit cut from GCS: 2360
● Regularly, each node communicates its committed-up-to watermark, and then, per the protocol explained above, purging initiates.
What if a node STOPS processing transactions?
● FTWRL, RSU, and similar actions cause a node to pause and desync, and transactions start to pile up in gcache.
● Given that one of the nodes is not making progress, it does not emit its transaction-committed status.
● This freezes the cluster-purge-watermark, as the lowest transaction continues to hold it down.
● This means that though the other nodes are making progress, their galera cache continues to pile up.
● Galera has protection against this: if the number of transactions continues to grow beyond some hard limits, it will force a purge.
A typical force-purge log:
trx map size: 16511 - check if status.last_committed is incrementing
purging index up to 11264
releasing seqno from gcache 11264
● This in-built mechanism forces a purge: purging can get delayed, but it never halts.
● Purging means the entries are removed from the galera-maintained purge array. (Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)
Takeaways:
● All nodes should have the same configuration.
● Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt.
● Monitor that each node is making progress by keeping a watch on wsrep_last_applied/wsrep_last_committed.
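A tiny illustrative check (not PXC code) of the "is the node making progress" idea, fed with successive wsrep_last_committed samples:

```python
def is_making_progress(samples):
    """True if every successive wsrep_last_committed reading advanced."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))

print(is_making_progress([2358, 2359, 2360]))  # True: node is applying
print(is_making_progress([2360, 2360, 2360]))  # False: node is stalled
```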
Scenario: Network latency and related failures
● Why? What caused this weird behavior?
● The cluster is neither completely down nor completely up. What's going on?
● "All my writes are going to a single node; why am I still getting this conflict?"
● Normally, all nodes are able to reach each other.
● If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both of them.
● A node may also simply have a flaky network connection, or higher latency.
How Galera tracks peers (runtime configurable):
● Each node monitors the other nodes of the cluster every inactive_check_period (0.5 seconds).
● If a node is not reachable from a given node past peer_timeout (3s), the cluster enables relaying of messages.
● If all nodes vote for the said node's inactivity (suspect_timeout, 5s), it is pronounced DEAD.
● While suspect_timeout needs consensus, inactive_timeout (15s) doesn't: if the node doesn't respond within it, it is marked DEAD.
● If a node detects delay in the response from a given node, it waits for delayed_margin (1s) before adding that node to the delayed list.
● Even if the node becomes active again, it takes delayed_keep_period (30s) to remove it from the list.
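For a geo-distributed cluster, these timers can be raised via wsrep_provider_options; the values below are illustrative placeholders, not recommendations:

```ini
[mysqld]
# Galera expresses durations in ISO-8601 form (PT30S = 30 seconds)
wsrep_provider_options = "evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M"
```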
Experiment: a 3-node cluster where the RTT between n1 and n2 is < 1 ms, but the RTT to n3 is 7 sec.
● Start a sysbench workload.
● Given the 7-sec RTT between n1 and n3, each transaction needs 7 sec to complete, even though it gets an ACK from n2 in < 1 ms.
#1
● TPS hits 0 for 5 secs and then resumes.
● This is because a transaction is waiting for an ACK from n3 that would take 7 sec, but in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs.
● This temporarily makes the complete cluster unavailable.
● Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency.
● Of course, a latency of 7 sec is not realistic.
#2
● This time the latency was reduced from 7 to 2 sec. Because of this, every 2 sec (less than the 5-sec suspect_timeout) there was some communication with the node, and this prevented n3 from being marked as DEAD.
● After 10 secs the latency was reverted to its original value, so the snag is seen for 10 secs.
#3
● "All my writes are going to a single node; why am I still getting this conflict?"
● Because when the view changes, the initial position is re-assigned, thereby purging history from the certification index. A follow-up transaction that depends on an old (purged) transaction in the certification index faces this conflict.
Takeaways:
● The farthest node dictates how the cluster operates, so latency is important.
● A geo-distributed cluster has multi-millisecond latency, so timeouts should be configured to avoid marking a node as UNSTABLE due to the added latency.
● For geo-distributed clusters, segment and window settings are other parameters to configure.
● Flaky nodes are not good for overall transaction processing (they can cause certification failures).
Scenario: Blocking transactions and related failures
● A load of a table with N rows fails. Why?
○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster.
○ The current limit allows a transaction of size 2G (controlled through wsrep_max_ws_size).
● But have you ever wondered why that is a limitation?
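Regardless, the limit itself can be inspected and lowered at runtime (the value below is an illustrative 128 MB cap):

```sql
SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';
SET GLOBAL wsrep_max_ws_size = 134217728;
```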
A transaction goes through: execute → prepare → replicate → commit.
● The transaction first executes on the local node. During this execution it doesn't block other, non-dependent transactions.
● The transaction replicates after it has been executed on the local node but before it is committed. Replication involves transporting the write-set (binlog) to the other nodes.
On a replica node (N2) the flow is: apply → commit.
● To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all the nodes.
● This means that even though the transactions following a large transaction are non-dependent and have completed the APPLY action before the large transaction, they can't commit.
● The bigger the transaction, the bigger the backlog of small transactions; this will eventually cause FLOW_CONTROL.
● The first snag appears when the originating node blocks all resources to replicate a long-running transaction.
● The second snag appears when the replicating nodes emit flow control.
Takeaways:
● PXC doesn't like long-running transactions.
● For loading data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed.
● DDL can block/stall the complete cluster workload, as it needs to execute in total isolation. (The alternative is to use RSU, but be careful as it is a local operation on the node.)
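The 10K-row intermediate commit behavior is governed by wsrep_load_data_splitting (file and table names below are placeholders):

```sql
SHOW GLOBAL VARIABLES LIKE 'wsrep_load_data_splitting';
-- With it ON, LOAD DATA INFILE commits every 10K rows:
LOAD DATA INFILE '/tmp/rows.csv' INTO TABLE big_table;
```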
One last important note
● The majority of errors are due to misconfiguration or differences in configuration between nodes.
● PXC recommends the same configuration on all nodes of the cluster.
PXC Genie: You Wish. We Implement.
● We'd like to hear from you: what do you want next in PXC?
● Any specific module where you expect improvement?
● How can Percona help you with PXC or HA?
● Log an issue (mark it as a new improvement): https://jira.percona.com/projects/PXC/issue
● The PXC forum is another way to reach us.
Questions and Answers
Thank You Sponsors!!

PXC (Xtradb) Failure and Recovery

  • 1. Percona XtraDB Cluster: Failure Scenarios and their Recovery Krunal Bauskar (PXC Lead, Percona) Alkin Tezuysal (Sr. Technical Manager, Percona)
  • 2. 2 Who we are? Krunal Bauskar ● Database enthusiast. ● Practicing databases (MySQL) for over a decade now. ● Wide interest in data handling and management. ● Worked on some real big data that powered application @ Yahoo, Oracle, Teradata. Alkin Tezuysal (@ask_dba) ● Open Source Database Evangelist ● Global Database Operations Expert ● Cloud Infrastructure Architect AWS ● Inspiring Technical and Strategic Leader ● Creative Team Builder ● Speaker, Mentor, and Coach ● Outdoor Enthusiast
  • 3. 3 Agenda ● Quick sniff at PXC ● Failure Scenarios and their recovery ● PXC Genie - You wish. We implement. ● Q & A
  • 5. 5 What is PXC ? Auto-node provisioning Multi-master Performance tuned Enhanced Security Flexible topology Network protection (Geo-distributed)
  • 6. Failure Scenarios and their recovery
  • 7. 7 Scenario: New node fails to connect to cluster
  • 8. 8 Scenario: New node fails to connect to cluster Joiner log
  • 9. 9 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid.
  • 10. 10 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid. Still the JOINER fails to connect.
  • 11. 11 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid. SELinux/AppArmor
  • 12. 12 Scenario: New node fails to connect to cluster Joiner log Don’t confuse this error with SST, since the node has not yet been offered cluster membership. SST happens only after membership.
  • 13. 13 Scenario: New node fails to connect to cluster ● Solution-1: ○ Set the mode to PERMISSIVE or DISABLED
  • 14. 14 Scenario: New node fails to connect to cluster ● Solution-1: ○ Set the mode to PERMISSIVE or DISABLED ● Solution-2: ○ Configure the policy to allow access in ENFORCING mode. ○ Related blogs ■ “Lock Down: Enforcing SELinux with Percona XtraDB Cluster”. It probes which permissions are needed and adds rules accordingly. ■ “Lock Down: Enforcing AppArmor with Percona XtraDB Cluster” ■ Using this we can continue to run SELinux in ENFORCING mode. (You can also refer to the SELinux configuration notes on the Codership site.)
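A minimal sketch of Solution-1, assuming a Red Hat-style system where the persistent SELinux mode lives in /etc/selinux/config. On a live node you would run `setenforce 0` as root and then edit that file; here the edit is demonstrated on a scratch copy so it is visible without root:

```shell
# Hypothetical sketch: persisting SELinux PERMISSIVE mode.
# Real node: `setenforce 0` (immediate) + edit /etc/selinux/config (persistent).
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > /tmp/selinux-config
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /tmp/selinux-config
grep '^SELINUX=' /tmp/selinux-config
```

Note that DISABLED (unlike PERMISSIVE) requires a reboot to take effect.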
  • 15. 15 Scenario: New node fails to connect to cluster PXC can operate with SELinux/AppArmor.
  • 16. 16 Scenario: Catching up cluster (SST, IST)
  • 17. 17 Scenario: Catching up cluster (SST, IST) ● SST: complete copy-over of the data directory ○ SST has multiple external components: the SST script, XtraBackup (XB), the network layer, etc. Some of these are outside the control of the PXC process. ● IST: missing write-sets (as the node is already a member of the cluster). ○ Intrinsic to the PXC process space.
  • 18. 18 Scenario: Catching up cluster (SST, IST) #1 Joiner log
  • 19. 19 Scenario: Catching up cluster (SST, IST) #1 Joiner log SST failed on DONOR
  • 20. 20 Scenario: Catching up cluster (SST, IST) #1 Joiner log SST failed on DONOR wsrep_sst_auth not set on DONOR
  • 21. 21 Scenario: Catching up cluster (SST, IST) #1 Joiner log wsrep_sst_auth should be set on the DONOR (users often set it only on the JOINER, and things still fail). Post SST, the JOINER will copy over the said user from the DONOR.
  • 22. 22 Scenario: Catching up cluster (SST, IST) #2 Donor log
  • 23. 23 Scenario: Catching up cluster (SST, IST) #2 Donor log Possible causes: ● The specified wsrep_sst_auth user doesn’t exist. ● Credentials are wrong. ● Insufficient privileges.
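A minimal sketch of fixing those causes on the DONOR. User name and password are placeholders, and the exact privilege list depends on the PXC/XtraBackup version, so check the Percona documentation for your release:

```sql
-- Create the SST user referenced by wsrep_sst_auth (placeholder credentials)
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'passw0rd';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```

Then, in the DONOR's my.cnf: `wsrep_sst_auth = "sstuser:passw0rd"`.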
  • 24. 24 Scenario: Catching up cluster (SST, IST) #3 Joiner log
  • 25. 25 Scenario: Catching up cluster (SST, IST) #3 Joiner log Trying to get an old-version JOINER to join from a new-version DONOR (not supported). The opposite is naturally allowed.
  • 26. 26 Scenario: Catching up cluster (SST, IST) #4 Joiner log Donor log
  • 27. 27 Scenario: Catching up cluster (SST, IST) #4 Joiner log Donor log WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
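A my.cnf sketch for the JOINER addressing that warning (the IP address is a placeholder for this node's address):

```ini
[mysqld]
# Address other nodes use to reach this node
wsrep_node_address        = 192.168.70.63
# Address:port on which this JOINER receives the SST stream (4444 is the SST default)
wsrep_sst_receive_address = 192.168.70.63:4444
```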
  • 28. 28 Scenario: Catching up cluster (SST, IST) #5
  • 29. 29 Scenario: Catching up cluster (SST, IST) #5 Faulty SSL configuration
  • 30. 30 Scenario: Catching up cluster (SST, IST) PXC recommends the same configuration on all nodes of the cluster. Old DONOR - New JOINER (OK). XB is an external tool and has its own set of controllable configuration (passed through the PXC my.cnf). The SST user should be present on the DONOR. Look at both the DONOR and JOINER logs. wsrep_sst_receive_address/wsrep_node_address are needed. Advanced encryption options, like keyring on the DONOR with no keyring on the JOINER, are not allowed. Ensure a stable network link between DONOR and JOINER, and check network rules (firewall, etc.). SST uses port 4444; IST uses 4568. Often errors are local to XB; check the XB log file, which can hint at the error.
  • 31. 31 Scenario: Cluster doesn’t come up on restart
  • 32. 32 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates.
  • 33. 33 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates.
  • 34. 34 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates. Cluster still fails to come up.
  • 35. 35 Scenario: Cluster doesn’t come up on restart ● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up. ● Other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.
  • 36. 36 Scenario: Cluster doesn’t come up on restart ● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up. ● Other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node. Galera/PXC expects the user to identify the node that has the latest data and then use it to bootstrap; safe_to_bootstrap was added as a safety check.
  • 37. 37 Scenario: Cluster doesn’t come up on restart 1. Identify the node that has the latest data (look at the wsrep-recovery coordinates). 2. Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory. 3. Bootstrap that node. 4. Restart the other non-primary nodes (if they fail to auto-join).
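The steps above can be sketched as follows. Recovering the coordinates is done with `mysqld --wsrep-recover` (or by reading the recovered position from the error log); the safe_to_bootstrap flip is demonstrated on a scratch copy of grastate.dat, since the real file lives in the data directory and the uuid below is a placeholder:

```shell
# 1. On each node: mysqld --wsrep-recover  -> note the recovered seqno.
# 2. On the node with the highest seqno, flip the flag in grastate.dat
#    (shown here on a scratch copy; placeholder uuid):
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    7e8ff0b0-0aa3-11ea-8d1c-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /tmp/grastate.dat
grep '^safe_to_bootstrap' /tmp/grastate.dat
# 3. Bootstrap that node (e.g. systemctl start mysql@bootstrap.service),
#    then restart the remaining nodes if they fail to auto-join.
```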
  • 38. 38 Scenario: Cluster doesn’t come up on restart I have the exact same setup but I never face this issue. My cluster auto-restores on power failure. Am I losing data or doing something wrong?
  • 39. 39 Scenario: Cluster doesn’t come up on restart Because you have bootstrapped your node using wsrep_cluster_address=<node-ip> & pc.recovery=true (default)
  • 40. 40 Scenario: Cluster doesn’t come up on restart Because you have bootstrapped your node using wsrep_cluster_address=<node-ip> & pc.recovery=true (default). The error is observed if you have bootstrapped with: wsrep_cluster_address=”gcomm://” OR wsrep_cluster_address=”<node-ips>” but pc.recovery=false
  • 41. 41 Scenario: Cluster doesn’t come up on restart PXC can auto-restart on DC failure depending on the configuration options used.
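In my.cnf terms, the auto-recovering configuration looks like this (node IPs are placeholders; pc.recovery=true is the default):

```ini
[mysqld]
# List concrete node addresses instead of a bare gcomm://
wsrep_cluster_address  = gcomm://192.168.70.61,192.168.70.62,192.168.70.63
# pc.recovery=true (default) persists primary-component state in gvwstate.dat,
# allowing the cluster to re-form automatically after a full outage
wsrep_provider_options = "pc.recovery=true"
```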
  • 44. 44 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues
  • 45. 45 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues Logical inconsistency is caused by node-local operations like locks, RSU, wsrep_on=OFF, etc.
  • 46. 46 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues Logical inconsistency is caused by node-local operations like locks, RSU, wsrep_on=OFF, etc. PXC has zero tolerance for inconsistency, so it immediately isolates a node on detecting inconsistency.
  • 48. 48 Scenario: Data inconsistency Cluster is healthy and running. ISOLATED NODE (SHUTDOWN)
  • 52. 52 Scenario: Data inconsistency majority group / minority group. Minority group has GOOD DATA
  • 53. 53 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data.
  • 54. 54 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data. Set pc.bootstrap=1 on the selected node. A single-node cluster is formed.
  • 55. 55 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data. Set pc.bootstrap=1 on the selected node. Boot the other (former majority) nodes; they will join through SST.
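A sketch of the pc.bootstrap step on the selected node; it is a runtime provider option, so no restart is needed:

```sql
-- Turn the selected minority node into a new primary component
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=true';
```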
  • 58. 58 Scenario: Data inconsistency majority group / minority group. Majority group has GOOD DATA
  • 59. 59 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group.
  • 60. 60 Scenario: Data inconsistency A valid uuid can be copied over from a minority-group node. Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.)
  • 61. 61 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.) Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join.
  • 62. 62 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.) Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join. Remove grastate.dat on the minority-group nodes and restart them to join the newly formed cluster.
  • 64. 64 Scenario: Another aspect of data inconsistency
  • 65. 65 Scenario: Another aspect of data inconsistency One of the nodes from the minority group
  • 66. 66 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1
  • 67. 67 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1. Transaction X caused the inconsistency, so it never made it to these nodes.
  • 68. 68 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1
  • 69. 69 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1. Membership rejected, as the newly joining node has one more transaction than the cluster state.
  • 70. 70 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3.
  • 71. 71 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3.
  • 72. 72 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3. The node got membership and joined through IST too?
  • 73. 73 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3. The node has transactions up to X, and the cluster says it has transactions up to X + 3. Node joining doesn’t evaluate data; it all depends on the seqno.
  • 74. 74 Scenario: Another aspect of data inconsistency The user failed to remove grastate.dat, which caused all this confusion.
  • 75. 75 Scenario: Another aspect of data inconsistency trx-seqno=x / trx-seqno=x / trx-seqno=x — transactions with the same seqno but different updates.
  • 76. 76 Scenario: Another aspect of data inconsistency trx-seqno=x / trx-seqno=x / trx-seqno=x — transactions with the same seqno but different updates. The cluster is restored, only to enter more inconsistency (that may be detected in the future).
  • 77. 77 Scenario: Data inconsistency Avoid running node-local operations. If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover (don’t fear SST; it is for your own good).
  • 79. 79 Scenario: Delayed purging Gcache (staging area that holds replicated transactions)
  • 81. 81 Scenario: Delayed purging All nodes finished applying transactions
  • 82. 82 Scenario: Delayed purging Transactions can be removed from gcache
  • 83. 83 Scenario: Delayed purging ● Each node, at a configured interval, notifies the other nodes/cluster about its committed-transaction status. ● Purging is controlled by 2 conditions: ○ gcache.keep_pages_size and gcache.keep_pages_count ○ static limits on the number of keys (1K), transactions (128), bytes (128M). ● Accordingly, each node evaluates the cluster-level lowest watermark and initiates gcache purge.
  • 84. 84 Scenario: Delayed purging Each node updates a local map and evaluates the cluster purge watermark (maintained identically on every node): N1_purged_upto: x+1, N2_purged_upto: x+1, N3_purged_upto: x
  • 85. 85 Scenario: Delayed purging Accordingly, all nodes will purge the local gcache up to X (cluster-purge-water-mark = X on every node).
  • 86. 86 Scenario: Delayed purging gcache page created and purged.
  • 87. 87 Scenario: Delayed purging New COMMIT CUT 2360 after 2360 from 1 purging index up to 2360 releasing seqno from gcache 2360 Got commit cut from GCS: 2360
  • 88. 88 Scenario: Delayed purging New COMMIT CUT 2360 after 2360 from 1 purging index up to 2360 releasing seqno from gcache 2360 Got commit cut from GCS: 2360 Each node regularly communicates its committed-up-to watermark, and then, per the protocol explained, purging initiates.
  • 91. 91 Scenario: Delayed purging Gcache: one node STOPs processing transactions; transactions start to pile up in gcache. ● FTWRL, RSU … actions that cause a node to pause and desync.
  • 92. 92 Scenario: Delayed purging ● Given that one of the nodes is not making progress, it will not emit its committed-transaction status. ● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down. ● This means that, though other nodes are making progress, they will continue to pile up the galera cache.
  • 93. 93 Scenario: Delayed purging ● Given that one of the nodes is not making progress, it will not emit its committed-transaction status. ● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down. ● This means that, though other nodes are making progress, they will continue to pile up the galera cache. Galera has protection against this: if the number of transactions continues to grow beyond certain hard limits, it forces a purge.
  • 94. 94 Scenario: Delayed purging trx map size: 16511 - check if status.last_committed is incrementing purging index up to 11264 releasing seqno from gcache 11264 Built-in mechanism to force purge.
  • 95. 95 Scenario: Delayed purging trx map size: 16511 - check if status.last_committed is incrementing purging index up to 11264 releasing seqno from gcache 11264 Purging can get delayed, but it does not halt.
  • 97. 97 Scenario: Delayed purging Gcache STOP processing transactions. Purging means these entries are removed from the Galera-maintained purge array. (Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)
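These knobs live in wsrep_provider_options; a hedged my.cnf sketch (the values are illustrative, not recommendations):

```ini
[mysqld]
# gcache.size: ring-buffer size; keep_pages_size/keep_pages_count: how many
# overflow page files to retain on disk after purging (0 = remove when unused)
wsrep_provider_options = "gcache.size=1G;gcache.keep_pages_size=0;gcache.keep_pages_count=0"
```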
  • 98. 98 Scenario: Delayed purging All nodes should have the same configuration. Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt. Monitor that each node is making progress by keeping watch on wsrep_last_applied/wsrep_last_committed.
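Progress can be watched with the status counters named above; a stalled value on one node eventually stalls cluster-wide purging:

```sql
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';
SHOW GLOBAL STATUS LIKE 'wsrep_last_applied';
-- A growing receive queue is another early sign that a node is falling behind
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
```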
  • 99. 99 Scenario: Network latency and related failures
  • 100. 100 Scenario: Network latency and related failures
  • 101. 101 Scenario: Network latency and related failures
  • 102. 102 Scenario: Network latency and related failures Why? What caused this weird behavior?
  • 103. 103 Scenario: Network latency and related failures
  • 104. 104 Scenario: Network latency and related failures The cluster is neither completely down nor completely up. What’s going on? What is causing this weird behavior?
  • 105. 105 Scenario: Network latency and related failures All my writes go to a single node; why am I still getting this conflict?
  • 106. 106 Scenario: Network latency and related failures All nodes are able to reach each other.
  • 107. 107 Scenario: Network latency and related failures If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both.
  • 108. 108 Scenario: Network latency and related failures If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both.
  • 109. 109 Scenario: Network latency and related failures The said node has a flaky network connection, or rather, higher latency.
  • 110. 110 Scenario: Network latency and related failures Each node monitors the other nodes of the cluster every inactive_check_period (0.5 s). If a node is not reachable from a given node past peer_timeout (3 s), the cluster enables relaying of messages. If all nodes vote for the said node’s inactivity (suspect_timeout, 5 s), it is pronounced DEAD. While suspect_timeout needs consensus, inactive_timeout (15 s) doesn’t: if the node doesn’t respond, it is marked DEAD. If a node detects a delayed response from a given node, it tries to add it to the delayed list, waiting delayed_margin (1 s) before doing so. Even if the node becomes active again, it takes delayed_keep_period (30 s) to remove it from the list.
  • 111. 111 Scenario: Network latency and related failures The same timers: delayed-list handling (delayed_margin, delayed_keep_period) and monitoring/death detection (inactive_check_period, peer_timeout, suspect_timeout, inactive_timeout). Runtime configurable.
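These EVS timers are set through wsrep_provider_options; a sketch spelling out the defaults quoted above as ISO-8601 durations (values illustrative — tune them per your network, and check which are runtime-settable in your Galera version):

```ini
wsrep_provider_options = "evs.inactive_check_period=PT0.5S;evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S;evs.delayed_margin=PT1S;evs.delayed_keep_period=PT30S"
```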
  • 112. 112 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec
  • 113. 113 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec
  • 114. 114 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec. Start a sysbench workload.
  • 115. 115 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec. Start a sysbench workload. Given that the RTT between n1 and n3 is 7 sec, each trx needs 7 sec to complete, even though it gets an ACK from n2 in < 1 ms.
  • 116. 116 Scenario: Network latency and related failures #1
  • 117. 117 Scenario: Network latency and related failures ● TPS hits 0 for 5 secs and then resumes. #1
  • 118. 118 Scenario: Network latency and related failures ● TPS hits 0 for 5 secs and then resumes. ● This is because the trx is waiting for an ACK from n3, which would take 7 sec; in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs. #1
  • 119. 119 Scenario: Network latency and related failures ● This temporarily makes the complete cluster unavailable. ● Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency. ● Of course, a latency of 7 sec is not realistic. #1
  • 120. 120 Scenario: Network latency and related failures #2
  • 121. 121 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 2 sec; n2-n3 2 sec
  • 122. 122 Scenario: Network latency and related failures ● This time we reduced the latency from 7 to 2 sec. Because of this, every 2 sec (less than the 5-sec suspect_timeout) there was some communication with the node, and this prevented n3 from being marked DEAD. ● After 10 secs we reverted the latency to its original value, so the snag is seen for 10 secs. #2
  • 123. 123 Scenario: Network latency and related failures All my writes go to a single node; why am I still getting this conflict? #3
  • 124. 124 Scenario: Network latency and related failures Because when the view changes, the initial position is re-assigned, thereby purging history from the certification index. A follow-up transaction in certification that depends on an old trx (that got purged) faces this conflict. #3
  • 125. 125 Scenario: Network latency and related failures The farthest node dictates how the cluster operates, so latency is important. A geo-distributed cluster has milli-sec latencies, so timeouts should be configured to avoid marking a node UNSTABLE due to the added latency. For a geo-distributed cluster, segment and window settings are other params to configure. Flaky nodes are not good for overall transaction processing (they can cause certification failures).
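For a geo-distributed setup, a hedged my.cnf sketch of the segment and window settings mentioned above (the segment number identifies this node's data center; timeout and window values are illustrative):

```ini
[mysqld]
# One gmcast.segment value per DC keeps most replication traffic local to the segment
wsrep_provider_options = "gmcast.segment=1;evs.suspect_timeout=PT30S;evs.send_window=1024;evs.user_send_window=512"
```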
  • 126. 126 Scenario: Blocking Transaction and related failures
  • 127. 127 Scenario: Blocking Transaction and related failures ● Failed to load a table with N rows.
  • 128. 128 Scenario: Blocking Transaction and related failures ● Failed to load a table with N rows. ● Why? ○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster. ○ The current limit allows a transaction of size 2G (controlled through wsrep_max_ws_size). But have you ever wondered why that is a limitation?
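The limit is visible and tunable; lowering it below the 2G hard maximum is a sketch of defensive tuning, not a recommendation:

```sql
SHOW VARIABLES LIKE 'wsrep_max_ws_size';
-- Illustrative: reject write-sets above 1G instead of the 2G maximum
SET GLOBAL wsrep_max_ws_size = 1073741824;
```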
  • 129. 129 Scenario: Blocking Transaction and related failures execute -> prepare -> replicate -> commit
  • 130. 130 Scenario: Blocking Transaction and related failures execute -> prepare -> replicate -> commit. The transaction first executes on the local node; during this execution it doesn’t block other non-dependent transactions. The transaction replicates after it has been executed on the local node but before it is committed. Replication involves transporting the write-set (binlog) to other nodes.
  • 131. 131 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit
  • 132. 132 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all nodes.
  • 133. 133 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. This means that even though the transactions following the largest transaction are non-dependent and have completed the APPLY action before it, they can’t commit.
  • 134. 134 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. This means that even though the transactions following the largest transaction are non-dependent and have completed the APPLY action before it, they can’t commit.
  • 135. 135 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. The bigger the transaction, the bigger the backlog of small transactions; this eventually causes FLOW_CONTROL.
  • 136. 136 Scenario: Blocking Transaction and related failures
  • 137. 137 Scenario: Blocking Transaction and related failures
  • 138. 138 Scenario: Blocking Transaction and related failures The first snag appears when the originating node blocks all resources to replicate a long-running transaction. The second snag appears when the replicating node emits flow control.
  • 139. 139 Scenario: Blocking Transaction and related failures PXC doesn’t like long-running transactions. For loading data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed. DDL can block/stall the complete cluster workload, as it needs to execute in total isolation. (The alternative is RSU, but be careful, as it is an operation local to the node.)
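The intermediate-commit behavior mentioned above is governed by wsrep_load_data_splitting (ON by default in PXC; consult your version's docs, as the option has been deprecated in newer releases). The file path below is a placeholder:

```sql
SHOW VARIABLES LIKE 'wsrep_load_data_splitting';
-- With splitting ON, this commits every 10K rows instead of one huge write-set
LOAD DATA INFILE '/tmp/rows.csv' INTO TABLE t1;
```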
  • 140. 140 One last important note ● The majority of errors are due to misconfiguration or differences in the configuration of nodes. ● PXC recommends the same configuration on all nodes of the cluster.
  • 141. PXC Genie: You Wish. We implement
  • 142. 142 PXC Genie: You Wish. We Implement ● We’d like to hear from you: what do you want next in PXC? ● Any specific module where you expect improvement? ● How can Percona help you with PXC or HA? ● Log issues (mark them as new improvements): https://jira.percona.com/projects/PXC/issue ● The PXC forum is another way to reach us.