Percona XtraDB Cluster:
Failure Scenarios and their Recovery
Krunal Bauskar (PXC Lead, Percona)
Alkin Tezuysal (Sr. Technical Manager, Percona)
Who are we?
Krunal Bauskar
● Database enthusiast.
● Practicing databases (MySQL) for over a
decade now.
● Wide interest in data handling and
management.
● Worked on real big-data systems that powered applications at Yahoo, Oracle, and Teradata.
Alkin Tezuysal (@ask_dba)
● Open Source Database Evangelist
● Global Database Operations Expert
● Cloud Infrastructure Architect (AWS)
● Inspiring Technical and Strategic Leader
● Creative Team Builder
● Speaker, Mentor, and Coach
● Outdoor Enthusiast
Agenda
● Quick sniff at PXC
● Failure scenarios and their recovery
● PXC Genie: you wish, we implement
● Q & A
Quick Sniff at PXC
What is PXC?
● Auto-node provisioning
● Multi-master
● Performance tuned
● Enhanced security
● Flexible topology
● Network protection (geo-distributed)
Failure Scenarios and their recovery
Scenario: New node fails to connect to cluster
Joiner log
● The DONOR log has no trace of the JOINER trying to JOIN.
● The administrator reviews the configuration settings (IP addresses, etc.) and finds them sane and valid.
● Still the JOINER fails to connect.
● The culprit: SELinux/AppArmor.
● Don't confuse this error with SST, since the node has not yet been offered cluster membership; SST comes post-membership.
● Solution-1:
○ Set the SELinux/AppArmor mode to PERMISSIVE or DISABLED.
● Solution-2:
○ Configure the policy to allow access in ENFORCING mode. Related blogs:
■ "Lock Down: Enforcing SELinux with Percona XtraDB Cluster". It probes which permissions are needed and adds rules accordingly.
■ "Lock Down: Enforcing AppArmor with Percona XtraDB Cluster"
■ With this approach you can continue to run SELinux in enforcing mode. (You can also refer to the SELinux configuration on the Codership site.)
Takeaway: PXC can operate with SELinux/AppArmor.
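For quick triage, Solution-1 maps to standard SELinux tooling (RHEL-family commands shown; on Debian/Ubuntu the AppArmor equivalents such as aa-complain apply instead):

```shell
# Check the current SELinux mode and relax it until the next reboot:
getenforce          # prints Enforcing / Permissive / Disabled
setenforce 0        # switch to Permissive

# Persistent change: edit /etc/selinux/config and set
#   SELINUX=permissive
```

Prefer Solution-2 (a proper policy) for production; the commands above only confirm whether SELinux is the blocker.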
Scenario: Catching up with the cluster (SST, IST)
● SST: complete copy-over of the data directory.
○ SST has multiple external components: the SST script, XtraBackup, the network, etc. Some of these are outside the control of the PXC process.
● IST: transfer of only the missing write-sets (the node is already a member of the cluster).
○ Intrinsic to the PXC process space.
#1 Joiner log
● SST failed on the DONOR: wsrep_sst_auth was not set on the DONOR.
● wsrep_sst_auth should be set on the DONOR (users often set it on the JOINER only, and things still fail). Post SST, the JOINER will copy over the said user from the DONOR.
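A sketch of the relevant my.cnf entry on the DONOR (user name and password below are placeholders; in practice set it on every node, since any node can be asked to donate):

```ini
[mysqld]
wsrep_sst_auth = "sstuser:s3cretPass"
```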
#2 Donor log
Possible causes:
● The specified wsrep_sst_auth user doesn't exist.
● The credentials are wrong.
● The user has insufficient privileges.
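All three causes are fixed by creating the SST user with the privileges an XtraBackup-based SST needs (names are placeholders; check the PXC documentation for your exact version):

```sql
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```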
#3 Joiner log
● Trying to get an old-version JOINER to join from a new-version DONOR (not supported). The opposite is naturally allowed.
#4 Joiner log / Donor log
WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
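A sketch of the settings the warning asks for (IP address below is a placeholder for this node's address):

```ini
[mysqld]
wsrep_node_address        = 192.168.70.63
wsrep_sst_receive_address = 192.168.70.63:4444
```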
#5 Faulty SSL configuration.
Takeaways:
● PXC recommends the same configuration on all nodes of the cluster.
● Old DONOR with new JOINER is OK.
● XtraBackup is an external tool and has its own set of controllable configuration (passed through the PXC my.cnf).
● The SST user should be present on the DONOR.
● Look at both the DONOR and the JOINER log.
● wsrep_sst_receive_address/wsrep_node_address may be needed.
● Advanced encryption options must match: keyring on the DONOR with no keyring on the JOINER is not allowed.
● Ensure a stable network link between DONOR and JOINER.
● Check network rules (firewall, etc.). SST uses port 4444; IST uses 4568.
● Errors are often local to XtraBackup. Check the XtraBackup log file, which can give a hint of the error.
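The firewall point above can be sketched as follows (firewalld example; adapt to your firewall tooling):

```shell
firewall-cmd --permanent --add-port=3306/tcp   # MySQL client traffic
firewall-cmd --permanent --add-port=4567/tcp   # Galera group communication
firewall-cmd --permanent --add-port=4444/tcp   # SST
firewall-cmd --permanent --add-port=4568/tcp   # IST
firewall-cmd --reload
```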
Scenario: Cluster doesn't come up on restart
● All your nodes are located in the same data center (DC).
● The DC hits a power failure and all nodes are restarted.
● On restart, the recovery flow is executed to recover the wsrep coordinates.
● The cluster still fails to come up.
● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up.
● The other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.
● Galera/PXC expects the user to identify the node that has the latest data and then use that node to bootstrap. safe_to_bootstrap was added as a safety check for exactly this.
Recovery steps:
1. Identify the node that has the latest data (look at the wsrep-recovery coordinates).
2. Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory.
3. Bootstrap that node.
4. Restart the other, non-primary nodes (if they fail to auto-join).
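The steps above, sketched as commands (paths and service names vary by PXC version and distribution):

```shell
# 1. On each node, recover the wsrep coordinates and compare seqnos:
mysqld_safe --wsrep-recover        # logs "Recovered position: <uuid>:<seqno>"

# 2. On the node with the highest seqno, allow bootstrapping:
#    edit /var/lib/mysql/grastate.dat and set
#      safe_to_bootstrap: 1

# 3. Bootstrap that node (PXC systemd unit shown):
systemctl start mysql@bootstrap.service

# 4. Start the remaining nodes normally:
systemctl start mysql
```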
"I have the exact same setup but I never face this issue. My cluster auto-restores on power failure. Am I losing data or doing something wrong?"
● No: that is because you bootstrapped your node using wsrep_cluster_address=<node-ips> with pc.recovery=true (the default).
● The error is observed if you bootstrapped with wsrep_cluster_address="gcomm://", or with wsrep_cluster_address="<node-ips>" but pc.recovery=false.
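The auto-restoring configuration, sketched (node IPs are placeholders):

```ini
[mysqld]
# Full node list plus pc.recovery=true (the default) lets the cluster
# restore its primary component automatically after a full outage.
wsrep_cluster_address  = gcomm://192.168.70.61,192.168.70.62,192.168.70.63
wsrep_provider_options = "pc.recovery=true"
```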
Takeaway: PXC can auto-restart on DC failure, depending on the configuration options used.
Scenario: Data inconsistency
● There are 2 kinds of inconsistencies:
○ Physical inconsistency: hardware issues.
○ Logical inconsistency: data issues.
● Logical inconsistency is caused by cluster-local operations like locks, RSU, wsrep_on=off, etc.
● PXC has zero tolerance for inconsistency, so it immediately isolates a node on detecting inconsistency.
Case 1: inconsistency is detected on a single node.
● The rest of the cluster stays healthy and running; the inconsistent node is ISOLATED (SHUTDOWN).
Case 2: inconsistency is detected on multiple nodes.
● The nodes that detected the inconsistency shut down, with their state marked as UNSAFE; the remaining node drops to non-primary.
● The cluster is now effectively split into a majority group and a minority group.
Case 2a: the minority group has the GOOD DATA.
● If there are multiple nodes in the minority group, identify the node that has the latest data.
● Set pc.bootstrap=1 on the selected node. A single-node cluster is formed.
● Boot the other (majority) nodes; they will join through SST.
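The bootstrap step above maps to a single statement on the selected node:

```sql
-- Promote this node to a new primary component
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=1';
```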
CLUSTER RESTORED
Case 2b: the majority group has the GOOD DATA.
1. The nodes in the majority group are already SHUT DOWN. Initiate shutdown of the nodes from the minority group.
2. Fix grastate.dat for the nodes from the majority group (the consistency shutdown sequence has marked STATE=UNSAFE). A valid uuid can be copied over from a minority-group node.
3. Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join.
4. Remove grastate.dat on the minority-group nodes and restart them to join the newly formed cluster.
CLUSTER RESTORED
Scenario: Another aspect of data inconsistency
● One of the nodes from the minority group has transactions up to X; the other nodes have transactions only up to X - 1.
● Transaction X caused the inconsistency, so it never made it to those nodes.
● When that node tries to rejoin, membership is rejected: the incoming node has one more transaction than the cluster state.
● Now suppose the 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
● This time the node gets membership and even joins through IST. How?
● The node has transactions up to X and the cluster says it has transactions up to X + 3. Joining doesn't evaluate the data; it all depends on the seqno.
● The user failed to remove grastate.dat, and that caused all this confusion.
● The result: transactions with the same seqno but different updates on different nodes.
● The cluster is "restored", only to carry even more inconsistency (which may be detected in the future).
Takeaways:
● Avoid running node-local operations.
● If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover (don't fear SST, it is for your own good).
Scenario: Delayed purging
● Gcache is the staging area that holds replicated transactions.
● A transaction is replicated and staged; once all nodes have finished applying it, it can be removed from gcache.
● Each node, at a configured interval, notifies the other nodes about its transaction-committed status.
● Purging is controlled by 2 conditions:
○ gcache.keep_pages_size and gcache.keep_pages_count
○ static limits on the number of keys (1K), transactions (128), and bytes (128M).
● Accordingly, each node evaluates the cluster-level lowest watermark and initiates a gcache purge.
● Each node updates its local view and evaluates the cluster purge watermark. For example, with N1_purged_upto: x+1, N2_purged_upto: x+1, and N3_purged_upto: x, every node computes cluster-purge-watermark = x.
● Accordingly, all nodes purge their local gcache up to x.
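The watermark rule above can be sketched in a few lines (illustrative only, not PXC code):

```python
def cluster_purge_watermark(purged_upto):
    """The cluster can only purge up to the lowest per-node watermark:
    any node that lags holds the purge back for everyone."""
    return min(purged_upto.values())

marks = {"N1": 101, "N2": 101, "N3": 100}
print(cluster_purge_watermark(marks))  # -> 100: N3 holds the purge at x
```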
Gcache pages are created and purged. A typical log:
New COMMIT CUT 2360 after 2360 from 1
purging index up to 2360
releasing seqno from gcache 2360
Got commit cut from GCS: 2360
● Regularly, each node communicates its committed-up-to watermark, and then, per the protocol explained above, purging initiates.
What if a node STOPS processing transactions?
● FTWRL, RSU, and similar actions cause a node to pause and desync, and transactions start to pile up in gcache.
● Given that one of the nodes is not making progress, it does not emit its transaction-committed status.
● This freezes the cluster-purge-watermark, as the lowest transaction continues to hold it down.
● This means that though the other nodes are making progress, their galera cache continues to pile up.
● Galera has protection against this: if the number of transactions continues to grow beyond some hard limits, it will force a purge.
A typical force-purge log:
trx map size: 16511 - check if status.last_committed is incrementing
purging index up to 11264
releasing seqno from gcache 11264
● This in-built mechanism forces a purge: purging can get delayed, but it never halts.
● Purging means the entries are removed from the galera-maintained purge array. (Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)
Takeaways:
● All nodes should have the same configuration.
● Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt.
● Monitor that each node is making progress by keeping a watch on wsrep_last_applied/wsrep_last_committed.
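A tiny illustrative check (not PXC code) of the "is the node making progress" idea, fed with successive wsrep_last_committed samples:

```python
def is_making_progress(samples):
    """True if every successive wsrep_last_committed reading advanced."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))

print(is_making_progress([2358, 2359, 2360]))  # True: node is applying
print(is_making_progress([2360, 2360, 2360]))  # False: node is stalled
```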
Scenario: Network latency and related failures
● Why? What caused this weird behavior?
● The cluster is neither completely down nor completely up. What's going on?
● "All my writes are going to a single node; why am I still getting this conflict?"
● Normally, all nodes are able to reach each other.
● If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both of them.
● A node may also simply have a flaky network connection, or higher latency.
How Galera tracks peers (runtime configurable):
● Each node monitors the other nodes of the cluster every inactive_check_period (0.5 seconds).
● If a node is not reachable from a given node past peer_timeout (3s), the cluster enables relaying of messages.
● If all nodes vote for the said node's inactivity (suspect_timeout, 5s), it is pronounced DEAD.
● While suspect_timeout needs consensus, inactive_timeout (15s) doesn't: if the node doesn't respond within it, it is marked DEAD.
● If a node detects delay in the response from a given node, it waits for delayed_margin (1s) before adding that node to the delayed list.
● Even if the node becomes active again, it takes delayed_keep_period (30s) to remove it from the list.
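For a geo-distributed cluster, these timers can be raised via wsrep_provider_options; the values below are illustrative placeholders, not recommendations:

```ini
[mysqld]
# Galera expresses durations in ISO-8601 form (PT30S = 30 seconds)
wsrep_provider_options = "evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M"
```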
Experiment: a 3-node cluster where the RTT between n1 and n2 is < 1 ms, but the RTT to n3 is 7 sec.
● Start a sysbench workload.
● Given the 7-sec RTT between n1 and n3, each transaction needs 7 sec to complete, even though it gets an ACK from n2 in < 1 ms.
#1
● TPS hits 0 for 5 secs and then resumes.
● This is because a transaction is waiting for an ACK from n3 that would take 7 sec, but in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs.
● This temporarily makes the complete cluster unavailable.
● Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency.
● Of course, a latency of 7 sec is not realistic.
#2
● This time the latency was reduced from 7 to 2 sec. Because of this, every 2 sec (less than the 5-sec suspect_timeout) there was some communication with the node, and this prevented n3 from being marked as DEAD.
● After 10 secs the latency was reverted to its original value, so the snag is seen for 10 secs.
#3
● "All my writes are going to a single node; why am I still getting this conflict?"
● Because when the view changes, the initial position is re-assigned, thereby purging history from the certification index. A follow-up transaction that depends on an old (purged) transaction in the certification index faces this conflict.
Takeaways:
● The farthest node dictates how the cluster operates, so latency is important.
● A geo-distributed cluster has multi-millisecond latency, so timeouts should be configured to avoid marking a node as UNSTABLE due to the added latency.
● For geo-distributed clusters, segment and window settings are other parameters to configure.
● Flaky nodes are not good for overall transaction processing (they can cause certification failures).
Scenario: Blocking transactions and related failures
● A load of a table with N rows fails. Why?
○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster.
○ The current limit allows a transaction of size 2G (controlled through wsrep_max_ws_size).
● But have you ever wondered why that is a limitation?
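Regardless, the limit itself can be inspected and lowered at runtime (the value below is an illustrative 128 MB cap):

```sql
SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';
SET GLOBAL wsrep_max_ws_size = 134217728;
```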
A transaction goes through: execute → prepare → replicate → commit.
● The transaction first executes on the local node. During this execution it doesn't block other, non-dependent transactions.
● The transaction replicates after it has been executed on the local node but before it is committed. Replication involves transporting the write-set (binlog) to the other nodes.
On a replica node (N2) the flow is: apply → commit.
● To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all the nodes.
● This means that even though the transactions following a large transaction are non-dependent and have completed the APPLY action before the large transaction, they can't commit.
● The bigger the transaction, the bigger the backlog of small transactions; this will eventually cause FLOW_CONTROL.
● The first snag appears when the originating node blocks all resources to replicate a long-running transaction.
● The second snag appears when the replicating nodes emit flow control.
Takeaways:
● PXC doesn't like long-running transactions.
● For loading data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed.
● DDL can block/stall the complete cluster workload, as it needs to execute in total isolation. (The alternative is to use RSU, but be careful as it is a local operation on the node.)
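The 10K-row intermediate commit behavior is governed by wsrep_load_data_splitting (file and table names below are placeholders):

```sql
SHOW GLOBAL VARIABLES LIKE 'wsrep_load_data_splitting';
-- With it ON, LOAD DATA INFILE commits every 10K rows:
LOAD DATA INFILE '/tmp/rows.csv' INTO TABLE big_table;
```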
One last important note
● The majority of errors are due to misconfiguration or differences in configuration between nodes.
● PXC recommends the same configuration on all nodes of the cluster.
PXC Genie: You Wish. We Implement.
● We'd like to hear from you: what do you want next in PXC?
● Any specific module where you expect improvement?
● How can Percona help you with PXC or HA?
● Log an issue (mark it as a new improvement): https://jira.percona.com/projects/PXC/issue
● The PXC forum is another way to reach us.
Questions and Answers
Thank You Sponsors!!

PXC (Xtradb) Failure and Recovery

  • 1. Percona XtraDB Cluster: Failure Scenarios and their Recovery Krunal Bauskar (PXC Lead, Percona) Alkin Tezuysal (Sr. Technical Manager, Percona)
  • 2. 2 Who we are? Krunal Bauskar ● Database enthusiast. ● Practicing databases (MySQL) for over a decade now. ● Wide interest in data handling and management. ● Worked on some real big data that powered application @ Yahoo, Oracle, Teradata. Alkin Tezuysal (@ask_dba) ● Open Source Database Evangelist ● Global Database Operations Expert ● Cloud Infrastructure Architect AWS ● Inspiring Technical and Strategic Leader ● Creative Team Builder ● Speaker, Mentor, and Coach ● Outdoor Enthusiast
  • 3. 3 Agenda ● Quick sniff at PXC ● Failure Scenarios and their recovery ● PXC Genie - You wish. We implement. ● Q & A
  • 5. 5 What is PXC ? Auto-node provisioning Multi-master Performance tuned Enhanced Security Flexible topology Network protection (Geo-distributed)
  • 6. Failure Scenarios and their recovery
  • 7. 7 Scenario: New node fails to connect to cluster
  • 8. 8 Scenario: New node fails to connect to cluster Joiner log
  • 9. 9 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid.
  • 10. 10 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid. Still the JOINER fails to connect.
  • 11. 11 Scenario: New node fails to connect to cluster Joiner log DONOR log doesn’t have any traces of the JOINER trying to JOIN. Administrator verifies that configuration settings such as IP addresses are sane and valid. SELinux/AppArmor
  • 12. 12 Scenario: New node fails to connect to cluster Joiner log Don’t confuse this error with SST, since the node has not yet been offered cluster membership. SST happens only after membership.
  • 13. 13 Scenario: New node fails to connect to cluster ● Solution-1: ○ Set the mode to PERMISSIVE or DISABLED
  • 14. 14 Scenario: New node fails to connect to cluster ● Solution-1: ○ Set the mode to PERMISSIVE or DISABLED ● Solution-2: ○ Configure the policy to allow access in ENFORCING mode. ○ Related blogs ■ “Lock Down: Enforcing SELinux with Percona XtraDB Cluster”. It probes which permissions are needed and adds rules accordingly. ■ “Lock Down: Enforcing AppArmor with Percona XtraDB Cluster” ■ Using this we can continue to run SELinux in ENFORCING mode. (You can also refer to the SELinux configuration notes on the Codership site.)
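A minimal sketch of Solution-1, assuming a Red Hat-style system where the persistent SELinux mode lives in /etc/selinux/config. On a live node you would run `setenforce 0` as root and then edit that file; here the edit is demonstrated on a scratch copy so it is visible without root:

```shell
# Hypothetical sketch: persisting SELinux PERMISSIVE mode.
# Real node: `setenforce 0` (immediate) + edit /etc/selinux/config (persistent).
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > /tmp/selinux-config
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /tmp/selinux-config
grep '^SELINUX=' /tmp/selinux-config
```

Note that DISABLED (unlike PERMISSIVE) requires a reboot to take effect.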
  • 15. 15 Scenario: New node fails to connect to cluster PXC can operate with SELinux/AppArmor.
  • 16. 16 Scenario: Catching up cluster (SST, IST)
  • 17. 17 Scenario: Catching up cluster (SST, IST) ● SST: complete copy-over of the data directory ○ SST has multiple external components: the SST script, XtraBackup (XB), the network layer, etc. Some of these are outside the control of the PXC process. ● IST: missing write-sets (as the node is already a member of the cluster). ○ Intrinsic to the PXC process space.
  • 18. 18 Scenario: Catching up cluster (SST, IST) #1 Joiner log
  • 19. 19 Scenario: Catching up cluster (SST, IST) #1 Joiner log SST failed on DONOR
  • 20. 20 Scenario: Catching up cluster (SST, IST) #1 Joiner log SST failed on DONOR wsrep_sst_auth not set on DONOR
  • 21. 21 Scenario: Catching up cluster (SST, IST) #1 Joiner log wsrep_sst_auth should be set on the DONOR (users often set it only on the JOINER, and things still fail). Post SST, the JOINER will copy over the said user from the DONOR.
  • 22. 22 Scenario: Catching up cluster (SST, IST) #2 Donor log
  • 23. 23 Scenario: Catching up cluster (SST, IST) #2 Donor log Possible causes: ● The specified wsrep_sst_auth user doesn’t exist. ● Credentials are wrong. ● Insufficient privileges.
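A minimal sketch of fixing those causes on the DONOR. User name and password are placeholders, and the exact privilege list depends on the PXC/XtraBackup version, so check the Percona documentation for your release:

```sql
-- Create the SST user referenced by wsrep_sst_auth (placeholder credentials)
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'passw0rd';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```

Then, in the DONOR's my.cnf: `wsrep_sst_auth = "sstuser:passw0rd"`.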
  • 24. 24 Scenario: Catching up cluster (SST, IST) #3 Joiner log
  • 25. 25 Scenario: Catching up cluster (SST, IST) #3 Joiner log Trying to get an old-version JOINER to join from a new-version DONOR (not supported). The opposite is naturally allowed.
  • 26. 26 Scenario: Catching up cluster (SST, IST) #4 Joiner log Donor log
  • 27. 27 Scenario: Catching up cluster (SST, IST) #4 Joiner log Donor log WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
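A my.cnf sketch for the JOINER addressing that warning (the IP address is a placeholder for this node's address):

```ini
[mysqld]
# Address other nodes use to reach this node
wsrep_node_address        = 192.168.70.63
# Address:port on which this JOINER receives the SST stream (4444 is the SST default)
wsrep_sst_receive_address = 192.168.70.63:4444
```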
  • 28. 28 Scenario: Catching up cluster (SST, IST) #5
  • 29. 29 Scenario: Catching up cluster (SST, IST) #5 Faulty SSL configuration
  • 30. 30 Scenario: Catching up cluster (SST, IST) PXC recommends the same configuration on all nodes of the cluster. Old DONOR - New JOINER (OK). XB is an external tool and has its own set of controllable configuration (passed through the PXC my.cnf). The SST user should be present on the DONOR. Look at both the DONOR and JOINER logs. wsrep_sst_receive_address/wsrep_node_address are needed. Advanced encryption options, like keyring on the DONOR with no keyring on the JOINER, are not allowed. Ensure a stable network link between DONOR and JOINER, and check network rules (firewall, etc.). SST uses port 4444; IST uses 4568. Often errors are local to XB; check the XB log file, which can hint at the error.
  • 31. 31 Scenario: Cluster doesn’t come up on restart
  • 32. 32 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates.
  • 33. 33 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates.
  • 34. 34 Scenario: Cluster doesn’t come up on restart ● All your nodes are located in the same data center (DC). ● The DC hits a power failure and all nodes are restarted. ● On restart, the recovery flow is executed to recover the wsrep coordinates. Cluster still fails to come up.
  • 35. 35 Scenario: Cluster doesn’t come up on restart ● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up. ● Other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.
  • 36. 36 Scenario: Cluster doesn’t come up on restart ● A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up. ● Other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node. Galera/PXC expects the user to identify the node that has the latest data and then use it to bootstrap; safe_to_bootstrap was added as a safety check.
  • 37. 37 Scenario: Cluster doesn’t come up on restart 1. Identify the node that has the latest data (look at the wsrep-recovery coordinates). 2. Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory. 3. Bootstrap that node. 4. Restart the other non-primary nodes (if they fail to auto-join).
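The steps above can be sketched as follows. Recovering the coordinates is done with `mysqld --wsrep-recover` (or by reading the recovered position from the error log); the safe_to_bootstrap flip is demonstrated on a scratch copy of grastate.dat, since the real file lives in the data directory and the uuid below is a placeholder:

```shell
# 1. On each node: mysqld --wsrep-recover  -> note the recovered seqno.
# 2. On the node with the highest seqno, flip the flag in grastate.dat
#    (shown here on a scratch copy; placeholder uuid):
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    7e8ff0b0-0aa3-11ea-8d1c-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /tmp/grastate.dat
grep '^safe_to_bootstrap' /tmp/grastate.dat
# 3. Bootstrap that node (e.g. systemctl start mysql@bootstrap.service),
#    then restart the remaining nodes if they fail to auto-join.
```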
  • 38. 38 Scenario: Cluster doesn’t come up on restart I have the exact same setup but I never face this issue. My cluster auto-restores on power failure. Am I losing data or doing something wrong?
  • 39. 39 Scenario: Cluster doesn’t come up on restart Because you have bootstrapped your node using wsrep_cluster_address=<node-ip> & pc.recovery=true (default)
  • 40. 40 Scenario: Cluster doesn’t come up on restart Because you have bootstrapped your node using wsrep_cluster_address=<node-ip> & pc.recovery=true (default). The error is observed if you have bootstrapped with: wsrep_cluster_address=”gcomm://” OR wsrep_cluster_address=”<node-ips>” but pc.recovery=false
  • 41. 41 Scenario: Cluster doesn’t come up on restart PXC can auto-restart on DC failure depending on the configuration options used.
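In my.cnf terms, the auto-recovering configuration looks like this (node IPs are placeholders; pc.recovery=true is the default):

```ini
[mysqld]
# List concrete node addresses instead of a bare gcomm://
wsrep_cluster_address  = gcomm://192.168.70.61,192.168.70.62,192.168.70.63
# pc.recovery=true (default) persists primary-component state in gvwstate.dat,
# allowing the cluster to re-form automatically after a full outage
wsrep_provider_options = "pc.recovery=true"
```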
  • 44. 44 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues
  • 45. 45 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues Logical inconsistency is caused by node-local operations like locks, RSU, wsrep_on=OFF, etc.
  • 46. 46 Scenario: Data inconsistency ● 2 kinds of inconsistencies ○ Physical inconsistency: hardware issues ○ Logical inconsistency: data issues Logical inconsistency is caused by node-local operations like locks, RSU, wsrep_on=OFF, etc. PXC has zero tolerance for inconsistency, so it immediately isolates a node on detecting inconsistency.
  • 48. 48 Scenario: Data inconsistency Cluster is healthy and running. ISOLATED NODE (SHUTDOWN)
  • 52. 52 Scenario: Data inconsistency majority group / minority group. Minority group has GOOD DATA
  • 53. 53 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data.
  • 54. 54 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data. Set pc.bootstrap=1 on the selected node. A single-node cluster is formed.
  • 55. 55 Scenario: Data inconsistency If there are multiple nodes in the minority group, identify the node that has the latest data. Set pc.bootstrap=1 on the selected node. Boot the other (former majority) nodes; they will join through SST.
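A sketch of the pc.bootstrap step on the selected node; it is a runtime provider option, so no restart is needed:

```sql
-- Turn the selected minority node into a new primary component
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=true';
```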
  • 58. 58 Scenario: Data inconsistency majority group / minority group. Majority group has GOOD DATA
  • 59. 59 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group.
  • 60. 60 Scenario: Data inconsistency A valid uuid can be copied over from a minority-group node. Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.)
  • 61. 61 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.) Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join.
  • 62. 62 Scenario: Data inconsistency Nodes in the majority group are already SHUT DOWN. Initiate SHUTDOWN of the nodes from the minority group. Fix grastate.dat for the nodes from the majority group. (The inconsistency-triggered shutdown sequence has marked the state UNSAFE.) Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join. Remove grastate.dat on the minority-group nodes and restart them to join the newly formed cluster.
  • 64. 64 Scenario: Another aspect of data inconsistency
  • 65. 65 Scenario: Another aspect of data inconsistency One of the nodes from the minority group
  • 66. 66 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1
  • 67. 67 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1. Transaction X caused the inconsistency, so it never made it to these nodes.
  • 68. 68 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1
  • 69. 69 Scenario: Another aspect of data inconsistency Transactions up to X / Transactions up to X - 1. Membership rejected, as the newly joining node has one more transaction than the cluster state.
  • 70. 70 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3.
  • 71. 71 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3.
  • 72. 72 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3. The node got membership and joined through IST too?
  • 73. 73 Scenario: Another aspect of data inconsistency The 2-node cluster is up and has started processing transactions, moving the cluster state from X -> X + 3. The node has transactions up to X, and the cluster says it has transactions up to X + 3. Node joining doesn’t evaluate data; it all depends on the seqno.
  • 74. 74 Scenario: Another aspect of data inconsistency The user failed to remove grastate.dat, which caused all this confusion.
  • 75. 75 Scenario: Another aspect of data inconsistency trx-seqno=x / trx-seqno=x / trx-seqno=x — transactions with the same seqno but different updates.
  • 76. 76 Scenario: Another aspect of data inconsistency trx-seqno=x / trx-seqno=x / trx-seqno=x — transactions with the same seqno but different updates. The cluster is restored, only to enter more inconsistency (that may be detected in the future).
  • 77. 77 Scenario: Data inconsistency Avoid running node-local operations. If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover (don’t fear SST; it is for your own good).
  • 79. 79 Scenario: Delayed purging Gcache (staging area that holds replicated transactions)
  • 81. 81 Scenario: Delayed purging All nodes finished applying transactions
  • 82. 82 Scenario: Delayed purging Transactions can be removed from gcache
  • 83. 83 Scenario: Delayed purging ● Each node, at a configured interval, notifies the other nodes/cluster about its committed-transaction status. ● Purging is controlled by 2 conditions: ○ gcache.keep_pages_size and gcache.keep_pages_count ○ static limits on the number of keys (1K), transactions (128), bytes (128M). ● Accordingly, each node evaluates the cluster-level lowest watermark and initiates gcache purge.
  • 84. 84 Scenario: Delayed purging Each node updates a local map and evaluates the cluster purge watermark (maintained identically on every node): N1_purged_upto: x+1, N2_purged_upto: x+1, N3_purged_upto: x
  • 85. 85 Scenario: Delayed purging Accordingly, all nodes will purge the local gcache up to X (cluster-purge-water-mark = X on every node).
  • 86. 86 Scenario: Delayed purging gcache page created and purged.
  • 87. 87 Scenario: Delayed purging New COMMIT CUT 2360 after 2360 from 1 purging index up to 2360 releasing seqno from gcache 2360 Got commit cut from GCS: 2360
  • 88. 88 Scenario: Delayed purging New COMMIT CUT 2360 after 2360 from 1 purging index up to 2360 releasing seqno from gcache 2360 Got commit cut from GCS: 2360 Each node regularly communicates its committed-up-to watermark, and then, per the protocol explained, purging initiates.
  • 91. 91 Scenario: Delayed purging Gcache: one node STOPs processing transactions; transactions start to pile up in gcache. ● FTWRL, RSU … actions that cause a node to pause and desync.
  • 92. 92 Scenario: Delayed purging ● Given that one of the nodes is not making progress, it will not emit its committed-transaction status. ● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down. ● This means that, though other nodes are making progress, they will continue to pile up the galera cache.
  • 93. 93 Scenario: Delayed purging ● Given that one of the nodes is not making progress, it will not emit its committed-transaction status. ● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down. ● This means that, though other nodes are making progress, they will continue to pile up the galera cache. Galera has protection against this: if the number of transactions continues to grow beyond certain hard limits, it forces a purge.
  • 94. 94 Scenario: Delayed purging trx map size: 16511 - check if status.last_committed is incrementing purging index up to 11264 releasing seqno from gcache 11264 Built-in mechanism to force purge.
  • 95. 95 Scenario: Delayed purging trx map size: 16511 - check if status.last_committed is incrementing purging index up to 11264 releasing seqno from gcache 11264 Purging can get delayed, but it does not halt.
  • 97. 97 Scenario: Delayed purging Gcache STOP processing transactions. Purging means these entries are removed from the Galera-maintained purge array. (Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)
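These knobs live in wsrep_provider_options; a hedged my.cnf sketch (the values are illustrative, not recommendations):

```ini
[mysqld]
# gcache.size: ring-buffer size; keep_pages_size/keep_pages_count: how many
# overflow page files to retain on disk after purging (0 = remove when unused)
wsrep_provider_options = "gcache.size=1G;gcache.keep_pages_size=0;gcache.keep_pages_count=0"
```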
  • 98. 98 Scenario: Delayed purging All nodes should have the same configuration. Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt. Monitor that each node is making progress by keeping watch on wsrep_last_applied/wsrep_last_committed.
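Progress can be watched with the status counters named above; a stalled value on one node eventually stalls cluster-wide purging:

```sql
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';
SHOW GLOBAL STATUS LIKE 'wsrep_last_applied';
-- A growing receive queue is another early sign that a node is falling behind
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
```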
  • 99. 99 Scenario: Network latency and related failures
  • 100. 100 Scenario: Network latency and related failures
  • 101. 101 Scenario: Network latency and related failures
  • 102. 102 Scenario: Network latency and related failures Why? What caused this weird behavior?
  • 103. 103 Scenario: Network latency and related failures
  • 104. 104 Scenario: Network latency and related failures The cluster is neither completely down nor completely up. What’s going on? What is causing this weird behavior?
  • 105. 105 Scenario: Network latency and related failures All my writes go to a single node; why am I still getting this conflict?
  • 106. 106 Scenario: Network latency and related failures All nodes are able to reach each other.
  • 107. 107 Scenario: Network latency and related failures If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both.
  • 108. 108 Scenario: Network latency and related failures If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both.
  • 109. 109 Scenario: Network latency and related failures The said node has a flaky network connection, or rather, higher latency.
  • 110. 110 Scenario: Network latency and related failures Each node monitors the other nodes of the cluster every inactive_check_period (0.5 s). If a node is not reachable from a given node past peer_timeout (3 s), the cluster enables relaying of messages. If all nodes vote for the said node’s inactivity (suspect_timeout, 5 s), it is pronounced DEAD. While suspect_timeout needs consensus, inactive_timeout (15 s) doesn’t: if the node doesn’t respond, it is marked DEAD. If a node detects a delayed response from a given node, it tries to add it to the delayed list, waiting delayed_margin (1 s) before doing so. Even if the node becomes active again, it takes delayed_keep_period (30 s) to remove it from the list.
  • 111. 111 Scenario: Network latency and related failures The same timers: delayed-list handling (delayed_margin, delayed_keep_period) and monitoring/death detection (inactive_check_period, peer_timeout, suspect_timeout, inactive_timeout). Runtime configurable.
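These EVS timers are set through wsrep_provider_options; a sketch spelling out the defaults quoted above as ISO-8601 durations (values illustrative — tune them per your network, and check which are runtime-settable in your Galera version):

```ini
wsrep_provider_options = "evs.inactive_check_period=PT0.5S;evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S;evs.delayed_margin=PT1S;evs.delayed_keep_period=PT30S"
```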
  • 112. 112 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec
  • 113. 113 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec
  • 114. 114 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec. Start a sysbench workload.
  • 115. 115 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 7 sec; n2-n3 7 sec. Start a sysbench workload. Given that the RTT between n1 and n3 is 7 sec, each trx needs 7 sec to complete, even though it gets an ACK from n2 in < 1 ms.
  • 116. 116 Scenario: Network latency and related failures #1
  • 117. 117 Scenario: Network latency and related failures ● TPS hits 0 for 5 secs and then resumes. #1
  • 118. 118 Scenario: Network latency and related failures ● TPS hits 0 for 5 secs and then resumes. ● This is because the trx is waiting for an ACK from n3, which would take 7 sec; in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs. #1
  • 119. 119 Scenario: Network latency and related failures ● This temporarily makes the complete cluster unavailable. ● Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency. ● Of course, a latency of 7 sec is not realistic. #1
  • 120. 120 Scenario: Network latency and related failures #2
  • 121. 121 Scenario: Network latency and related failures Latency: n1-n2 < 1 ms; n1-n3 2 sec; n2-n3 2 sec
  • 122. 122 Scenario: Network latency and related failures ● This time we reduced the latency from 7 to 2 sec. Because of this, every 2 sec (less than the 5-sec suspect_timeout) there was some communication with the node, and this prevented n3 from being marked DEAD. ● After 10 secs we reverted the latency to its original value, so the snag is seen for 10 secs. #2
  • 123. 123 Scenario: Network latency and related failures All my writes go to a single node; why am I still getting this conflict? #3
  • 124. 124 Scenario: Network latency and related failures Because when the view changes, the initial position is re-assigned, thereby purging history from the certification index. A follow-up transaction in certification that depends on an old trx (that got purged) faces this conflict. #3
  • 125. 125 Scenario: Network latency and related failures The farthest node dictates how the cluster operates, so latency is important. A geo-distributed cluster has milli-sec latencies, so timeouts should be configured to avoid marking a node UNSTABLE due to the added latency. For a geo-distributed cluster, segment and window settings are other params to configure. Flaky nodes are not good for overall transaction processing (they can cause certification failures).
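For a geo-distributed setup, a hedged my.cnf sketch of the segment and window settings mentioned above (the segment number identifies this node's data center; timeout and window values are illustrative):

```ini
[mysqld]
# One gmcast.segment value per DC keeps most replication traffic local to the segment
wsrep_provider_options = "gmcast.segment=1;evs.suspect_timeout=PT30S;evs.send_window=1024;evs.user_send_window=512"
```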
  • 126. 126 Scenario: Blocking Transaction and related failures
  • 127. 127 Scenario: Blocking Transaction and related failures ● Failed to load a table with N rows.
  • 128. 128 Scenario: Blocking Transaction and related failures ● Failed to load a table with N rows. ● Why? ○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster. ○ The current limit allows a transaction of size 2G (controlled through wsrep_max_ws_size). But have you ever wondered why that is a limitation?
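The limit is visible and tunable; lowering it below the 2G hard maximum is a sketch of defensive tuning, not a recommendation:

```sql
SHOW VARIABLES LIKE 'wsrep_max_ws_size';
-- Illustrative: reject write-sets above 1G instead of the 2G maximum
SET GLOBAL wsrep_max_ws_size = 1073741824;
```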
  • 129. 129 Scenario: Blocking Transaction and related failures execute -> prepare -> replicate -> commit
  • 130. 130 Scenario: Blocking Transaction and related failures execute -> prepare -> replicate -> commit. The transaction first executes on the local node; during this execution it doesn’t block other non-dependent transactions. The transaction replicates after it has been executed on the local node but before it is committed. Replication involves transporting the write-set (binlog) to other nodes.
  • 131. 131 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit
  • 132. 132 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all nodes.
  • 133. 133 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. This means that even though the transactions following the largest transaction are non-dependent and have completed the APPLY action before it, they can’t commit.
  • 134. 134 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. This means that even though the transactions following the largest transaction are non-dependent and have completed the APPLY action before it, they can’t commit.
  • 135. 135 Scenario: Blocking Transaction and related failures N1: execute -> prepare -> replicate -> commit; N2: apply -> commit. The bigger the transaction, the bigger the backlog of small transactions; this eventually causes FLOW_CONTROL.
  • 136. 136 Scenario: Blocking Transaction and related failures
  • 137. 137 Scenario: Blocking Transaction and related failures
  • 138. 138 Scenario: Blocking Transaction and related failures The first snag appears when the originating node blocks all resources to replicate a long-running transaction. The second snag appears when the replicating node emits flow control.
  • 139. 139 Scenario: Blocking Transaction and related failures PXC doesn’t like long-running transactions. For loading data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed. DDL can block/stall the complete cluster workload, as it needs to execute in total isolation. (The alternative is RSU, but be careful, as it is an operation local to the node.)
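The intermediate-commit behavior mentioned above is governed by wsrep_load_data_splitting (ON by default in PXC; consult your version's docs, as the option has been deprecated in newer releases). The file path below is a placeholder:

```sql
SHOW VARIABLES LIKE 'wsrep_load_data_splitting';
-- With splitting ON, this commits every 10K rows instead of one huge write-set
LOAD DATA INFILE '/tmp/rows.csv' INTO TABLE t1;
```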
  • 140. 140 One last important note ● The majority of errors are due to misconfiguration or differences in the configuration of nodes. ● PXC recommends the same configuration on all nodes of the cluster.
  • 141. PXC Genie: You Wish. We implement
  • 142. 142 PXC Genie: You Wish. We Implement ● We’d like to hear from you: what do you want next in PXC? ● Any specific module where you expect improvement? ● How can Percona help you with PXC or HA? ● Log issues (mark them as new improvements): https://jira.percona.com/projects/PXC/issue ● The PXC forum is another way to reach us.