Running GTID replication in production
Balazs Pocze, DBA@Gawker
Peter Boros, Principal Architect@Percona
Who we are
• Gawker Media: World’s largest independent media company
• We get about 67M monthly unique visitors in the US and about 107M worldwide.
• We run on the Kinja platform (http://kinja.com)
• Our blogs are: gawker.com, jalopnik.com, jezebel.com, lifehacker.com,
gizmodo.com, deadspin.com, io9.com, kotaku.com
GTID replication in a nutshell
Traditional replication
● The MySQL server writes the changes to be replicated to the binary log
● A replica needs to know at least the following:
• Binary log file name
• Position in binary log
Traditional replication caveats
● If a replica has to be moved to a different master, you have to
ensure that all of the servers are in the same state.
● You have to stop all writes (SET GLOBAL read_only and SET GLOBAL
super_read_only), then reposition the replica on the new master
using the coordinates from SHOW MASTER STATUS … . Once the replica
has been moved, writes can be resumed.
GTID in a nutshell
● Behind the scenes it is the same as traditional replication
• filename, position
● GTID replication
• uuid:seqno
GTID in a nutshell
● UUID identifies the server
● Seqno identifies the Nth transaction from that server
• This is harder to find than just seeking to a byte offset
• The binlog containing the GTID needs to be scanned
● Nodes can be repositioned easily (anywhere in replication hierarchy)
Gawker’s environment
● We operate in two datacenters
● One of the DCs is ‘more equal’ than the other: it hosts the ‘active
master’
○ All the writes happen there, and it replicates to the other master, as
well as to the secondaries
● The replicas have to be movable between the masters and the
secondaries ...
Gawker’s environment
http://bit.ly/kinjadb
Gawker’s environment
● We don’t fix broken MySQL instances: when there is a problem, we drop
them and recreate them as fresh clones
● Backup and slave creation use the same codebase
● All operations are defined as Ansible playbooks, which are called from
Python wrappers and can be managed from a Jenkins instance
○ Backup
Pre-GTID master maintenance
● Put the site into read-only mode (this takes time!)
● Put the database into read-only mode (SET GLOBAL READ_ONLY=1)
● On the secondary master: SHOW MASTER STATUS (2x)
● On replicas: CHANGE MASTER TO …
● Fail over applications to write to the new master
GTID master maintenance
● On replicas: CHANGE MASTER TO … MASTER_AUTO_POSITION=1
● Fail over applications to write to the new master*
*At this point we still have to disable writes, but only for about 30 seconds
Caveats
Common failures with GTID: Errant transactions
● The executed GTID set of the node about to be promoted is not a subset
of the current master’s executed GTID set
• Use GTID_SUBSET() in pre-flight checks
• mysql> select gtid_subset(
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as slave_is_subset;
+-----------------+
| slave_is_subset |
+-----------------+
|               0 |
+-----------------+
Common failures with GTID: Errant transactions
● select gtid_subtract(
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as errant_transactions;
+------------------------------------------+
| errant_transactions                      |
+------------------------------------------+
| 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
+------------------------------------------+
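To make the set arithmetic behind these two checks concrete, here is a minimal Python sketch that models what GTID_SUBSET() and GTID_SUBTRACT() compute for the example sets above. It is an illustration only, not MySQL's implementation: the helper names are made up for this sketch, it assumes normalized sets as the server prints them, and it returns a Python bool where MySQL returns 1 or 0.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-29,uuid:35' into {uuid: [(first, last), ...]} (inclusive)."""
    parsed = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = parsed.setdefault(uuid.lower(), [])
        for interval in intervals:
            first, _, last = interval.partition("-")
            ranges.append((int(first), int(last or first)))
    return {uuid: sorted(ranges) for uuid, ranges in parsed.items()}


def subtract_ranges(a, b):
    """Return the parts of the intervals in a that no interval in b covers."""
    out = []
    for first, last in a:
        cur = first
        for b_first, b_last in b:
            if b_last < cur or b_first > last:
                continue  # no overlap with the remaining part of this interval
            if b_first > cur:
                out.append((cur, b_first - 1))
            cur = max(cur, b_last + 1)
            if cur > last:
                break
        if cur <= last:
            out.append((cur, last))
    return out


def gtid_subtract(set1, set2):
    """Model of GTID_SUBTRACT(): GTIDs that are in set1 but not in set2."""
    a, b = parse_gtid_set(set1), parse_gtid_set(set2)
    parts = []
    for uuid in sorted(a):
        left = subtract_ranges(a[uuid], b.get(uuid, []))
        if left:
            intervals = ":".join(
                str(f) if f == l else f"{f}-{l}" for f, l in left)
            parts.append(f"{uuid}:{intervals}")
    return ",".join(parts)


def gtid_subset(set1, set2):
    """Model of GTID_SUBSET(): True if every GTID in set1 is also in set2."""
    return gtid_subtract(set1, set2) == ""


# The two executed GTID sets from the slides above.
candidate = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,"
             "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9")
master = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,"
          "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7")

print(gtid_subset(candidate, master))    # False
print(gtid_subtract(candidate, master))  # 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9
```

Working on intervals rather than expanding every sequence number keeps the sketch usable on sets with billions of transactions, like the ones in a production gtid_executed.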
Common failures with GTID: you still have to tell the slave where to start replication after a rebuild
● CHANGE MASTER TO … MASTER_AUTO_POSITION=1 is not enough.
● First you have to SET GLOBAL gtid_purged=... (xtrabackup provides this
information, because xtrabackup is our friend!)
[root@slave01.xyz /var/lib/mysql]# cat xtrabackup_slave_info
SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,
cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732,
e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
Common failures with GTID: Server UUID changes after rebuild
● Xtrabackup doesn’t back up auto.cnf
• This is good for rebuilding slaves
• Can come as a surprise if the master is restored
● A write on a freshly rebuilt node introduces a new UUID
• The workaround is to restore the UUID in auto.cnf as well
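As a rough illustration of the workaround, the sketch below reads and rewrites the server UUID in an auto.cnf-style file using Python's configparser. It assumes the file has a single [auto] section with a server-uuid entry and accepts the underscore spelling too. The function names and the temporary directory are made up for the example. Since the server reads auto.cnf at startup, a real restore would be done while mysqld is stopped.

```python
import configparser
import os
import tempfile


def read_server_uuid(path):
    """Return the server UUID stored in an auto.cnf-style file."""
    parser = configparser.ConfigParser()
    parser.read(path)
    section = parser["auto"]
    # Accept both spellings of the option name (assumption: either may appear).
    return section.get("server-uuid") or section.get("server_uuid")


def write_server_uuid(path, server_uuid):
    """Write an auto.cnf-style file carrying the given server UUID."""
    parser = configparser.ConfigParser()
    parser["auto"] = {"server-uuid": server_uuid}
    with open(path, "w") as handle:
        parser.write(handle, space_around_delimiters=False)


# Demo against a throwaway directory standing in for the MySQL datadir.
with tempfile.TemporaryDirectory() as datadir:
    auto_cnf = os.path.join(datadir, "auto.cnf")
    # UUID of the original master, saved before the rebuild (example value).
    write_server_uuid(auto_cnf, "b0bb2e56-6121-11e5-9e7c-d73eafb37531")
    restored = read_server_uuid(auto_cnf)

print(restored)
```

Saving the UUID next to the backup is one way to have it at hand when a master is restored. Whether that fits a given backup pipeline is a design choice this sketch does not settle.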
Common failures with GTID: There are transactions where you don’t expect them
● Crashed table repairs will appear as new transactions
• STOP SLAVE
• RESET MASTER
• SET GLOBAL gtid_purged=<...>
• START SLAVE
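The four steps above can be generated from a known-good gtid_purged value. The sketch below is one hypothetical way to do it in Python: it pulls the value out of text shaped like the xtrabackup_slave_info file shown earlier and emits the statements in the slide's order. The function names are illustrative. Note that RESET MASTER removes the binary logs on the server it runs on, so this is only meant for a slave being repaired or rebuilt.

```python
import re


def extract_gtid_purged(slave_info_text):
    """Return the gtid_purged value from xtrabackup_slave_info-style text."""
    match = re.search(r"gtid_purged\s*=\s*'([^']*)'", slave_info_text,
                      re.IGNORECASE)
    if match is None:
        raise ValueError("no gtid_purged assignment found")
    # The value may be wrapped across lines; drop all whitespace.
    return "".join(match.group(1).split())


def gtid_reset_statements(gtid_purged):
    """Build the statement sequence from the slide for a given GTID set."""
    return [
        "STOP SLAVE",
        "RESET MASTER",  # empties gtid_executed and deletes this server's binlogs
        f"SET GLOBAL gtid_purged='{gtid_purged}'",
        "START SLAVE",
    ]


# Sample input copied from the xtrabackup_slave_info output on the earlier slide.
SLAVE_INFO = """SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,
cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732,
e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
"""

statements = gtid_reset_statements(extract_gtid_purged(SLAVE_INFO))
for statement in statements:
    print(statement + ";")
```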
The not so obvious failure
● We had the opposite of errant transactions: transactions missing from
the slave. We called them GTID holes
• http://bit.ly/gtidholefill
● The corresponding event on the master was empty
● Slaves with such holes couldn’t be repositioned to other masters: they
wanted a transaction that was no longer in the binlogs
The not so obvious failure: takeaways
● The issue was present with non-GTID replication as well
• But was never spotted
● Most of the affected transactions touched frequently updated data, so
they caused only occasional inconsistency
● A good example that it’s harder to lose transactions with GTID replication
● Fix: in mf_iocache2.c:
GTID holes by default: MTS
● The aforementioned GTID holes can be present intermittently at any time
if a multi-threaded slave (MTS) is used
● Having holes in a GTID sequence can be a valid state with MTS and STOP
SLAVE
● We reverted to a non-multithreaded slave during debugging to see whether
the holes left behind were a side effect of MTS, but they weren’t
● (In MySQL 5.6, MTS doesn’t really solve anything if the schemas are
not written equally, which is true in our case)
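A hole shows up in an executed GTID set as more than one interval for the same UUID. As a small sketch of how such holes could be spotted, the Python below lists the gaps between the intervals of each UUID. The function name and the sample set are hypothetical. It only reports gaps between intervals, so it says nothing about whether a hole is the transient MTS kind or the permanent kind described earlier.

```python
def find_gtid_holes(gtid_set):
    """Return {uuid: [(first_missing, last_missing), ...]} for a GTID set."""
    ranges_by_uuid = {}
    for part in gtid_set.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = ranges_by_uuid.setdefault(uuid.lower(), [])
        for interval in intervals:
            first, _, last = interval.partition("-")
            ranges.append((int(first), int(last or first)))

    holes = {}
    for uuid, ranges in ranges_by_uuid.items():
        if not ranges:
            continue
        ranges.sort()
        gaps = []
        covered_up_to = ranges[0][1]
        for first, last in ranges[1:]:
            if first > covered_up_to + 1:
                # Sequence numbers between two intervals are missing.
                gaps.append((covered_up_to + 1, first - 1))
            covered_up_to = max(covered_up_to, last)
        if gaps:
            holes[uuid] = gaps
    return holes


# Hypothetical executed set with two holes: 6-7 and 11.
executed = "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5:8-10:12"
print(find_gtid_holes(executed))
```

Running a check like this twice, some seconds apart, on a stopped slave would be one way to tell a transient hole from one that never fills.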
Practical things
Skipping a replication event
● Injecting empty transactions
● In SQL
• set gtid_next='<gtid>'; begin; commit; set gtid_next='automatic';
• http://bit.ly/gtidholefill
● Some tools exist for this (pt-slave-restart)
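When more than one GTID has to be skipped or filled, the statement sequence above is repeated once per GTID. The following Python sketch generates those statements for a range of sequence numbers. The function name is made up, and the UUID and range are example values taken from the errant-transaction slides, not from a real incident.

```python
def empty_transaction_statements(uuid, first, last):
    """Build one empty transaction per GTID in uuid:first-last (inclusive)."""
    statements = []
    for seqno in range(first, last + 1):
        statements += [
            f"SET gtid_next='{uuid}:{seqno}'",
            "BEGIN",
            "COMMIT",  # commits an empty transaction under the chosen GTID
            "SET gtid_next='AUTOMATIC'",
        ]
    return statements


# Example values only: fill 8e3648e4-...:8-9 with empty transactions.
fill = empty_transaction_statements(
    "8e3648e4-bc14-11e3-8d4c-0800272864ba", 8, 9)
for statement in fill:
    print(statement + ";")
```

Keeping the generator separate from the code that executes the statements makes it easy to review the output before running it against a server.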
Consistency checks: pt-table-checksum
● Checksumming multiple levels of replication with pt-table-checksum
● ROW based replication limitation
● Workaround is checking on individual levels
● That introduces extra writes on second level slaves
• Luckily, GTID is not as error prone regarding consistency issues
Thanks!
