Multi-master replication for Postgres
K. Knizhnik, C. Pan, S. Kelvich
Contents
Design objectives
Implementation/internals
Tests
Configuration
Roadmap
Design objectives
We want:
Fault tolerance the easy way
OLTP-style load
Compatibility with standalone Postgres
Possibility of reuse as metadata storage for a sharded cluster
Replication:
Identical replicated data on all nodes
Possibility to have local tables
Writes allowed to any node
=> Easy to use
=> We need to take care of proper isolation
Transaction manager. We want to:
Avoid a single point of failure.
+: Spanner, Cockroach, Clock-SI
-: Pg-XL, ...
Avoid network communication for read-only transactions.
+: HANA, Spanner, Cockroach, Clock-SI
-: Pg-XL, ...
Fault tolerance.
Paxos. Distributed consensus, low level.
Raft. Complete state-machine replication solution with a timeout-based failure detector and automatic recovery. But all writes are proxied to one node.
2PC. Blocks in case of node and coordinator failure. Postgres already supports 2PC.
3PC-like. Extra message between "P" and "C". 3PC, Paxos commit, E3PC.
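The blocking behaviour of plain 2PC can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not the mmts protocol: the `Participant` class, the `run_2pc` function and the node names are hypothetical, and there is no network or durability involved.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Toy participant; it only tracks its transaction state."""

    def __init__(self, name):
        self.name = name
        self.state = State.IDLE

    def prepare(self):
        # Phase 1: vote yes and promise to commit if told to.
        self.state = State.PREPARED
        return True

    def commit(self):
        self.state = State.COMMITTED

    def abort(self):
        self.state = State.ABORTED


def run_2pc(participants, coordinator_crashes_after_prepare=False):
    """Run two-phase commit; optionally lose the coordinator between phases."""
    votes = [p.prepare() for p in participants]
    if coordinator_crashes_after_prepare:
        # No decision is ever delivered: participants stay PREPARED.
        return
    for p in participants:
        if all(votes):
            p.commit()
        else:
            p.abort()


nodes = [Participant("node1"), Participant("node2"), Participant("node3")]
run_2pc(nodes, coordinator_crashes_after_prepare=True)
blocked = [p.name for p in nodes if p.state is State.PREPARED]
print(blocked)
```

A prepared participant cannot safely decide on its own, which is the gap that the 3PC-like protocols try to close with the extra message between the two phases.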
Summary.
No performance penalty for reads.
Tx can be issued to any node.
No special actions required in case of failure.
Implementation
github.com/postgrespro/postgres_cluster
Patched version of Postgres 9.6:
Transaction Manager API + Deadlock detection API.
Logical decoding of 2PC transactions.
Mmts extension:
Transaction Manager implementation (Clock-SI).
Logical replication protocol/client.
Hooks into transaction commit and transforms it into 2PC.
A bunch of bgworkers.
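To make "transforms it into 2PC" concrete, the sketch below lists the SQL that a plain COMMIT is conceptually replaced with. `PREPARE TRANSACTION` and `COMMIT PREPARED` are standard PostgreSQL commands; the `commit_as_2pc` helper and the identifier format are assumptions made for illustration and are not taken from the mmts source.

```python
def commit_as_2pc(gid):
    """Return the SQL a plain COMMIT is conceptually replaced with.

    The gid (global transaction identifier) format is up to the caller;
    the one used in the example below is made up.
    """
    if "'" in gid:
        raise ValueError("gid must not contain single quotes")
    return [
        # Phase 1: the transaction becomes durable but stays undecided.
        f"PREPARE TRANSACTION '{gid}'",
        # Phase 2: in this sketch, issued once the other nodes have agreed.
        f"COMMIT PREPARED '{gid}'",
    ]


statements = commit_as_2pc("example-gid-42")  # hypothetical identifier
for sql in statements:
    print(sql)
```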
Mmts uses logical replication/decoding.
In-core support and extension by 2ndQuadrant.
Very flexible:
Can skip tables
Replication between different versions
Logical messages
BE – backend, WS – Walsender, Arb – Arbiter, WR – Walreceiver
Transaction Manager.
Clock-SI algorithm (Microsoft Research).
Uses CSNs (commit sequence numbers) instead of lists of running transactions. (We track the xid-CSN correspondence in the extension, but there is ongoing work by Heikki and Alexander to have CSN in core.)
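The idea of replacing a list of running transactions with a single number can be sketched as follows. This is a simplified model, not the extension's code: the `CsnLog` class is hypothetical, CSNs come from a local counter rather than from clocks, and the waiting rules of Clock-SI are left out.

```python
class CsnLog:
    """Toy xid -> CSN map; None means the transaction has not committed."""

    def __init__(self):
        self._csn_by_xid = {}
        self._next_csn = 1

    def begin(self, xid):
        self._csn_by_xid[xid] = None

    def commit(self, xid):
        # Assign the next commit sequence number to this transaction.
        self._csn_by_xid[xid] = self._next_csn
        self._next_csn += 1

    def take_snapshot(self):
        # The whole snapshot is one number: the latest CSN assigned so far.
        return self._next_csn - 1

    def is_visible(self, xid, snapshot_csn):
        csn = self._csn_by_xid.get(xid)
        # Visible only if committed at or before the snapshot
        # (the <= convention is a choice made for this sketch).
        return csn is not None and csn <= snapshot_csn


log = CsnLog()
log.begin(100)
log.begin(101)
log.commit(100)
snap = log.take_snapshot()
log.commit(101)  # commits after the snapshot was taken
print(log.is_visible(100, snap), log.is_visible(101, snap))
```

A visibility check becomes a single comparison per transaction, and a snapshot is cheap to pass between nodes because it is just one value.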
DDL replication.
Statement-based.
Happily, Postgres supports 2PC for almost all DDL (ALTER ENUM is already fixed in master).
CREATE TABLE AS, CREATE MATERIALIZED VIEW, etc. are tricky: they mix DDL and DML.
Temp tables are tricky: they shouldn't be replicated.
Depends on the environment (search_path, auth, etc.).
Postgres compatibility.
Almost FULLY compatible with Postgres.
162 of 166 regression tests pass as is.
1 test uses a prepared statement inside CREATE TABLE AS (CTA).
3 tests use CTA(CTA(TEMP TABLE)).
There are some obvious ways to abuse statement-based replication, e.g. writing a function that creates a table with a name based on the current timestamp.
Sequences can also add pain.
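The timestamp example can be shown without a database. In this sketch the hypothetical `create_table_statement` function stands in for a user-defined function, and the one-second apply delay on the second node is an assumed value.

```python
from datetime import datetime, timedelta


def create_table_statement(now):
    """Mimic a function that names a table after the current timestamp."""
    return f"CREATE TABLE log_{now:%Y%m%d_%H%M%S} (id int)"


# The same replicated statement is executed on each node at its own time.
origin_time = datetime(2016, 1, 1, 12, 0, 0)
replica_time = origin_time + timedelta(seconds=1)  # assumed apply delay

on_origin = create_table_statement(origin_time)
on_replica = create_table_statement(replica_time)
print(on_origin)
print(on_replica)
print(on_origin == on_replica)
```

Because the statement is re-executed rather than its result being shipped, the two nodes end up with differently named tables.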
Automatic recovery: normal work
Automatic recovery: network split
Automatic recovery: recovery process
Automatic recovery: normal work again
Configuration
Not that hard:
Install mmts extension
Postgres:
max_prepared_transactions
wal_level = logical
max_worker_processes, max_replication_slots, max_wal_senders
shared_preload_libraries = 'multimaster'
Multimaster extension:
multimaster.node_id = ...
multimaster.conn_strings = '...'
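A small preflight check can catch the most common omissions from the list above before a node is started. This is a sketch only: `check_settings` is a hypothetical helper that works on a plain dict, it does not read a real postgresql.conf, and it checks only that the settings named on the slide are present or have the stated value.

```python
def check_settings(settings):
    """Return a list of problems found in a dict of postgresql.conf settings."""
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    # 2PC needs prepared transactions; the PostgreSQL default of 0 disables them.
    if int(settings.get("max_prepared_transactions", 0)) <= 0:
        problems.append("max_prepared_transactions must be > 0 for 2PC")
    libs = [s.strip() for s in settings.get("shared_preload_libraries", "").split(",")]
    if "multimaster" not in libs:
        problems.append("shared_preload_libraries must include 'multimaster'")
    for name in ("max_worker_processes", "max_replication_slots", "max_wal_senders"):
        if name not in settings:
            problems.append(f"{name} is not set explicitly")
    for name in ("multimaster.node_id", "multimaster.conn_strings"):
        if name not in settings:
            problems.append(f"{name} is missing")
    return problems


# Example input with made-up values, as a stock installation might have them.
example = {
    "wal_level": "replica",
    "max_prepared_transactions": "0",
    "shared_preload_libraries": "pg_stat_statements",
}
for problem in check_settings(example):
    print(problem)
```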
Testing
We want:
Test cluster liveness against network problems, restarts, time shifts, etc.
Sounds like Jepsen. But unfortunately it uses ssh on pre-created VMs/servers. That's okay for a single test, but painful for CI.
No sane way of testing a network split between processes, so the Postgres TAP test framework is not helpful here.
So we are using Python unittest with Docker.
3-5 containers are way faster to start than VMs.
It takes 10 seconds to compile the mmts extension, init and start the cluster.
Failure injection via docker exec (iptables, time shifts, etc.).
Compatible with Travis CI.
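Failure injection through `docker exec` and iptables might look roughly like this. The sketch only builds the command lines and does not run them; `partition_commands` is a hypothetical helper, and the container name and IP addresses are placeholders.

```python
import shlex


def partition_commands(container, peer_ips):
    """Build `docker exec` commands that drop traffic to and from peers."""
    commands = []
    for ip in peer_ips:
        # Drop packets arriving from the peer and packets sent to it.
        for chain, flag in (("INPUT", "-s"), ("OUTPUT", "-d")):
            commands.append(
                ["docker", "exec", container,
                 "iptables", "-A", chain, flag, ip, "-j", "DROP"]
            )
    return commands


# Container name and addresses are placeholders for this sketch.
for argv in partition_commands("node1", ["172.18.0.3", "172.18.0.4"]):
    print(shlex.join(argv))
```

To actually apply the rules, each list could be passed to `subprocess.run`; the container needs the NET_ADMIN capability for iptables to work inside it.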
Testing itself: attach clients to each node of the cluster and start abusing nodes.
Client: bank-like test case. Transfer money between accounts with a concurrent total balance calculation.
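The invariant behind the bank-like test is that the total balance never changes, whatever transfers are in flight. The sketch below shows that invariant with SQLite as a stand-in so that it runs without a cluster; it is single-threaded, and the table layout, account count and amounts are assumptions, not the authors' test code.

```python
import random
import sqlite3

ACCOUNTS = 10
INITIAL = 100

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, INITIAL) for i in range(ACCOUNTS)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts in a single transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET amount = amount - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET amount = amount + ? WHERE id = ?",
                     (amount, dst))


def total(conn):
    """Total balance across all accounts; must stay constant."""
    return conn.execute("SELECT sum(amount) FROM accounts").fetchone()[0]


rng = random.Random(0)
totals = set()
for _ in range(100):
    src, dst = rng.sample(range(ACCOUNTS), 2)
    transfer(conn, src, dst, rng.randint(1, 10))
    totals.add(total(conn))
print(totals)
```

In the real test the reader runs concurrently on another node, so any total other than the initial one points to an isolation or replication bug.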
Failures injected:
Node stop-start
Node kill-start
Node in network partition
Edge network split (a.k.a. majority rings)
Shift time
Change clock speed on nodes with libfaketime *
* – not yet implemented.
Performance.
Read-only tx speed is the same as in standalone Postgres.
Commit takes more time (two network round trips).
Logical decoding slows down big transactions, but that should be fixed: a patch is on the commitfest.
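The cost of the two round trips can be estimated with a back-of-the-envelope model. This is only a rough sketch: `estimated_commit_latency_ms` is a hypothetical helper, and the local commit time and round-trip times are made-up inputs, not measurements.

```python
def estimated_commit_latency_ms(local_commit_ms, rtt_ms, round_trips=2):
    """Rough model: local commit work plus the extra network round trips."""
    return local_commit_ms + round_trips * rtt_ms


# Assumed round-trip times: same rack, same datacenter, cross-region.
for rtt in (0.2, 1.0, 10.0):
    print(rtt, estimated_commit_latency_ms(1.0, rtt))
```

Under this model the commit overhead grows linearly with the network round-trip time between nodes, while read-only transactions are unaffected.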
Roadmap
Release a public beta
Try to commit the two-phase decoding patch to Postgres
Try to commit the transaction manager patch to Postgres
Raise a discussion about replication/decoding of catalog content

