PostgreSQL: Replication slot supporting failover

Logical decoding and logical replication are getting more and more attention in the PostgreSQL ecosystem. That means they need to work well with production HA setups, and here there is a problem: replication slots are not synchronized to the standby, so once the primary fails, the original slots cannot be used after the standby is promoted.

There is a patch (failover replication slots) that changes this by synchronizing slot creation and updates to the standby through WAL archiving or streaming replication. This lets a logical decoding client switch seamlessly to the newly promoted server after a failover and continue replaying without any consistency issues.

Logical decoding and replication slots

Logical decoding, introduced in 9.4, lets a client stream changes row by row, in a consistent transaction-commit order, to a receiver. The receiver can be another PostgreSQL instance or an application of your choice, such as a message queue or a search engine.

To stream data, the client connects to a replication slot on the server. The slot ensures that the server retains the WAL needed for decoding and, for logical slots, also prevents the removal of old versions of system catalog rows that may be needed to interpret that WAL.
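To make the retention rule concrete, here is a toy model in Python. It is only a sketch under loud assumptions: WAL positions are plain integers, and `recycle_wal`, `app_slot` and `slow_slot` are names made up for illustration. It is not how PostgreSQL implements this; real servers also keep WAL for other reasons, such as checkpoints.

```python
def recycle_wal(wal_positions, slot_restart_positions):
    """Return the WAL positions a toy server keeps after recycling.

    wal_positions: WAL positions currently on disk (plain integers here).
    slot_restart_positions: slot name -> oldest position that slot still needs.
    """
    if not slot_restart_positions:
        # Toy simplification: with no slots, nothing pins old WAL.
        return []
    # The slowest slot decides how far back WAL must be kept.
    keep_from = min(slot_restart_positions.values())
    return [p for p in wal_positions if p >= keep_from]


wal = [1, 2, 3, 4, 5, 6]
print(recycle_wal(wal, {"app_slot": 4}))                  # [4, 5, 6]
print(recycle_wal(wal, {"app_slot": 4, "slow_slot": 2}))  # [2, 3, 4, 5, 6]
print(recycle_wal(wal, {}))                               # []
```

The last call is the case that matters for the rest of this article: a server that does not know about the slot has no reason to keep the WAL the client still needs.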

The failover problem

Most production PostgreSQL deployments rely on streaming replication and/or WAL-archive-based replication ("physical replication") as part of their high-availability and failover setup. Unlike most server state, replication slots are not replicated from the primary to its replicas. When the primary fails and a replica is promoted, the primary's replication slots are lost. With no slot to connect to, logical replication clients cannot continue streaming.

No need to panic, you might think: just create a slot with the same name on the replica.

Things are not that simple. A logical replication slot can only be created looking forward: it starts at the database's position at the moment of creation and cannot reach back into history. So if the client had not fully caught up on the old primary, or if creating the replacement slot is not the very first thing that happens on the new primary, the client will miss some changes. This is exactly the problem slots are supposed to prevent, and for some applications it is critical.
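The size of that gap is easy to picture with a small sketch. Everything here is an assumption for illustration: positions are plain integers rather than real LSNs, and `missed_changes` is a made-up helper, not a PostgreSQL function.

```python
def missed_changes(confirmed_on_old_primary, replacement_slot_start):
    """Positions the client never sees after a failover.

    confirmed_on_old_primary: last position the client confirmed before the crash.
    replacement_slot_start: first position the freshly created slot can deliver.
    """
    return list(range(confirmed_on_old_primary + 1, replacement_slot_start))


# Client had confirmed up to 4; the replacement slot starts at 9.
print(missed_changes(4, 9))  # [5, 6, 7, 8]

# Only a fully caught-up client with an immediately created slot loses nothing.
print(missed_changes(8, 9))  # []
```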

The two key reasons a slot cannot be created "back in time" are WAL retention and, for logical slots, vacuuming of the system catalogs. We also cannot export a snapshot of the past, but that only affects new clients. WAL retention is the simpler of the two: the standby may already have discarded WAL segments containing changes that the logical decoding client on the primary had not yet replayed. After a failover, those changes can no longer be replayed to the client.

The other problem is the system catalogs. Logical decoding needs the historical definitions of tables, types, and so on to interpret the data in the WAL. To get them, it reads dead (deleted) catalog rows, a form of MVCC-based time travel.

If VACUUM on the primary removes those dead rows and marks their space as reusable, the standby replays that cleanup too, and the contents of the older WAL can no longer be interpreted.
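The catalog side can be sketched the same way. This is a deliberately simplified model: `vacuum_catalog` and the row descriptions are hypothetical, transaction IDs are plain integers, and the horizon stands in for the `catalog_xmin` value that a logical slot reports in `pg_replication_slots`. Real VACUUM also takes running transactions and other horizons into account.

```python
def vacuum_catalog(dead_rows, slot_catalog_xmins):
    """Return the dead catalog rows that survive a toy VACUUM.

    dead_rows: list of (description, xid_that_deleted_the_row).
    slot_catalog_xmins: oldest catalog xid each logical slot still needs.
    """
    if not slot_catalog_xmins:
        # No slot is holding anything back, so every dead row is removed.
        return []
    horizon = min(slot_catalog_xmins)
    # Rows deleted at or after the horizon may still be needed for decoding.
    return [row for row in dead_rows if row[1] >= horizon]


dead = [("old definition of table t", 100), ("old definition of type u", 250)]
print(vacuum_catalog(dead, [200]))  # [('old definition of type u', 250)]
print(vacuum_catalog(dead, []))     # []
```

Once a row has been removed this way, creating a slot afterwards does not bring it back, which is why a replacement slot cannot decode older WAL.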

Failover slots

Failover slots solve these problems by synchronizing slot creation, deletion, and position updates to the replicas. This is done through the WAL stream, just like other operations. If a failover slot is created (by turning on the failover option of pg_create_logical_replication_slot), its creation is recorded in the WAL. The same goes for subsequent position updates, which happen when the client tells the server that it can release resources it no longer needs.
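As a rough illustration of "slot changes travel through WAL", the sketch below replays a list of slot records on a standby. The record kinds (`slot_create`, `slot_advance`, `slot_drop`), the slot names, and the integer positions are all invented for this example; they are not the patch's actual WAL record format.

```python
def replay(wal_records):
    """Apply slot-related records on a toy standby and return its slot table."""
    slots = {}
    for kind, name, position in wal_records:
        if kind in ("slot_create", "slot_advance"):
            # Creation and every later position update arrive the same way.
            slots[name] = position
        elif kind == "slot_drop":
            slots.pop(name, None)
    return slots


primary_wal = [
    ("slot_create", "app_slot", 10),
    ("slot_advance", "app_slot", 14),
    ("slot_create", "tmp_slot", 15),
    ("slot_drop", "tmp_slot", None),
]

standby_slots = replay(primary_wal)
print(standby_slots)  # {'app_slot': 14}
```

Because the standby ends up with the same slot at the same position, it can keep the same WAL and catalog rows the primary was keeping for that client.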

If the primary fails and the standby is promoted, the logical decoding client can reconnect to the new primary and carry on as if nothing had happened.

If a DNS update or IP switch is performed at the same time as the promotion, the logical decoding client will notice nothing more than what looks like a restart of the primary.

About physical slots

PostgreSQL's block-level ("physical") replication also has replication slots. They can be used to pin WAL retention, which is more fine-grained than wal_keep_segments (replaced by wal_keep_size in PostgreSQL 13), but the retention is also unbounded. Physical failover slots can be created just like logical failover slots.

Does this mean pglogical can be used to fail over to a logical replica?

Failover slots were not designed to support failover to a logical replica. They exist so that logical replication can follow a physical failover.

Supporting failover to a logical replica is a completely separate problem. PostgreSQL core has some relevant limitations, such as the current lack of logical decoding for sequence advancement. Failover slots neither help nor hinder here. They simply provide a way to integrate logical replication into HA solutions and into existing, mature infrastructure.

Origin blog.csdn.net/yang_z_1/article/details/112620005