Jingdong Technology's practice in optimizing Redis two-way synchronization across data centers

1. Background

Driven by business growth and strategic planning, the company needs to adopt a unitized deployment across multiple data centers. This both provides multi-data-center disaster recovery and lets user requests be served from the nearest site. Whether the goal is disaster recovery across data centers or nearby access for users, every data center must hold a consistent, full copy of the data. Only when users can both read and write at the nearest site is true remote multi-active achieved. Data synchronization is the foundation of remote multi-active, and it requires two-way synchronization of data between the data centers.

2. Problems encountered with native redis

1. No support for master-master synchronization

Native redis provides no master-master synchronization mechanism across data centers; it only supports master-slave synchronization. With master-slave replication alone, the master and slave nodes can only be deployed in different data centers. When the data center hosting the master fails, the slave can be promoted to master and the application can keep serving traffic. In this mode, however, all writes must go through the master node; the remote data center cannot write locally, so what you get is backup and disaster recovery rather than true remote multi-active. Moreover, failing over between data centers requires manual intervention by operations.

Therefore, to implement master-master synchronization, a synchronization tool is needed that impersonates a slave node, pulling the data written in the local data center and pushing it to the other data centers, with each data center doing the same. Using such a tool to synchronize data across data centers runs into the following problems.

Data loopback

Data loopback means that data written locally in data center A is synchronized to data center B by the synchronization tool, and then synchronized back to A by B's synchronization tool. During synchronization it is therefore necessary to distinguish data written locally from data that arrived from another data center: only locally written data needs to be synchronized outward.

Idempotence

Commands may be synchronized more than once, for example after a resume from a breakpoint. To keep the data consistent in this case, executing the same command multiple times must yield the same result, i.e. commands must be idempotent.

Multiple write conflicts

Take a double-write conflict as an example, as shown in the following figure:

[Figure: remote multi-active double-write conflict]

DC1 writes set a 1 while DC2 simultaneously writes set a 2. After the two commands are synchronized to the opposite data center by the synchronization tool, DC1 ends up storing a = 2 while DC2 ends up storing a = 1; the final data in the two data centers is inconsistent.

2. Breakpoint resume

To handle brief disconnect-and-reconnect events, slave node restarts, and similar scenarios efficiently, redis maintains a ring replication buffer on the master node. Whenever the master sends data to a slave, it also writes a copy of the same data into this buffer. When a slave reconnects after a disconnection, the master only needs to send, out of the replication buffer, the incremental data that accumulated during the disconnection, avoiding a full synchronization and greatly improving efficiency in these scenarios.

However, the in-memory replication buffer cannot be made too large; our current production default is 64M. In a cross-data-center scenario the network environment is complex, and disconnections may be more frequent and last longer than within a single data center. At the same time, cross-data-center synchronization exists precisely for data-center-level disaster recovery, so it must support resuming after much longer interruptions, and growing the in-memory buffer without bound is clearly not a good idea.
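As an illustration, here is a minimal Python sketch of how such a ring replication buffer behaves; the class and method names are ours, not redis internals, and real redis implements this in C:

```python
class ReplBacklog:
    """Minimal sketch of a ring replication buffer (illustrative, not redis source)."""

    def __init__(self, size=64 * 1024 * 1024):   # default mirrors the 64M production setting
        self.size = size
        self.buf = bytearray(size)
        self.master_offset = 0                   # cumulative bytes ever written

    def feed(self, data: bytes):
        # Write into the ring, wrapping around and overwriting the oldest bytes.
        for b in data:
            self.buf[self.master_offset % self.size] = b
            self.master_offset += 1

    def can_partial_sync(self, slave_offset: int) -> bool:
        # A reconnecting slave can resume only if the bytes it missed
        # still fit inside the ring; anything older has been overwritten.
        return self.master_offset - slave_offset <= self.size

    def read_from(self, slave_offset: int) -> bytes:
        # Replay exactly the increment the slave missed while disconnected.
        assert self.can_partial_sync(slave_offset)
        return bytes(self.buf[i % self.size]
                     for i in range(slave_offset, self.master_offset))
```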

Let's now look at the optimization work we did to support redis synchronization across data centers.

3. Redis node transformation

To support remote multi-active scenarios, we optimized and extended the native redis code, mainly in the following areas:

1. Extend the RESP protocol

To support more efficient breakpoint resuming and to solve the data loopback problem, the redis master node assigns an id to every command that must be synchronized to slave nodes (mostly write commands), and the RESP protocol is extended accordingly: a header of the form #{id}\r\n is prepended to each such command.

Data written by local business clients still uses the native RESP protocol; after the master node executes such a command, the protocol is extended and the id header is added before the write command is synchronized to slave nodes. Non-local business clients (that is, data written from other data centers) use the extended RESP protocol directly.
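As a concrete illustration, the following sketch shows what framing and parsing the extended protocol could look like; the #{id}\r\n header comes from the design above, while the helper names are ours:

```python
def encode_extended(cmd_id: int, *args: str) -> bytes:
    """Prepend the #{id}\r\n header to a standard RESP array frame."""
    header = f"#{cmd_id}\r\n".encode()
    body = f"*{len(args)}\r\n".encode()
    for a in args:
        body += f"${len(a.encode())}\r\n{a}\r\n".encode()
    return header + body

def split_extended(frame: bytes):
    """Split an extended frame into (id, resp_bytes); plain RESP yields id=None."""
    if frame.startswith(b"#"):
        head, _, rest = frame.partition(b"\r\n")
        return int(head[1:]), rest
    return None, frame

# e.g. SET a 1, assigned id 12345 by the master, is framed as:
# b'#12345\r\n*3\r\n$3\r\nSET\r\n$1\r\na\r\n$1\r\n1\r\n'
frame = encode_extended(12345, "SET", "a", "1")
```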

2. Logging write commands in real time

To support longer breakpoint resumes and tolerate extended data-center-level failures, write commands from local business clients are appended sequentially to a log file after the protocol extension, and a corresponding index file is generated at the same time. To keep the log files small and make breakpoint resume from them efficient, data synchronized from other data centers is not written to the log.

3. Changes to the synchronization process

Native redis data synchronization comes in two forms, full synchronization and partial synchronization, and each master node holds an in-memory ring replication buffer. The initial synchronization is a full one; on breakpoint resume, partial synchronization is attempted first: the master tries to serve the resume from its ring replication buffer, and if that succeeds, incremental synchronization continues once the buffered data has been sent. If it fails, a full synchronization must still be performed before incremental synchronization can resume.

Because a full synchronization requires forking a child process and generating an RDB file in that child, it has a significant performance impact on the master node, so the number of full synchronizations should be kept to a minimum.

To reduce the number of full synchronizations, we modified the redis synchronization process: when a partial synchronization cannot be served from the ring replication buffer, the master first tries to synchronize from the rlog log. If that succeeds, incremental synchronization can continue after the logged data has been replayed; otherwise a full synchronization is still required first.
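The resulting decision order can be summarized with a sketch like the following; backlog and rlog are stand-in components, and the reply strings are the ones described in section 5 below:

```python
def choose_sync_mode(backlog, rlog, slave_offset, slave_id):
    """Sketch of the modified decision order on the master: ring buffer first,
    then rlog, then full sync. backlog and rlog are stand-in components."""
    if backlog.can_partial_sync(slave_offset):
        return "+CONTINUE"       # partial sync served from the ring replication buffer
    if rlog.contains(slave_id):  # assumed helper: is this command id still on disk?
        return "+LCONTINUE"      # partial sync replayed from the rlog files
    return "+FULLRESYNC"         # fall back to RDB-based full synchronization
```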

4. rLog log design

The rlog consists of index files and log files, both written sequentially for performance. In our tests its persistence performance matches that of native redis with aof enabled. Moreover, rlog files are simply deleted periodically, whereas native redis must periodically rewrite the aof file in a child process to keep it from growing without bound, which puts a considerable performance burden on the master node; in essence, rlog's performance impact on redis is therefore smaller than aof's.

Each index file and log file is named after the id of the first command it contains.

Index and log data are first written to an in-memory buffer, then written to the operating system buffer in batches, and the operating system buffer is flushed once per second so the data actually lands on disk. Compared with the aof file buffer, we pre-allocate the rlog buffer as an optimization.
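A minimal sketch of this write path, assuming a pre-allocated buffer and a roughly once-per-second fsync; the class and names are ours, and real redis does this in C inside its event loop:

```python
import os
import time

class RlogWriter:
    """Sketch of the described write path: pre-allocated in-memory buffer ->
    OS buffer (file.write) -> fsync about once per second."""

    def __init__(self, path, bufsize=1 << 20):
        self.f = open(path, "ab")
        self.buf = bytearray(bufsize)     # pre-allocated, as noted above
        self.used = 0
        self.last_fsync = time.monotonic()

    def append(self, data: bytes):
        # Assumes single appends smaller than the buffer.
        if self.used + len(data) > len(self.buf):
            self.flush()                  # drain the buffer when it would overflow
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

    def flush(self):
        # Batch-write the buffered bytes into the operating system buffer ...
        self.f.write(memoryview(self.buf)[:self.used])
        self.used = 0
        # ... and actually land them on disk about once per second.
        if time.monotonic() - self.last_fsync >= 1.0:
            os.fsync(self.f.fileno())
            self.last_fsync = time.monotonic()
```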

1. Index file format

The index file format is shown below; the index entry for each command contains three fields (a small encoding sketch follows the list):

[Figure: ridx index file format]

  • pos: the offset of the command's first byte in the corresponding log file, relative to the start of the log file
  • len: the length of the command in bytes
  • offset: the cumulative replication-buffer offset of the command's first byte on the master node
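The exact on-disk widths are not given in the article, so the following sketch simply assumes fixed-width little-endian fields:

```python
import struct

# Hypothetical fixed-width layout for one ridx entry; only the three fields
# come from the article, the widths and endianness are assumptions.
ENTRY = struct.Struct("<QIQ")   # pos: u64, len: u32, offset: u64

def append_entry(index: bytearray, pos: int, length: int, offset: int):
    index += ENTRY.pack(pos, length, offset)

def locate(index: bytes, n: int):
    """Return (pos, len, offset) of the n-th command recorded in the index."""
    return ENTRY.unpack_from(index, n * ENTRY.size)
```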

2. Log file splitting

To keep any single file from growing without bound, redis periodically splits the files it is writing. The split is driven by two dimensions: file size and time.

The default thresholds are: when the log file reaches 128M in size, or when an hour has passed and the file holds more than 100,000 entries, writing switches to a new log file and index file.

On each processing cycle, after all data in the memory buffer has been written to the files, redis checks whether the split condition is met; if so, it sets a split flag. On the next cycle, before the memory buffer is written out, the current index and log files are closed and a new index file and log file are created.
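Under our reading of these thresholds, the split check could look like this:

```python
MAX_BYTES   = 128 * 1024 * 1024   # 128M size threshold
MAX_AGE_SEC = 3600                # hourly rotation
MIN_ENTRIES = 100_000             # minimum entry count for the hourly split

def should_split(file_bytes: int, file_age_sec: float, entries: int) -> bool:
    # Split on size alone, or on the hourly tick provided the file
    # has accumulated enough entries.
    return file_bytes >= MAX_BYTES or (
        file_age_sec >= MAX_AGE_SEC and entries > MIN_ENTRIES
    )
```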

3. Log file deletion

To keep the number of log files from growing indefinitely and exhausting disk space, and because no log rewriting is done (resuming across too many files would be both inefficient and pointless), redis periodically deletes old log files and their index files.

By default, log files are kept for at most one day: redis periodically deletes log and index files older than that. In other words, a data-center-level failure of up to one day can be tolerated; beyond that, a full synchronization of the data center's data is required.

During a breakpoint resume, if data must be synchronized from the log files, log deletion is temporarily disabled before the synchronization starts and re-enabled after it completes, so that data being synchronized cannot be deleted out from under the resume.
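A sketch of the cleanup loop, including the pause during resumes; the names and retention constant follow the description above:

```python
import glob
import os
import time

RETENTION_SEC = 24 * 3600    # keep rlog/ridx files for at most one day
deletion_paused = False      # set True while a breakpoint resume reads the rlog

def purge_expired(log_dir: str):
    """Sketch of the periodic cleanup; skipped entirely while a resume is
    in flight so the data being replayed cannot be deleted underneath it."""
    if deletion_paused:
        return
    now = time.time()
    for path in glob.glob(os.path.join(log_dir, "*")):
        if now - os.path.getmtime(path) > RETENTION_SEC:
            os.remove(path)
```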

5. Redis data synchronization

1. Breakpoint resume

As mentioned above, to tolerate longer data-center-level failures, improve cross-data-center disaster recovery, and speed up recovery after a data center failure, we modified the redis synchronization process: when a partial synchronization cannot be served from the ring replication buffer, the master first tries to synchronize from the rlog. The flow chart is as follows:

[Figure: partial replication flow of a multi-active redis instance]

First, after connecting to the master node, the synchronization tool must not only authenticate but also send a replconf capa command to tell the master node that it is capable of resuming through the rlog. The steps are as follows; a replica-side sketch of the whole handshake follows the list.

  1. The slave node first sends psync runId offset. On its first start it sends psync ? -1, and the master node returns a runId and an offset.

  2. If synchronization can be served from the replication buffer, the master node replies to the slave node with +CONTINUE runId.

  3. If it cannot be served from the replication buffer, the master node replies to the slave node with +LPSYNC.

  4. If the slave node receives +CONTINUE, it simply keeps receiving incremental data and keeps updating its offset and command id.

  5. If the slave node receives +LPSYNC, it then sends LPSYNC runId id to the master node.

  6. When the master node receives the LPSYNC command, if it can continue synchronizing from the rlog, it replies to the slave node with +LCONTINUE runId.

    On receiving +LCONTINUE, the slave node can set its offset to LONG_LONG_MIN, or simply stop updating the offset for the subsequent data, and keeps receiving the incremental data replayed from the rlog.

    When the incremental data from the rlog has been fully transmitted, the master node sends an lcommit offset command.

    While parsing the stream, when the slave node sees the lcommit command it updates its local offset, and subsequent incremental data resumes advancing the offset. The lcommit command itself is not synchronized to the peer (it is marked with id < 0, and no command with id < 0 is synchronized to the peer).

  7. Otherwise, the master node replies to the slave node with +FULLRESYNC runId offset, and a full synchronization follows.
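Putting the steps together, here is a hedged, replica-side sketch of the handshake; conn.send and conn.readline are assumed helpers, and the exact capa token advertised via replconf is our assumption:

```python
def resume_handshake(conn, run_id, offset, last_id):
    """Replica-side sketch of the extended handshake; the reply strings
    follow the numbered steps above."""
    conn.send("REPLCONF capa rlog")                  # declare rlog-resume capability
    conn.send(f"PSYNC {run_id or '?'} {offset if run_id else -1}")
    reply = conn.readline()
    if reply.startswith("+CONTINUE"):
        return "backlog"      # resume from the ring buffer; keep updating offset/id
    if reply.startswith("+FULLRESYNC"):
        return "full"         # RDB snapshot first, then incremental data
    if reply.startswith("+LPSYNC"):
        conn.send(f"LPSYNC {run_id} {last_id}")      # retry, addressed by command id
        reply = conn.readline()
        if reply.startswith("+LCONTINUE"):
            return "rlog"     # stream from rlog; offset frozen until lcommit arrives
        return "full"         # master could not serve from the rlog either
    raise RuntimeError(f"unexpected reply: {reply}")
```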

2. Idempotence

To improve performance, the migration tool does not save the synchronization offset and command id to zk in real time, but checkpoints them periodically (every second by default). On a breakpoint resume, the migration tool therefore reads from zk the last offset checkpointed before the disconnection and resumes synchronizing with the master node from there, so some data in between may be sent again. To keep the data consistent, executing the same command multiple times must therefore be idempotent.

To make redis commands idempotent, some non-idempotent commands in redis were modified. The specific design and the modified commands are shown below:

[Figure: design of the idempotent command modifications]

Note: list-type commands have not been modified yet and remain non-idempotent.
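The figure with the concrete command list is not reproduced here, but native redis itself already uses this style of transformation when replicating (for example, SPOP is propagated as SREM). A hedged sketch of the idea; the concrete design in the missing figure may well differ:

```python
def rewrite_for_replication(db, cmd):
    """Hedged sketch: propagate the *effect* of a non-idempotent command as an
    absolute-state command, so replaying it a second time changes nothing."""
    name, key, *args = cmd
    if name == "INCR":
        new_val = int(db.get(key, "0")) + 1
        db[key] = str(new_val)
        return ("SET", key, str(new_val))   # replay-safe: SET is idempotent
    if name == "SPOP":
        member = db[key].pop()              # remove one member locally
        return ("SREM", key, member)        # removing it again is a no-op
    return cmd                              # SET, DEL, ... are already idempotent

# e.g. a local INCR on counter=41 is synchronized as ("SET", "counter", "42")
```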

3. Data loopback processing

Data loopback here means that data read from redis in data center A by the synchronization tool and written into data center B via MQ is picked up again by data center B's synchronization tool and synchronized back into data center A, so that data is replicated in a loop.

To prevent this, an id field is added to the header of the data synchronized to slave nodes and to the migration tool; the id identifies where data came from and whether it needs to be synchronized to the remote end. Data written by local business clients must be synchronized to the remote data center and is assigned an id greater than 0; data coming from other data centers is assigned an id less than 0; and some command data used only for master-slave heartbeat interaction is likewise assigned an id less than 0.

After parsing the data, the synchronization tool filters out all commands with id less than 0 and writes only data with id greater than 0, i.e. the data written by local business clients, to the remote end. Data that originated in another data center is never written back to a remote data center.
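The filter itself is then trivial; a sketch:

```python
def should_forward(cmd_id: int) -> bool:
    """Synchronization-tool filter: only locally written data crosses data centers.
    id > 0: written by local business clients, forward to the remote end.
    id < 0: data from other data centers, master-slave heartbeats, and other
            internal commands; never forwarded, which breaks the loop."""
    return cmd_id > 0
```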

4. Expired and evicted data

At present, expiration and eviction are handled independently by the redis nodes of each data center, and the deletions they produce are not synchronized: the delete commands generated by expiration and eviction carry ids less than 0 and are filtered out by the synchronization tool.

Problems if the deletions were synchronized

Why not synchronize them? The in-memory hash table does not record which data center a piece of data came from, so expired or evicted data may itself have originated in another data center. Synchronizing such deletions to remote data centers would create exactly the double-write-conflict scenario described earlier, and double-write conflicts lead to data inconsistency.

Problems with not synchronizing the deletions

For expired data, not synchronizing the deletions may make the data visible in different data centers temporarily inconsistent, but each data center expires the key on its own, so the data converges eventually and no dirty reads occur.

For evicted data, under the current no-synchronization scheme an eviction does leave the data centers inconsistent. For now this is avoided purely by operational means, such as pre-allocating ample capacity and responding promptly to memory-usage alarms, so that eviction never actually happens.

5. Data migration

In redis cluster mode, slot and data migration is generally needed when the cluster is scaled out horizontally by adding master nodes.

Data migration in a redis cluster is performed slot by slot: all data in a slot is migrated from the source node to the target node, and the slot is then marked as owned by the new target node. As each key is migrated it is deleted on the source node, with the migrate command replaced by a del command; the data itself is transferred by having the source node send a restore command to the target node.

Our migration strategy remains that each data center completes its scaling and data migration independently, and the del and restore commands generated during migration are not synchronized across data centers: both the replacing del command and the restore command sent to the target node are assigned ids less than 0, so the synchronization tool filters them out.

6. Redis performance

In our tests, the redis multi-active instance (with the rlog enabled by default) performs essentially the same as a native redis instance with aof persistence enabled, as shown in the following figure:

[Figure: redis-benchmark comparison between the multi-active instance and native redis]

Note: the chart above was produced with redis-benchmark, with the client and server running on the same machine.

7. Items to be optimized

1. Multiple write conflicts

The problem of conflicting writes to the same key when multiple data centers write simultaneously is not solved yet.

The planned solution is the CRDT protocol. CRDTs (Conflict-Free Replicated Data Types) are a theoretical distillation of eventually consistent algorithms for the basic data structures: replicas can be merged automatically according to fixed rules, conflicts are resolved, and a strong form of eventual consistency is achieved.

The current workaround is for the business to partition its writes across data centers so that the same key is never written in more than one place, ensuring no conflicts arise.

2. List type idempotency

Among the five basic types, most list operations are non-idempotent, and no idempotent transformation has been done for them yet. We recommend either not using lists or having the business itself guarantee the idempotence of its list operations.

3. Consistency of expired and evicted data

As discussed above, if evicted data is not synchronized across data centers, the data centers end up inconsistent; if it is synchronized, the same key may suffer multi-write conflicts, which also produce inconsistency.

The current approach is to estimate the business's memory requirements well in advance, pre-allocate amply, and respond promptly to memory-usage alarms, so that eviction never actually occurs.


Author: Luo Ming, Jingdong Technology
For more technical best practices and innovations, follow the "Jingdong Technology Tech Talk" WeChat official account.
