Computer room disaster recovery solution based on MySQL multi-channel master-master replication

This article surveys several MySQL high-availability technologies, then describes how multi-channel master-master replication was selected to meet our own requirements and the precautions to take when using it.

Author: Xu Liang is currently a senior database manager at the China Mobile Smart Home Operation Center. He has many years of experience in database operations, maintenance, and optimization, and previously served as a senior DBA at Huawei and at first-tier Internet companies. He is currently responsible mainly for database stability, disaster recovery optimization, multi-site active-active deployment, and related work for the large-scale, value-oriented operations of China Mobile Smart Home.

Produced by the ActionTech (爱可生) open source community. Original content may not be used without authorization. To reprint, please contact the editor and credit the source.

This article is approximately 2,800 words long and should take approximately 7 minutes to read.

Background introduction

In the era of cloud-network integration and big data, data has become an important factor of production. Man-made and natural disasters such as the PRISM scandal, EternalBlue, and the Wenchuan Earthquake caused large-scale data loss and leakage. Since then, China has successively promulgated a series of laws and regulations that govern how organizations must protect their data, such as the "Cybersecurity Law of the People's Republic of China" promulgated in 2016 and the "Data Security Law" passed by the National People's Congress in 2021.

When a disaster occurs, disaster recovery backups ensure that data is not lost. A key to application-level disaster recovery is real-time database synchronization and replication, so that when the computer room at location A fails, the business can be migrated smoothly and quickly to location B. Remote replication introduces some delay, but it can basically meet business continuity requirements.

Basic overview of disaster recovery

Definition of disaster recovery

Disaster recovery means ensuring that when any kind of unforeseen disaster hits the data center, little or no data is lost, and the IT business systems either keep running without interruption or can be switched over and restored quickly.

Disaster recovery metrics

Two important indicators to evaluate the reliability of a disaster recovery system are RTO and RPO.

RTO (Recovery Time Objective). RTO is the time from the moment a disaster brings the system down and suspends the business to the moment the system is restored far enough to support the business departments and the business resumes. Put simply, RTO is the recovery time a business can tolerate.

RPO (Recovery Point Objective). RPO is the point in time before the disaster to which the disaster recovery system can restore data. It indicates how much production data an enterprise loses after a disaster. Put simply, RPO is the maximum amount of data loss an enterprise can tolerate.

RTO targets the loss of service time, and RPO targets the loss of data. They are the two main indicators for measuring the disaster recovery system, but they are not necessarily related.
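As a rough illustration of the two metrics, the sketch below computes the RPO and RTO actually achieved in one incident from three timestamps and compares them with targets. The function names, the timestamps, and the target values are all hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta

def measure_recovery(last_replicated, outage_start, service_restored):
    """Return (achieved RPO, achieved RTO) for a single incident."""
    # Data written after the last replicated point is lost: this is the RPO side.
    data_loss_window = outage_start - last_replicated
    # Time during which the business could not run: this is the RTO side.
    downtime = service_restored - outage_start
    return data_loss_window, downtime

def meets_objectives(rpo_actual, rto_actual, rpo_target, rto_target):
    """Both objectives must hold; meeting one says nothing about the other."""
    return rpo_actual <= rpo_target and rto_actual <= rto_target

# Hypothetical incident: replication lagged 2 minutes, service was back after 20.
rpo, rto = measure_recovery(
    last_replicated=datetime(2023, 1, 1, 11, 58),
    outage_start=datetime(2023, 1, 1, 12, 0),
    service_restored=datetime(2023, 1, 1, 12, 20),
)
ok = meets_objectives(rpo, rto,
                      rpo_target=timedelta(minutes=30),
                      rto_target=timedelta(hours=1))
print(rpo, rto, ok)
```

The two values are computed from different intervals, which is why a system can have a very small RPO and still a long RTO, or the reverse.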

Disaster recovery level classification

The national standard GB/T 20988-2007, which officially took effect on November 1, 2007, is the first national standard for China's disaster backup and recovery industry.

Level | Description
Level 1 | Basic level. Backup media is stored off-site to ensure security and is verified regularly.
Level 2 | Backup site support. Network and business processing systems can be deployed to the backup center within a predetermined time.
Level 3 | Electronic transfer and partial device support. The disaster recovery center is equipped with some business processing and network equipment and has some communication links.
Level 4 | Electronic transfer and full device support. Data is transmitted in batches at regular intervals, and the network and systems are always ready. Warm backup center mode.
Level 5 | Real-time data transfer and full device support. Remote replication technology is used to replicate data in real time, the network has automatic or centralized switching capabilities, and the business processing system is ready or running.
Level 6 | Zero data loss and remote cluster support. Real-time data backup with zero loss, remote clustering of systems and applications, automatic switching, and users can access the main and backup centers at the same time.

The relationship between disaster recovery levels and RTO/RPO

Disaster recovery capability level | RTO | RPO
1 | More than 2 days | 1 day to 7 days
2 | More than 24 hours | 1 day to 7 days
3 | More than 12 hours | Several hours to 1 day
4 | Several hours to 2 days | Several hours to 1 day
5 | Several minutes to 2 days | 0 to 30 minutes
6 | Several minutes | 0

Disaster recovery in three centers in two places

Three centers in two places combines local high availability, a same-city disaster recovery center, and an off-site disaster recovery center to improve availability and business continuity. Key businesses often adopt the "three centers in two places" construction plan (i.e., a production data center, a same-city disaster recovery center, and an off-site disaster recovery center). In this mode, the data centers have an active-standby relationship, and the disaster response and switchover cycle are handled flexibly according to the abnormal situation, which achieves better overall RTO and RPO goals.

Common MySQL master-slave topologies

MySQL has built-in master-slave replication, which addresses several key issues such as data consistency, checkpointing, and reliable network transmission, and helps us achieve high-availability switchover and read-write separation.

One master and one slave

One master and one slave provides a standby database. If the master database fails, a failover to the slave avoids data loss.

One master, many slaves

One master with multiple slaves is a common architecture that is simple and effective. It provides HA and also allows read-write separation, which improves the concurrency capability of the cluster.

Multiple masters and one slave

Multiple masters and one slave can back up multiple MySQL databases to a server with better storage performance to facilitate unified analysis and processing.

Dual master replication

Dual-master replication means mutual master-slave replication: each server is both a master and a slave of the other, so changes made on either side are applied to the other side through replication. At any given time only one server acts as the master and the other is the standby. A planned switchover for instance maintenance requires no special configuration and completes within seconds, which makes routine upgrades and maintenance easier.

Cascade replication

In cascade replication mode, some slaves synchronize data from another slave rather than from the master node. If the master node has too many slaves, it spends part of its capacity on serving replication. In that case, 3 to 5 slaves can be connected to the master node, and the remaining slaves can be connected to those slaves as second-level or third-level nodes. This relieves pressure on the master node and has no negative impact on data consistency.
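The fan-out idea above can be sketched as a small planning helper that assigns each slave an upstream so that no node serves more than a fixed number of direct slaves. This is only an illustrative sketch: the function name, the node names, and the breadth-first assignment order are assumptions, not something MySQL provides.

```python
from collections import deque

def build_cascade(master, slaves, max_fanout=5):
    """Map each slave to an upstream node, capping direct slaves per node."""
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    upstream = {}
    load = {master: 0}
    feeders = deque([master])  # nodes that may still accept slaves, nearest level first
    for slave in slaves:
        # Move on to the next feeder once the current one is full.
        while load[feeders[0]] >= max_fanout:
            feeders.popleft()
        parent = feeders[0]
        upstream[slave] = parent
        load[parent] += 1
        load[slave] = 0
        feeders.append(slave)  # the new slave can feed a lower level later
    return upstream

# Hypothetical cluster: seven slaves, at most three direct slaves per node.
plan = build_cascade("m", ["r1", "r2", "r3", "r4", "r5", "r6", "r7"], max_fanout=3)
for node, parent in plan.items():
    print(node, "replicates from", parent)
```

With these inputs the first three slaves attach to the master and the rest become second-level nodes, which matches the layering described in the paragraph.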

MySQL master-slave replication in three centers in two places

Advantages and Disadvantages of Common MySQL High Availability Solutions

The current mainstream database high-availability solutions each have their own advantages and disadvantages, but none of them is simple and easy to use when it comes to remote disaster recovery:

High availability solution | Advantages | Disadvantages
Master-Slave + Keepalived | Deployment is simple, and there is no master election problem after the master instance goes down. | After a switchover in a one-master, multiple-slave setup, the other slave instances must be reconfigured to connect to the new master.
MHA | Supports one master and multiple slaves, and does not cause data inconsistency when the master service crashes. | Relies on SSH, which carries security risks, and is no longer officially maintained.
Group Replication (MGR) | No delay, strong data consistency. | Strongly dependent on the network and usable only in GTID mode. Large transactions and DDL operations carry a risk of blocking.
MySQL InnoDB Cluster | Makes up for Group Replication's lack of middleware with automated failover capability. | Many components and few mature cases.
Orchestrator | Supports one master and multiple slaves, removes the single point of failure of the management node, and supports managing replication from the command line and a web interface. | Its functions are complex and difficult to integrate into your own system.

MySQL master-slave initialization message

By capturing packets and analyzing the code, we found that when a MySQL slave establishes a replication channel with its master, it performs, in turn, network connection establishment, authorization, and checks of instance uniqueness, clock, character set, binlog configuration, and so on. During the instance uniqueness check, the slave obtains the server id of the master.

MySQL binlog log structure

MySQL master-slave replication is based on binlog files, and a binlog file consists of multiple binlog events. Each binlog event has three parts: head + data + footer. The head contains the server id of the database instance that generated the event. In master-slave replication, this is an important basis for deciding whether an event was generated by the receiving instance itself.
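To make the role of the head concrete, here is a minimal sketch that decodes the fixed-size event header and reads the server id from it. It assumes the 19-byte little-endian common header of binlog format version 4 (timestamp, type code, server id, event size, next position, flags); the sample bytes are built by hand for the example rather than read from a real binlog file, and the field names are my own.

```python
import struct
from typing import NamedTuple

# Assumed layout of the v4 common header: 4 + 1 + 4 + 4 + 4 + 2 = 19 bytes.
HEADER = struct.Struct("<IBIIIH")

class EventHeader(NamedTuple):
    timestamp: int      # seconds since the epoch
    type_code: int      # event type
    server_id: int      # instance that originally generated the event
    event_size: int     # head + data + footer, in bytes
    next_position: int  # offset of the next event in the file
    flags: int

def parse_header(raw: bytes) -> EventHeader:
    """Decode the event head from the first bytes of a binlog event."""
    if len(raw) < HEADER.size:
        raise ValueError("need at least %d bytes" % HEADER.size)
    return EventHeader(*HEADER.unpack_from(raw))

# Hand-built sample: an event generated by the instance with server id 101.
sample = HEADER.pack(1698710400, 2, 101, 73, 227, 0)
header = parse_header(sample)
print(header.server_id)
```

Because the server id travels with every event, a replica can tell who originally wrote an event no matter how many instances relayed it.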

As described above, the slave learns the server id of the peer master on a replication channel from the master-slave initialization messages. It can then compare that value with the server id of each event it receives over the channel to determine whether the event was generated by the peer master itself.

Two places and three centers MySQL master-slave solution 1

Building three centers in two places is relatively easy, but the replication reconfiguration needed for routine drills and for switching data back is cumbersome and error-prone. This solution establishes MySQL master-master replication inside the production computer room, so a master-slave switchover needs no complicated commands; only read_only has to be set. Master-master replication is also established inside the same-city computer room, which allows disaster recovery drills and switchbacks without complex configuration. In the same way, master-master replication is established between the MySQL instances of the three centers to facilitate drills and switchbacks. The solution uses native MySQL replication, which is highly mature, and it introduces few third-party components, so it is suited to large-scale operation and maintenance. However, when native MySQL replication runs master-master on multiple links, replication loops occur and cause data conflicts and inconsistencies.
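The loop can be illustrated with a toy simulation. It is a simplified model, not MySQL code: it assumes three instances A, B, and C with master-master links A-B and B-C, assumes every replica re-logs applied events with their original server id (as with log_slave_updates enabled), and assumes the stock rule that a replica drops only events carrying its own server id. All names are hypothetical.

```python
from collections import deque

# Each tuple is one replication channel (master, replica).
# A <-> B inside one computer room, B <-> C across computer rooms.
CHANNELS = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]

def native_skip(receiver, event_server_id):
    """Stock rule: a replica only drops events that carry its own server id."""
    return event_server_id == receiver

def simulate(channels, origin, max_deliveries=50):
    """Follow one event written on `origin`; return (apply counts, deliveries)."""
    applied = {}
    pending = deque((m, r) for m, r in channels if m == origin)
    deliveries = 0
    while pending and deliveries < max_deliveries:
        _sender, receiver = pending.popleft()
        deliveries += 1
        if native_skip(receiver, origin):
            continue
        applied[receiver] = applied.get(receiver, 0) + 1
        # The replica re-logs the event with the ORIGINAL server id,
        # so every replica of this receiver gets the event next.
        pending.extend((m, r) for m, r in channels if m == receiver)
    return applied, deliveries

applied, deliveries = simulate(CHANNELS, origin="A")
print(applied, deliveries)
```

In this model the event written on A bounces between B and C until the delivery cap stops the simulation, and both instances apply it many times. Only A ever drops it, because only A recognizes the server id as its own.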

Two places and three centers MySQL master-slave solution 2

To solve the replication loop problem, this solution compares, on the boundary node instance of the computer room, the server id of the peer master with the server id of each event. The replication logic of the IDC1 boundary MySQL instance is restricted so that on each channel it synchronizes only the binlog events generated by the adjacent master and discards events that the adjacent master relayed from other masters. Each replication channel therefore carries the log of a single master only, which solves the loop problem. Other nodes do not need to enable this function.

Boundary node MySQL replication logic code patch

The patch is based on community MySQL 5.7.40. It modifies sys_vars.cc to add the replicate_server_mode configuration item. The default value 0 keeps the original replication behavior. When the item is set to 1, replication synchronizes only the binlog events generated by the peer master on the channel.

The patch also modifies the Log_event::do_shall_skip function in log_event.cc. If the server_id of the current event differs from the server_id of the master at the peer end of the channel, the event is ignored. Only events generated by the peer master are synchronized, which avoids the data loop problem in multi-channel master-master replication.
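The decision described above can be sketched as a small function. This is not the patch itself, which changes C++ code inside MySQL; it is a simplified Python model of the rule, and the function name, parameter names, and server ids below are assumptions made for illustration.

```python
def should_skip(event_server_id, own_server_id, peer_master_server_id,
                replicate_server_mode=0):
    """Model of the skip decision on one replication channel."""
    # Stock behaviour: never re-apply an event this instance generated itself.
    if event_server_id == own_server_id:
        return True
    # Patched behaviour (mode 1): apply only events generated by the master
    # at the peer end of this channel; drop anything it merely relayed.
    if replicate_server_mode == 1 and event_server_id != peer_master_server_id:
        return True
    return False

# Hypothetical ids: A=1 and B=2 in one computer room, C=3 in another.
# B is the boundary node and is looking at its channel from C.
relayed_from_a = should_skip(1, own_server_id=2, peer_master_server_id=3,
                             replicate_server_mode=1)
written_on_c = should_skip(3, own_server_id=2, peer_master_server_id=3,
                           replicate_server_mode=1)
stock_mode = should_skip(1, own_server_id=2, peer_master_server_id=3,
                         replicate_server_mode=0)
print(relayed_from_a, written_on_c, stock_mode)
```

In mode 1 the boundary node drops the event that A wrote and C relayed back, while still applying what C wrote itself. In mode 0 the relayed event would be applied again, which is how the loop arises.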

Summary

This MySQL data synchronization solution optimizes MySQL's own log synchronization mechanism and introduces multi-channel master-master replication, which reduces the complexity of adjusting replication relationships during disaster recovery drills and switchbacks in the computer room. Each channel synchronizes only the binlog events of the adjacent master, which solves the data loop problem and supports two-site, three-center disaster recovery for key businesses. No third-party HA or synchronization components are needed, which reduces the related software, hardware, and network requirements. The patch is under 100 lines of code and only the boundary nodes of the computer room need to be upgraded, so the risk is controllable. The solution is mature, low-cost, simple, and reliable when operating large numbers of instances, and it can be quickly integrated with the company's one-click switching platform. In the future it can also support higher-level disaster recovery requirements such as three sites and five centers. However, it does not support multi-level cascading replication, nor does it offer finer-grained, more flexible control at the column and record levels.

For more technical articles, please visit: https://opensource.actionsky.com/

About SQLE

SQLE is a comprehensive SQL quality management platform that covers SQL auditing and management from development through production. It supports mainstream open source, commercial, and domestic databases, provides process automation for development and operations teams, speeds up releases, and improves data quality.

Get SQLE

Type | Address
Repository | https://github.com/actiontech/sqle
Documentation | https://actiontech.github.io/sqle-docs/
Release notes | https://github.com/actiontech/sqle/releases
Data audit plug-in development documentation | https://actiontech.github.io/sqle-docs/docs/dev-manual/plugins/howtouse

Origin blog.csdn.net/ActionTech/article/details/134134429