In application disaster recovery, do MySQL data tables need to be synchronized across clouds?

I. Background

The main goal of a disaster recovery system is to ensure the continuity of system data and services. When the system fails, the disaster recovery system can quickly restore service and keep the data valid. To guard against natural disasters, man-made disasters, and force majeure, corresponding IT systems are established in the same city or in a remote location. The core work is data synchronization.

This article looks at application-layer disaster recovery and discusses which data tables need to be synchronized across clouds and which do not. A specific case then shows how to sort out synchronization tables and filter tables so that the application layer meets its business disaster recovery requirements.

II. Related terms

The scenario discussed in this article is based on Alibaba Cloud's application-layer disaster recovery and involves the following key terms:
RDS MySQL: MySQL is one of the most popular open source databases in the world. As the database in the open source LAMP stack (Linux + Apache + MySQL + Perl/PHP/Python), it is widely used in many application scenarios. Alibaba Cloud RDS MySQL provides stable, high database performance through deep kernel optimization and dedicated instances. Its flexible deployment architectures and product forms meet database requirements in different scenarios.

DTS: Data Transmission Service (DTS) supports data transmission between data sources such as relational databases (MySQL and others), NoSQL databases, and big data (OLAP) systems. It integrates data migration, data subscription, and real-time data synchronization. DTS is designed for long-distance, millisecond-level asynchronous data transmission in public cloud and hybrid cloud scenarios, and can be used to build a secure, scalable, and highly available (disaster-tolerant) data architecture.

ASR: ASR-DR (Apsara Stack Resilience Disaster Recovery) is a cloud product that provides disaster recovery functions and supports disaster recovery management for RDS MySQL. It is a graphical, interactive switchover tool built to perform disaster recovery switchover quickly and keep the RTO (recovery time objective) as low as possible when a disaster occurs.

Synchronization tables: In this article, the RDS MySQL databases and data tables that must be replicated from one cloud to another, that is, synchronized across clouds.

Filter tables: In this article, the RDS MySQL databases and data tables that cannot be, or do not need to be, replicated from one cloud to another.

Application configuration table: In this article, an application-layer data table in RDS MySQL that records application configuration, such as IP addresses, domain names, and the on/off status of scheduled tasks.

Sequence: A globally unique serial number (ID). It is widely used in distributed systems, for example as a transaction serial number or user ID, and matters for lookup, data storage, and fast retrieval. This ID is often the primary key in the database, so its generator must guarantee global uniqueness, handle high concurrency, and tolerate single points of failure. To improve performance, the application layer usually obtains a batch of serial numbers (for example, 10,000) from the database at a time and keeps them in application memory to avoid frequent database access. Once the serial numbers in memory are used up, it obtains a new batch from the database.
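The batch-allocation pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's actual code: the class names SequenceStore and SequenceAllocator are hypothetical, an in-memory object with a lock stands in for the database table that holds the next available serial number, and the first batch is fetched on first use rather than at startup.

```python
import threading


class SequenceStore:
    """Stand-in for the database row that holds the next available serial number."""

    def __init__(self, next_available: int = 1) -> None:
        self._next_available = next_available
        self._lock = threading.Lock()  # plays the role of the database lock

    def reserve(self, batch_size: int) -> range:
        """Atomically reserve a batch and advance the stored value."""
        with self._lock:
            start = self._next_available
            self._next_available += batch_size
            return range(start, start + batch_size)


class SequenceAllocator:
    """Application-side cache that hands out IDs from the reserved batch."""

    def __init__(self, store: SequenceStore, batch_size: int = 10_000) -> None:
        self._store = store
        self._batch_size = batch_size
        self._batch = iter(())  # empty until the first ID is requested
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            try:
                return next(self._batch)
            except StopIteration:
                # The cached batch is used up: reserve a new one from the store.
                self._batch = iter(self._store.reserve(self._batch_size))
                return next(self._batch)


# Two application instances share one store; a tiny batch size keeps the demo short.
store = SequenceStore(next_available=1)
app_a = SequenceAllocator(store, batch_size=3)
app_b = SequenceAllocator(store, batch_size=3)
ids_a = [app_a.next_id() for _ in range(4)]  # forces app_a to reserve a second batch
ids_b = [app_b.next_id() for _ in range(2)]
```

Because every batch is reserved through the shared store, the two instances never hand out the same ID. The guarantee holds only while both instances see the same stored value, which is exactly what breaks in the case study later in this article.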

III. Key technical issues with filter tables in application disaster recovery

Why sort out the filter tables that are not synchronized across clouds?

Non-disaster-tolerant applications

  • Resource limitations of the standby center: In real projects, limited resources in the standby center mean that some application systems cannot be deployed there. The databases and data tables of these non-disaster-tolerant applications do not need to be synchronized.
  • Temporary operations backups do not need to be synchronized: In daily operations and maintenance, DBAs usually take temporary backups of databases or tables before making changes. These temporary backups are already backed up in the background by the Alibaba Cloud RDS MySQL cluster itself, so users do not need to synchronize them across clouds as well. Excluding them reduces the bandwidth used by the synchronization link and the management workload during disaster recovery switchover.
  • Applications that do not yet support disaster tolerance: Cloud products build up disaster tolerance capabilities over time. Some cloud products do not yet have these capabilities at the project delivery stage, but user applications depend on them. Such applications cannot take part in disaster recovery drills for now, so their databases and data tables can also be left unsynchronized for the time being. Once every cloud product along the application's request path supports disaster tolerance, data synchronization can be enabled.

Configuration tables that differ between centers

  • How applications store configuration: To manage code and configuration separately, application systems usually store configuration parameters outside the code. Common forms include configuration files, RDS MySQL databases, and dedicated configuration centers (which themselves use RDS MySQL as backing storage). Hard-coding configuration parameters such as IP addresses and domain names in the code should be avoided.
  • Environment parameters: When application software uses cloud products such as RDS MySQL, OSS, and SLB, it connects to them through IP addresses, domain names, accounts and passwords, and AK/SK pairs.
  • Application parameters: Some functions may run in only one center, and their switches are controlled by field values in data tables. For example, some scheduled tasks periodically make batch calls to external organizations. If the scheduled tasks of both centers run at the same time, the external organization's batch processing may be executed twice, and whether that is acceptable depends on whether the external organization supports repeated execution of the same batch. The configuration tables for these scheduled tasks therefore need to be configured separately in each center.
  • Configuration for intra-city disaster tolerance: The environment parameters described in the second point are the same by default. The two centers of one cloud in the same city are close together (less than 100 kilometers), the application is deployed in two availability zones of that cloud, and the cloud product connection information is identical. The deployed application therefore accesses the same environment parameters in both centers, and relatively few environment parameters need to be sorted out.
  • Configuration for remote disaster recovery: The environment parameters described in the second point differ. The two centers belong to two clouds in different locations and are far apart (more than 100 kilometers), the applications are deployed in availability zones of the two clouds, and the cloud product connection information differs. The deployed application therefore accesses different environment parameters in each center, and each application needs to sort out its differing environment parameters separately. The data tables that hold these parameters must not be synchronized across clouds; otherwise application deployment will fail.
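The classification rule for remote disaster recovery can be expressed as a short sketch. It assumes each table has already been labeled by hand; the ConfigTable type, the split_for_remote_dr function, and the table names are all hypothetical and only illustrate the reasoning above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigTable:
    name: str
    holds_environment_params: bool = False      # IPs, domain names, accounts, AK/SK
    holds_single_center_switches: bool = False  # e.g. scheduled-task on/off flags


def split_for_remote_dr(tables):
    """Split tables into (synchronize, filter) name lists for remote disaster recovery."""
    sync, filtered = [], []
    for table in tables:
        differs_per_center = (
            table.holds_environment_params or table.holds_single_center_switches
        )
        # Tables whose content must differ per center are excluded from sync.
        (filtered if differs_per_center else sync).append(table.name)
    return sync, filtered


tables = [
    ConfigTable("app_env_params", holds_environment_params=True),
    ConfigTable("scheduled_task_switches", holds_single_center_switches=True),
    ConfigTable("product_catalog"),
]
sync_tables, filter_tables = split_for_remote_dr(tables)
```

Here only product_catalog ends up in the synchronization list. The two configuration tables are filtered because their contents must differ between the centers.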

Business tables that need to be double-written

  • Dual-write scenarios: a) Business traffic is processed in both centers at the same time (application-layer active-active), so data tables must be written in both centers at once. b) The application records data such as microservice call logs at runtime. Ideally, an application writes to the database only when it is processing business traffic. In real projects there are exceptions: even without traffic, the application in the standby center periodically writes some data, such as microservice call logs, scheduled task logs, and the update of the globally unique Sequence when the application starts. In dual-write scenarios, RDS MySQL in both the primary center and the standby center must grant read and write permissions.
  • Intra-city active-active scenario: In the active-active architecture of one cloud in the same city, the primary and standby centers expose the same cloud product connection information to the application layer, and applications have write permission on RDS MySQL.
  • Remote active/standby scenario: In the active/standby architecture of two remote clouds, RDS MySQL in the primary center grants the application layer read and write permissions, while RDS MySQL in the standby center grants read-only permission. This permission strategy cannot satisfy the dual-write requirement in the first point, so the dual-written tables need to be sorted out as filter tables, application by application.

How to sort out the data tables that are not synchronized across clouds?

In projects, application developers tend to focus on implementing business logic, and their understanding of cloud product best practices and disaster tolerance capabilities may differ from what we expect. Sorting out the filter tables is mainly done by application developers, and two questions commonly come up in the process.

  • During design and development, what should developers do to reduce the number of unsynchronized filter tables needed for disaster recovery?
  • During deployment and operations, from which angles should operations staff verify that the filter table list is complete and correct?

If the filter tables are sorted out incorrectly, how does that affect application-layer disaster recovery drills?

Projects are often constrained by the schedule and by the need to keep the production system stable. Even when application developers and cloud platform vendors know the design and development best practices, it is hard to complete the rework in the time available. During deployment and operations, sorting out the filter tables and preparing an emergency plan are therefore key tasks for the disaster recovery drill.

Let's analyze how an incorrect filter table list might affect application-layer disaster recovery.

Impact on non-disaster-tolerant applications:

  • Almost none. As analyzed earlier, non-disaster-tolerant applications do not need cross-cloud data backup, and any data that is backed up should not be used for production purposes by applications in the standby center.

Impact on disaster recovery applications:

  • When the application is deployed in the standby center and fails to start, incorrect environment parameters can be identified at that point. The countermeasure is to stop synchronizing the corresponding data table, correct the read and write permissions, and then continue deployment.
  • When testing application functions in the standby center, focus on background scheduled tasks and other non-business requests that write to RDS MySQL, and correct the filter table list during the testing phase.
  • When a disaster recovery switchover drill is performed while the production system is running, an incorrect filter table list in the remote disaster recovery architecture may cause primary key conflicts in the database and, in turn, failed business writes. Recovery options at that point include following the emergency plan, stopping the drill, adding the table back to synchronization, or modifying field values in the data table and restarting the application. Correct the filter table list before the next drill. A case later in this article illustrates this scenario.

IV. Designing unsynchronized data tables in application disaster recovery

The previous section explained why some tables should not be synchronized in application disaster recovery. This section discusses how to sort out and set up filter tables. The analysis describes an ideal situation; real projects will differ in places.

Cloud platform perspective

  • Understand the cloud platform's capabilities: Mainstream cloud vendors all offer RDS MySQL products, but their disaster tolerance capabilities differ across multiple availability zones in the same city and across multiple regions. Users need to know, for both the intra-city and the remote case, whether each vendor synchronizes data automatically in the background, through a tool (such as Alibaba Cloud DTS), or manually.
  • How to configure filter tables: When a synchronization link is created for an RDS MySQL instance, Alibaba Cloud DTS supports specifying which databases and data tables are not synchronized.
  • Automatic filter table configuration: A disaster recovery drill involves switching the primary and standby roles, which reverses the direction of data synchronization; we call the two directions forward synchronization and reverse synchronization. When the direction is reversed, the disaster recovery switchover platform needs to configure the filter tables automatically. Alibaba Cloud ASR-DR saves the filter table list when a synchronization link is first created and automatically applies it to the new link every time the synchronization direction is switched.
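The idea of carrying the saved filter table list across a direction switch can be sketched as below. This is not the DTS or ASR-DR API: SyncLink, reverse, tables_to_sync, and the table names are hypothetical and only model the behavior described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SyncLink:
    source: str
    target: str
    filter_tables: frozenset  # tables excluded from synchronization


def reverse(link: SyncLink) -> SyncLink:
    """Swap source and target while keeping the saved filter table list."""
    return replace(link, source=link.target, target=link.source)


def tables_to_sync(all_tables, link: SyncLink) -> list:
    """Return the tables the link replicates, in sorted order."""
    return sorted(set(all_tables) - link.filter_tables)


all_tables = ["orders", "app_env_params", "scheduled_task_switches"]
forward = SyncLink(
    source="cloud-a",
    target="cloud-b",
    filter_tables=frozenset({"app_env_params", "scheduled_task_switches"}),
)
backward = reverse(forward)  # the link used after a switchover
```

The reversed link filters exactly the same tables as the forward link, so a switchover does not silently start replicating per-center configuration.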

For details, see the public documentation of the Alibaba Cloud Data Transmission Service (DTS).

Application layer perspective

Next, we walk through the stages that application developers go through and analyze how to deliver application software that supports cloud-based disaster recovery.

1. Design phase:

  • Design for cloud-based disaster tolerance. Assume the application will eventually be deployed on two or more clouds, possibly from different vendors. The data-layer synchronization that specialized storage hardware performed in earlier disaster recovery architectures based on IOE (IBM, Oracle, EMC) does not carry over to multi-cloud scenarios, and expensive Oracle licenses are unacceptable for many enterprises.
  • Reserve an identification parameter for each cloud and each center that indicates which cloud a given configuration applies to. The configuration center manages which cloud's parameters take effect in the current runtime environment, so the application code does not need to know which cloud it is running on.
  • Identify the functions that may run on only one of the clouds and give each of them a switch. Make the switches configurable through the configuration center so that changes take effect dynamically. Pay particular attention to scheduled tasks.
  • Expose these function switches in a graphical console (a "white-screen" interface), so that operations staff can act quickly in the short, urgent window of a disaster recovery switchover instead of calling around to ask which database, table, and field controls a particular scheduled task.
  • Record the filter table list and keep it up to date.
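A per-cloud identifier combined with a function switch might look like the following sketch. The CONFIG dictionary stands in for a configuration center; the cloud IDs, host names, and the batch_job_enabled key are illustrative assumptions, not values from any real deployment.

```python
# Hypothetical per-cloud configuration, keyed by a cloud identifier.
CONFIG = {
    "cloud-a": {"rds_host": "rds.cloud-a.example.internal", "batch_job_enabled": True},
    "cloud-b": {"rds_host": "rds.cloud-b.example.internal", "batch_job_enabled": False},
}


def load_config(cloud_id: str) -> dict:
    """Return the configuration that applies to the given cloud."""
    try:
        return CONFIG[cloud_id]
    except KeyError:
        raise ValueError(f"unknown cloud id: {cloud_id!r}") from None


def run_batch_job_if_enabled(cloud_id: str, job) -> bool:
    """Run the scheduled job only where its switch is on; report whether it ran."""
    if not load_config(cloud_id)["batch_job_enabled"]:
        return False
    job()
    return True


calls = []
ran_on_a = run_batch_job_if_enabled("cloud-a", lambda: calls.append("cloud-a"))
ran_on_b = run_batch_job_if_enabled("cloud-b", lambda: calls.append("cloud-b"))
```

The job code itself never checks which cloud it runs on. Flipping batch_job_enabled for the two clouds during a switchover is enough to move the scheduled task to the other center.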

2. Development stage:

  • Prefer the configuration center for storing parameters. Real projects store configuration in many ways, including configuration centers, configuration files, RDS MySQL, and even addresses, accounts, and passwords hard-coded in the source. Alibaba Cloud EDAS provides a configuration center that supports dynamic configuration, static configuration, and dynamic push after configuration changes, so changes take effect without restarting the application.
  • The address of the configuration center itself can be recorded in the application's configuration file and packaged and released together with the application, because the configuration center service rarely changes after deployment.
  • If the configuration center cannot be used for now and configuration must be managed in RDS MySQL, put the cloud-specific environment parameters in their own data table and the function switches in another separate data table, decoupled from the business tables. This makes the filter tables easier to manage. Pay particular attention to cloud product domain names, IP addresses, accounts and passwords, and AK/SK pairs.

3. Deployment phase:

  • Operations staff and developers should jointly confirm why each filter table was selected and what business reason lies behind it. Check whether more tables than necessary have been configured as filter tables.
  • Log in to each database and check whether the disaster recovery switchover platform ASR-DR has set up the filter tables as expected. With hundreds of filter tables, omissions and errors are likely.
  • Create the conditions to verify business functions in the standby center in advance. Focus on whether the filter table scenarios behave as expected and whether scheduled tasks run in only one center.
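With hundreds of filter tables, comparing the reviewed list against what is actually configured is easier to do mechanically. The sketch below assumes both lists have already been exported as plain table names; how they are exported from ASR-DR or DTS is not covered here, and diff_filter_tables and the sample names are hypothetical.

```python
def diff_filter_tables(expected, configured) -> dict:
    """Compare the reviewed filter table list with the configured one."""
    expected, configured = set(expected), set(configured)
    return {
        "missing": sorted(expected - configured),     # reviewed but not configured
        "unexpected": sorted(configured - expected),  # configured without review
    }


reviewed = ["app_env_params", "scheduled_task_switches"]
configured = ["app_env_params", "xx_table"]
report = diff_filter_tables(reviewed, configured)
```

Any entry under "unexpected" deserves a closer look, since it is a table that is not being synchronized even though nobody has confirmed a business reason for filtering it.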

4. Operation and maintenance stage:

  • Apply configuration changes to filter tables on both clouds at the same time. When a filter table is changed in the primary center, for example by adding a field or adjusting a field type, the standby center is not aware of the change, and the same modification must be made there manually. Otherwise, after switchover to the standby center, the application will report errors because the table was not updated.
  • Turning a filter table back into a synchronization table. If the early filter table list was wrong and too many tables were filtered, later verification may show that some of them need to be synchronized. In that case, perform a full resynchronization of the table's data and change the table's synchronization flag on the disaster recovery management platform ASR-DR.
  • Turning a synchronization table into a filter table. A table that was synchronized earlier may no longer need synchronization after a business adjustment. In that case, change the table's synchronization flag on the disaster recovery management platform ASR-DR.
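Because structural changes to filter tables are not replicated, a periodic comparison of table definitions between the two centers can catch drift before a switchover. The sketch below assumes the column definitions have already been collected into dictionaries of the form {table: {column: type}}, for example from MySQL's information_schema.columns view; the schema_drift function and the sample data are hypothetical.

```python
def schema_drift(primary: dict, standby: dict) -> dict:
    """Return {table: {column: (primary_type, standby_type)}} for every mismatch."""
    drift = {}
    for table in sorted(set(primary) | set(standby)):
        p_cols = primary.get(table, {})
        s_cols = standby.get(table, {})
        diffs = {
            col: (p_cols.get(col), s_cols.get(col))  # None means the column is absent
            for col in sorted(set(p_cols) | set(s_cols))
            if p_cols.get(col) != s_cols.get(col)
        }
        if diffs:
            drift[table] = diffs
    return drift


primary_schema = {
    "app_env_params": {"id": "bigint", "value": "varchar(512)", "region": "varchar(32)"},
}
standby_schema = {
    "app_env_params": {"id": "bigint", "value": "varchar(255)"},
}
drift = schema_drift(primary_schema, standby_schema)
```

In this example the standby center is missing the region column and still has the narrower value column, which is the kind of mismatch that would surface as an application error only after switchover.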

The following figure shows the configuration logic description of the synchronization table and the filter table in the remote disaster recovery master/backup architecture.

V. Case study

The following case from a remote disaster recovery project describes a business failure caused by an incorrect filter table list and how it was handled, to give readers a more concrete sense of which data tables need to be synchronized across clouds.

(1) Problem description

After the Alibaba Cloud disaster recovery platform ASR-DR performed a disaster recovery switchover for an application (RDS MySQL read and write permissions were switched from Cloud A to Cloud B), business requests handled in the standby center (Cloud B) failed and the database reported a "primary key conflict".

(2) Problem analysis

We analyze how the problem was located, following the order in which it was handled.

1. Analyze the database error "primary key conflict":

  • Confirm that the conflicting field value is the transaction serial number ID. Checking the business data table confirms that a transaction with this ID already exists.

2. Analyze the business request path:

  • Analyze whether abnormal traffic scheduling at the access layer caused double writing. In the remote active/standby architecture, the global load balancer (GSLB) at the access layer ensures that only the primary center receives business request traffic and the standby center receives none. Primary key conflicts caused by both centers writing at once can therefore be ruled out.
  • Analyze whether the primary center's application-layer cache wrote data late, after the active/standby switchover. In the active/standby architecture, ASR-DR sets RDS MySQL in the primary center to read-only before it grants applications in the standby center read and write permissions on RDS MySQL. Even if the primary center's application layer has delayed cache writes, the primary center application has no write permission after the switchover, so no double write can occur. This possibility is ruled out.
  • Analyze whether the serial number was used before the switchover. Logging in to the primary center database shows that the currently available range of the serial number field is [90000000000, 18446744073709551615], which means serial numbers below 90000000000 have already been allocated. The serial number that triggered the conflict, 80000000000, has a corresponding transaction record in the business table. This confirms that the serial number was already used in the primary center.
  • Analyze how the standby center application obtained its serial numbers. The application log shows that the standby center application obtained serial numbers once, when it first started, and did not fetch from the database again afterwards. The application's in-memory state shows that the standby center is currently using the range [80000000000, 80000009999], which is clearly stale.

Problem conclusion: The standby center application uses an expired transaction serial number ID, which causes a "primary key conflict" when writing to the database.
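The failure can be reproduced in miniature. In this sketch an in-memory SQLite database stands in for RDS MySQL, and the transactions table and its columns are hypothetical. The business table is synchronized, so the standby database already contains the record written by the primary center, while the standby application still holds the stale batch it cached at startup.

```python
import sqlite3

# Standby-center database after synchronization: the business table already
# contains a transaction that the primary center wrote with ID 80000000000.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, note TEXT)")
db.execute(
    "INSERT INTO transactions VALUES (?, ?)",
    (80_000_000_000, "written by primary center"),
)

# Batch cached by the standby application at startup: [80000000000, 80000009999].
stale_batch = range(80_000_000_000, 80_000_010_000)

try:
    db.execute(
        "INSERT INTO transactions VALUES (?, ?)",
        (stale_batch[0], "written by standby center"),
    )
    conflict = False
except sqlite3.IntegrityError:
    # SQLite's counterpart of the MySQL duplicate-key error seen in the case.
    conflict = True

row_count = db.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

The second insert is rejected because the ID is already present, which mirrors the "primary key conflict" reported after the switchover.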

3. Analyze how the problem was introduced:

  • Analyze how the application obtains serial numbers: When the application first starts, it obtains 10,000 available serial numbers from the database and updates both the value in the database and the value in application memory.
  • Analyze the data synchronization mechanism between the primary and standby centers: The data table xx_table manages the globally unique serial number. The synchronization tool DTS keeps its data synchronized between the two centers in real time, and the database takes a lock when the serial number is updated to prevent inconsistencies. In theory, the two centers should never obtain the same serial number.
  • Analyze whether the contents of xx_table are consistent between the two centers: The available serial number range is [90000000000, 18446744073709551615] in the primary center and [80000010000, 18446744073709551615] in the standby center. The two differ, which shows that the table is not being synchronized.
  • Check the data synchronization tool DTS: It is working normally, with no errors or faults found.
  • Check the filter table list: xx_table, which manages the globally unique serial number, should be synchronized across clouds but was configured as a filter table, so its data was not synchronized.
  • Review how the filter table list was sorted out: During preparation for the drill, after operations staff deployed the application in the standby center, business staff found that the transaction function failed verification. The cause was that the application, after obtaining serial numbers at startup, could not write the updated value back to the database because it had no write permission, so initialization of the transaction function failed. In the active/standby architecture, the primary center application has read and write permissions on RDS MySQL by default, while the standby center has read-only permission, yet the standby center application needs write permission on some tables when it starts. To work around this, business staff added xx_table to the list of unsynchronized filter tables. As a result, the table was no longer synchronized from the primary center to the standby center.

Problem conclusion: The data table xx_table, which manages the globally unique serial number, was incorrectly added to the list of filter tables that are not synchronized across clouds.

Emergency measures

  • Manually correct the valid serial number range in xx_table in the standby center to the correct value, [90000000000, 18446744073709551615].
  • Restart the application software of the standby center to trigger the application to obtain the serial number again.

Improvement measures

  • Synchronize the data: xx_table, which manages the globally unique serial number, needs to be synchronized. Remove it from the filter table list so that the valid serial number range stays consistent between the primary and standby centers.
  • Modify the application: While the standby center has read-only access to RDS MySQL, allow the serial number update to fail and let application initialization succeed. After the switchover, once the standby center has RDS MySQL read and write permissions, a business request triggers the application to obtain the latest serial numbers on demand.
  • Test the result:
    • After the primary and standby centers have synchronized data, disconnect the synchronization link and manually set the standby center database to read-only.
    • Redeploy the modified application and verify that, in read-only mode, the application starts successfully and business requests fail (as expected).
    • Manually set the standby center database to read-write, verify that business requests succeed, and check that the application obtains a valid serial number again.
    • Reconfigure the data synchronization link between the primary center and the standby center.
  • Disaster recovery drill: Run another drill to verify the entire business scenario.
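The application change and the test steps above can be sketched together. This is an illustration under stated assumptions rather than the project's actual code: SequenceTable stands in for the synchronized xx_table in the standby center, and ReadOnlyError and LazySequenceClient are hypothetical names.

```python
class ReadOnlyError(Exception):
    """Raised by the stand-in table while the database is read-only."""


class SequenceTable:
    """Stand-in for xx_table in the standby center."""

    def __init__(self, next_available: int, read_only: bool) -> None:
        self.next_available = next_available
        self.read_only = read_only

    def reserve(self, batch_size: int) -> range:
        if self.read_only:
            raise ReadOnlyError("database is read-only")
        start = self.next_available
        self.next_available += batch_size
        return range(start, start + batch_size)


class LazySequenceClient:
    """Tolerates a read-only database at startup and fetches IDs on demand."""

    def __init__(self, table: SequenceTable, batch_size: int = 10_000) -> None:
        self._table = table
        self._batch_size = batch_size
        self._batch = iter(())
        try:
            # Startup prefetch is best-effort: a failure must not block initialization.
            self._batch = iter(table.reserve(batch_size))
        except ReadOnlyError:
            pass

    def next_id(self) -> int:
        try:
            return next(self._batch)
        except StopIteration:
            # No cached IDs: reserve on demand (raises while still read-only).
            self._batch = iter(self._table.reserve(self._batch_size))
            return next(self._batch)


# Standby center before switchover: table synchronized, database read-only.
standby = SequenceTable(next_available=90_000_000_000, read_only=True)
client = LazySequenceClient(standby)  # initialization succeeds

try:
    client.next_id()
    request_failed = False
except ReadOnlyError:
    request_failed = True  # business request fails, as expected in read-only mode

# After switchover the standby database becomes writable.
standby.read_only = False
first_id = client.next_id()
```

Since xx_table is synchronized again, the first ID obtained after the switchover comes from the range the primary center had not yet used, so the stale-batch conflict from the case cannot recur.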

Before improvement

After improvement

VI. Summary

  • A disaster tolerance drill is the starting point for discovering systemic problems, not the end point. Regular drills are needed to keep the system's disaster tolerance capabilities fresh.
  • Cloud platform disaster tolerance is not equivalent to application disaster tolerance; application-level disaster tolerance is a systems engineering effort.
  • Drills test engineering capabilities. On the technical side these include the cloud platform, applications, and the network; on the process side they include fault assessment, disaster recovery decision-making, switchover operations, and emergency plans.

This article is the original content of Alibaba Cloud and may not be reproduced without permission.

Origin blog.csdn.net/weixin_43970890/article/details/114879409