In application disaster recovery, do MySQL data tables need to be synchronized across clouds?

1. Background

An important goal of a disaster recovery system is to ensure the continuity of system data and services. When the system fails, the disaster recovery system can quickly restore services and keep the data valid. To guard against natural disasters, man-made disasters, and force majeure, corresponding IT systems are built in the same city or in a remote location. The core of this work is data synchronization.

This article uses an application-layer disaster recovery scenario to discuss which data tables need to be synchronized across clouds and which do not. Through a specific case, it helps readers sort out synchronization tables and filter tables so that they meet the disaster recovery requirements of the application layer.

2. Related terms

The scenario discussed in this article is application layer disaster recovery built on Alibaba Cloud, involving the following key terms:
RDS MySQL: MySQL is one of the most popular open source databases in the world. As part of the open source software stack LAMP (Linux + Apache + MySQL + Perl/PHP/Python), it is widely used in all kinds of application scenarios. Alibaba Cloud RDS for MySQL provides stable, high database performance through in-depth kernel optimization and exclusive instances. Its flexible deployment architectures and product forms meet database requirements in different scenarios.

DTS: Data Transmission Service (DTS) supports data transmission between data sources such as relational databases (MySQL and others), NoSQL, and big data (OLAP). It integrates data migration, data subscription, and real-time data synchronization in one service. DTS targets long-distance, millisecond-level asynchronous data transmission in public cloud and hybrid cloud scenarios, and can be used to build secure, scalable, and highly available (disaster-tolerant) data architectures.

ASR: ASR-DR (Apsara Stack Resilience Disaster Recovery) is a cloud product that provides disaster recovery functions and supports disaster recovery management of RDS MySQL. It is a graphical switching tool developed to perform disaster recovery switchover quickly and keep the RTO (recovery time objective) as low as possible when a disaster occurs.

Synchronization table: In this article, a database or data table in RDS MySQL that must be backed up from one cloud to another, that is, synchronized across clouds.

Filter table: In this article, a database or data table in RDS MySQL that cannot or need not be backed up from one cloud to another.

Application configuration table: In this article, an application-layer data table in RDS MySQL that records configuration information for the application layer, such as IP addresses, domain names, and the switch status of scheduled tasks.

Sequence: A globally unique serial number ID. It is widely used in distributed systems, for example as a transaction serial number or user ID, and it matters for storing, searching, and retrieving data quickly. This ID is often the primary key of a database table, so it must be globally unique, support high concurrency, and tolerate a single point of failure. To improve performance, the application layer usually obtains a batch of serial numbers (such as 10,000) from the database at a time and keeps them in application memory, which avoids frequent database access. When the serial numbers in memory are used up, the application obtains a new batch from the database.
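
The batch-fetch pattern described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the project: `SequenceStore` stands in for the database row that holds the next free serial number, and every name here (`SequenceStore`, `BatchedSequence`, `BATCH_SIZE`) is hypothetical.

```python
import threading

BATCH_SIZE = 10_000  # batch size taken from the example in the text


class SequenceStore:
    """Stand-in for the database row that records the next free serial number."""

    def __init__(self, next_free: int = 1):
        self._next_free = next_free
        self._lock = threading.Lock()  # plays the role of a row lock

    def reserve(self, count: int) -> range:
        """Atomically reserve `count` serial numbers and advance the counter."""
        with self._lock:
            start = self._next_free
            self._next_free += count
            return range(start, start + count)


class BatchedSequence:
    """Hands out IDs from memory and returns to the store only when the batch runs out."""

    def __init__(self, store: SequenceStore, batch_size: int = BATCH_SIZE):
        self._store = store
        self._batch_size = batch_size
        self._batch = iter(())  # empty until the first request
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            try:
                return next(self._batch)
            except StopIteration:
                # The in-memory batch is used up: reserve a new one from the store.
                self._batch = iter(self._store.reserve(self._batch_size))
                return next(self._batch)
```

Because each application instance caches a batch in memory, uniqueness depends on every instance reserving from the same, up-to-date counter. An instance that reads a stale counter will hand out numbers that are already in use.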

3. Key Technical Issues About Filter Tables in Application Disaster Recovery

Why sort out the filter tables that are not synchronized across clouds?

Non-disaster recovery applications

  • Resource limitations of the standby center: In actual projects, resource limits at the standby center can prevent an application system from being deployed there, so the databases and data tables of these non-disaster-recovery applications do not need to be synchronized.
  • Temporary backup databases and tables created for operation and maintenance: During daily operation and maintenance, DBAs usually make temporary backups before changing a database. Because the Alibaba Cloud RDS MySQL cluster is already backed up in the background, these temporary backup databases and tables do not need to be synchronized across clouds again. Leaving them out reduces the bandwidth used by the synchronization link and the management workload of the disaster recovery switchover.
  • Applications that do not support disaster recovery: Building disaster recovery capabilities into cloud products is a continuous process. Some cloud products do not yet have disaster recovery capabilities at the project delivery stage, but user applications depend on them. These applications cannot take part in disaster recovery drills for the time being, and their databases and data tables can be left unsynchronized for now. Once every cloud product on which the application's end-to-end flow depends supports disaster recovery, data synchronization can be enabled.

Configuration tables that differ between centers

  • Application configuration method: To manage code and configuration separately, application systems usually store and manage configuration parameters on their own. Common forms include configuration files, RDS MySQL databases, and dedicated configuration centers; a dedicated configuration center also uses RDS MySQL as its back-end storage. The practice to avoid is hard-coding configuration parameters such as IP addresses and domain names.
  • Environmental parameters: When the application uses cloud products such as RDS MySQL, OSS, and SLB, it connects through an IP address, domain name, account and password, or AK/SK.
  • Application parameters: Some functions may run in the application of only one center, and these function switches are controlled by field values in a data table. For example, some scheduled tasks periodically trigger batch processing with external organizations. If the scheduled tasks run in both centers at the same time, the external organization's batch may be executed twice, and whether that is acceptable depends on whether the external organization supports repeated execution of the same batch task. The configuration tables of such scheduled tasks need to be configured separately in each center.
  • Configuration for same-city disaster recovery: The environmental parameters described in the second point are the same by default. The two centers of one cloud in the same city are close together (less than 100 kilometers apart), the application is deployed in two availability zones of that cloud, and the cloud product connection information is the same. The deployed application therefore accesses the same environmental parameters in both centers, and few environmental parameters need to be sorted out.
  • Configuration for remote disaster recovery: The environmental parameters described in the second point differ. The two centers are on two clouds in different locations, far apart (more than 100 kilometers), the application is deployed in availability zones of both clouds, and the cloud product connection information is different. The deployed application therefore accesses different environmental parameters in each center, and each application needs to sort out the differing parameters separately. The data tables that hold these differing environmental parameters must not be synchronized across clouds; otherwise application deployment will fail.

Business tables that need double writes

  • Double-write scenarios: a) Business traffic is processed in both centers at the same time, which is called application-layer active-active, and data tables must be written in both centers. b) The application records data at runtime regardless of traffic, such as microservice call logs. Ideally, an application writes to the database only when it is processing business traffic. In actual projects there are exceptions: even without traffic requests, the application in the standby center writes some data regularly, such as microservice call logs and scheduled task logs, and it updates the globally unique serial number (Sequence) when the application starts. In the double-write scenario, the RDS MySQL instances of both the primary center and the standby center must grant read and write permissions.
  • Same-city active-active scenario: In the active-active architecture of one cloud in the same city, the primary center and the standby center present the same cloud product connection information to the application layer, and the applications have write permission on RDS MySQL.
  • Remote active/standby scenario: In the active/standby architecture of two remote clouds, RDS MySQL in the primary center grants read and write permissions to the application layer, while RDS MySQL in the standby center grants read-only permission. This permission policy cannot satisfy the double-write requirement in the first point. The double-written tables therefore have to be sorted out per application and configured as filter tables.
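
The criteria in this section can be condensed into a small review helper. The sketch below is only an illustration under stated assumptions: the `TableInfo` attributes are hypothetical labels that an application team would fill in by hand, and the function lists candidate reasons for filtering rather than making the final decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical description of one table, filled in by the application team."""
    name: str
    app_has_dr: bool = True            # False for non-disaster-recovery applications
    is_temp_backup: bool = False       # temporary backup made by a DBA during a change
    holds_env_params: bool = False     # IP, domain name, account, AK/SK that differ per cloud
    holds_center_switch: bool = False  # per-center switches, e.g. for scheduled tasks
    written_by_standby: bool = False   # standby writes it without business traffic


def filter_reasons(table: TableInfo) -> list[str]:
    """Return the reasons a table is a filter candidate; an empty list means synchronize."""
    reasons = []
    if not table.app_has_dr:
        reasons.append("non-disaster-recovery application")
    if table.is_temp_backup:
        reasons.append("temporary O&M backup")
    if table.holds_env_params:
        reasons.append("environment parameters differ between clouds")
    if table.holds_center_switch:
        reasons.append("function switch configured per center")
    if table.written_by_standby:
        reasons.append("written by the standby center")
    return reasons
```

A reason is a prompt for review, not a verdict. The case at the end of this article involves a table that the standby center writes at startup and that still had to be synchronized.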

How to sort out the data tables that are not synchronized across clouds?

In projects, application developers tend to focus on implementing business logic, and their understanding of best practices for cloud products and disaster recovery capabilities may differ from what the platform side expects. Sorting out the filter tables is mainly done by the application developers, and the following questions commonly come up in the process.

  • During design and development, what should developers do to reduce the number of unsynchronized filter tables needed for disaster recovery?
  • During deployment and operation and maintenance, from which perspectives should operation and maintenance personnel verify that the filter table list is complete and correct?

If the sorting is wrong, what impact will it have on the application-layer disaster recovery drill?

Projects are often constrained by the construction schedule and by the need to keep the production system stable. Even if application developers and cloud platform vendors know the best practices for design and development, it is difficult to complete the transformation in the limited time. During deployment and operation and maintenance, sorting out the filter tables and preparing emergency plans are therefore key work items of the disaster recovery drill.

Let's analyze what impact an incorrectly sorted filter table list may have on application-layer disaster recovery.

Impact on non-disaster recovery applications:

  • Almost no effect. As analyzed above, it is recommended that non-disaster-recovery applications either skip data backup or, if data is backed up, that the standby center applications not use it for production purposes.

Impact on disaster recovery applications:

  • After the application is deployed in the standby center, it fails to start, and the wrong environmental parameters can be identified at this point. The countermeasure is to stop synchronizing the corresponding data table, correct the read and write permissions, and then continue the deployment.
  • When functions are tested on the standby center application, focus on scenarios where background scheduled tasks and non-business requests write to RDS MySQL, and correct the filter table list during the test phase.
  • A disaster recovery switchover drill is performed while the production system is running. In a remote disaster recovery architecture, an incorrect filter table list may cause primary key conflicts on database writes, so that business writes fail. Recovery options at that point include executing the emergency plan, stopping the drill urgently, adding the table to synchronization, or modifying the field values in the data table and restarting the application. Correct the filter table list before the next drill. This article briefly illustrates this scenario with a case.

4. Designing unsynchronized data tables in application disaster recovery

The previous section explained why some tables are not synchronized in application disaster recovery. This section discusses how to sort out and configure filter tables. The analysis below describes the ideal situation; actual projects will differ in places.

Cloud platform perspective

  • Understand the capabilities of the cloud platform: Mainstream cloud platform vendors all offer RDS MySQL products, but the disaster recovery capabilities of each vendor's RDS MySQL across availability zones in the same city and across regions differ. Users need to understand how each vendor synchronizes data in the same-city and remote cases: automatically in the background, through a tool (such as Alibaba Cloud DTS), or manually through scripts.
  • How to configure filter tables: When a synchronization link between RDS MySQL instances is created, Alibaba Cloud DTS supports configuring which databases and data tables are not synchronized.
  • Automatic configuration of filter tables: A disaster recovery drill involves switching from primary to standby and from standby back to primary, so the data synchronization direction is reversed; the two directions are called forward synchronization and reverse synchronization. When the direction is reversed, the disaster recovery switching platform needs to configure the filter tables automatically. Alibaba Cloud ASR-DR saves the filter table list when a synchronization link is first created and automatically applies it to the new link each time the synchronization direction is switched.
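
The idea of carrying the saved filter list across a direction switch can be modeled in a few lines. This is a conceptual sketch only: `SyncLink` and its fields are hypothetical and do not correspond to the DTS or ASR-DR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncLink:
    """Hypothetical model of one synchronization link; not the DTS or ASR-DR API."""
    source: str
    target: str
    filter_tables: frozenset[str]  # tables excluded from synchronization

    def reverse(self) -> "SyncLink":
        """Swap the direction and carry the saved filter list over unchanged."""
        return SyncLink(self.target, self.source, self.filter_tables)

    def should_sync(self, table: str) -> bool:
        """A table is synchronized unless it is on the filter list."""
        return table not in self.filter_tables
```

For example, reversing `SyncLink("cloud_a", "cloud_b", frozenset({"env_config"}))` yields a link from `cloud_b` to `cloud_a` that still excludes `env_config`, so nobody has to re-enter the list during a switchover.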

The following is taken from the public documentation of the Alibaba Cloud Data Transmission Service (DTS) product.

Application layer perspective

Next, following the stages that application developers work through, we analyze how to deliver application software effectively on top of cloud disaster recovery.

1. Design stage:

  • Design with cloud disaster recovery in mind. Assume that the application will later be deployed on two or more clouds, possibly on cloud platforms of different vendors. The data-layer synchronization that dedicated storage hardware performed in earlier disaster recovery architectures based on IOE (IBM, Oracle, EMC) will not be applicable in multi-cloud scenarios, and the expensive Oracle license is also unacceptable for many enterprises.
  • Reserve an identification parameter for each cloud and each center that indicates which cloud a configuration applies to. The configuration center manages in one place which cloud's parameters take effect in the current operating environment, so the application code does not need to know which cloud it is running on.
  • Identify the functions that may run on only one of the clouds and plan switches for them. Manage the switches through the configuration center so that they can be configured, and take effect, dynamically. Pay particular attention to scheduled tasks.
  • Expose the operation of these function switches in a graphical console, so that operation and maintenance personnel can act quickly in the limited, urgent time of a disaster recovery switchover, instead of phoning around to ask which database, which table, and which field controls the switch for a given scheduled task.
  • Record the list of filter tables and keep it up to date.
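
The identification parameter from the design stage can be illustrated with a lookup keyed by cloud. The sketch below is an assumption-laden example: the `CONFIG` layout, the host names, and the `CLOUD_ID` environment variable are all hypothetical.

```python
import os

# Hypothetical layout: one parameter set per cloud, keyed by an identifier.
CONFIG = {
    "cloud_a": {"rds_host": "rds.cloud-a.example", "oss_endpoint": "oss.cloud-a.example"},
    "cloud_b": {"rds_host": "rds.cloud-b.example", "oss_endpoint": "oss.cloud-b.example"},
}


def resolve_config(config: dict, cloud_id: str) -> dict:
    """Return the parameter set for the given cloud, failing loudly if it is missing."""
    try:
        return config[cloud_id]
    except KeyError:
        raise KeyError(f"no configuration recorded for cloud {cloud_id!r}") from None


def current_cloud_id(default: str = "cloud_a") -> str:
    """Read the identifier from a hypothetical CLOUD_ID environment variable."""
    return os.environ.get("CLOUD_ID", default)
```

With this layout the stored configuration can be identical in both centers, and only the identifier differs per environment, which may reduce the number of tables that have to be filtered.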

2. Development stage:

  • Prefer the configuration center for storing parameters. In actual projects, configuration is stored in many ways, including configuration centers, configuration files, RDS MySQL, and even addresses, accounts, and passwords written directly in the code. Alibaba Cloud EDAS provides a configuration center that supports dynamic configuration, static configuration, and dynamic push after configuration changes, so that changes take effect without restarting the application.
  • The address of the configuration center itself can be recorded in the application's configuration file, and the configuration file can be packaged and released with the application, because the configuration center service rarely changes after deployment.
  • If the configuration center cannot be used for now and RDS MySQL must manage the configuration, put the parameters that differ between cloud environments in a separate data table, put the function switches in another separate data table, and do not couple either with business tables. This reduces the difficulty of managing filter tables. Pay particular attention to the domain names, IP addresses, accounts and passwords, and AK/SK of cloud products.
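
A dedicated switch table for scheduled tasks might look like the sketch below. It uses an in-memory SQLite database from the Python standard library as a stand-in for RDS MySQL, and the table name `task_switch`, its columns, and the task name in the usage note are hypothetical.

```python
import sqlite3


def create_switch_table(conn: sqlite3.Connection) -> None:
    """Create a dedicated switch table, kept apart from business tables."""
    conn.execute(
        "CREATE TABLE task_switch ("
        " task_name TEXT PRIMARY KEY,"
        " enabled INTEGER NOT NULL)"
    )


def task_enabled(conn: sqlite3.Connection, task_name: str) -> bool:
    """Return True only if the switch exists and is on; unknown tasks stay off."""
    row = conn.execute(
        "SELECT enabled FROM task_switch WHERE task_name = ?", (task_name,)
    ).fetchone()
    return bool(row and row[0])


def run_if_enabled(conn: sqlite3.Connection, task_name: str, job) -> bool:
    """Run `job` only when this center's switch allows it; report whether it ran."""
    if not task_enabled(conn, task_name):
        return False
    job()
    return True
```

Because the switch table is a filter table, each center keeps its own copy with its own values: a task such as `settle_batch` can be enabled on the primary center and disabled on the standby center, and the scheduler in each center calls `run_if_enabled` against its local database.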

3. Deployment stage:

  • Operation and maintenance personnel and developers confirm together why each filter table was selected and what the business basis is. Focus on whether too many filter tables have been configured.
  • Log in to each database and check whether the disaster recovery switching platform ASR-DR has set the filter tables as expected. With hundreds of filter tables, omissions and errors are likely.
  • Create the conditions to verify business functions on the standby center in advance. Focus on whether the filter table scenarios behave as expected, and check that scheduled tasks run on only one center.
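
With hundreds of filter tables, the comparison between the agreed list and the configured list is easier to script than to eyeball. The sketch below assumes both lists are already available as sets of table names; how the configured list is exported from the platform is outside its scope, and the function name is hypothetical.

```python
def diff_filter_tables(expected: set[str], configured: set[str]) -> dict[str, list[str]]:
    """Compare the agreed filter list with what is configured on the platform."""
    return {
        "missing": sorted(expected - configured),     # agreed but not configured
        "unexpected": sorted(configured - expected),  # configured without an agreed reason
    }
```

An entry under `unexpected` deserves the most attention, because a table that is filtered without a documented business reason silently stops being synchronized.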

4. Operation and maintenance stage:

  • Configuration changes must be applied to the filter tables on both clouds at the same time. When a filter table is changed on the primary center, for example by adding a field or adjusting a field type, the standby center is not aware of the change, and the same modification has to be made manually on the standby center. Otherwise, after a disaster recovery switchover to the standby center, the application will report errors because the table was not updated.
  • A filter table reverts to a synchronization table. The early filter table list was wrong and contained too many tables, and later verification shows that a table needs to be synchronized. The full data of the table must be resynchronized, and the flag that records whether the table is synchronized must be changed on the disaster recovery management platform ASR-DR.
  • A synchronization table is changed to a filter table. A table that was synchronized earlier no longer needs to be synchronized because the business has been adjusted. The flag that records whether the table is synchronized must be changed on the disaster recovery management platform ASR-DR.
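
Since filter tables are not synchronized, structural changes on one center can drift from the other unnoticed. A minimal drift check, assuming the column definitions of one table have already been read from each center (in MySQL they could come from `information_schema.COLUMNS`), might look like this; the function name and input format are hypothetical.

```python
def schema_drift(primary: dict[str, str], standby: dict[str, str]) -> dict[str, tuple]:
    """Return the columns whose definition differs between the two centers.

    Each argument maps column name to column type for one filter table.
    A missing column is reported as None on the side where it is absent.
    """
    drift = {}
    for column in sorted(primary.keys() | standby.keys()):
        left, right = primary.get(column), standby.get(column)
        if left != right:
            drift[column] = (left, right)
    return drift
```

Running such a check for every filter table before a drill gives an early warning of the application errors described in the first point above.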

The following figure shows the configuration logic of the synchronization table and the filter table under the remote disaster recovery active-standby architecture.

5. Case

The following analyzes a remote disaster recovery project in which an error in the filter table list caused a business exception, together with how it was handled, so that readers get a concrete sense of whether a data table needs to be synchronized across clouds.

(1) Problem description

After the Alibaba Cloud disaster recovery platform ASR-DR performed a disaster recovery switchover for an application (RDS MySQL read and write permissions were switched from Cloud A to Cloud B), business requests handled in the standby center (Cloud B) failed and the database reported a "primary key conflict".

(2) Problem analysis

We walk through the troubleshooting in the order in which the problem was handled.

1. Analyze the "primary key conflict" error reported by the database:

  • Confirm that the conflicting field value is a transaction serial number ID. Check the business data table to confirm that a transaction record with this ID already exists.

2. Analyze the business request path:

  • Analyze whether abnormal traffic scheduling at the access layer caused double writes. In the active-standby architecture of remote disaster recovery, the global load balancing device (GSLB) at the access layer ensures that only the primary center receives business request traffic and the standby center receives none. A primary key conflict caused by both centers writing business data can therefore be ruled out.
  • Analyze whether the application-layer cache of the primary center wrote data late, after the primary/standby switchover. In the active-standby architecture, ASR-DR first sets the RDS MySQL database of the primary center to read-only and then opens RDS MySQL read and write permissions to the applications of the standby center. Even if the application layer of the primary center has delayed cache writes, the primary center application has no write permission after the switchover, so no double write can occur. This suspicion is ruled out.
  • Analyze whether the serial number was used before the switchover. Log in to the primary center database: the currently available range of the serial number field is [90000000000, 18446744073709551615], which means that serial numbers below 90000000000 have been used. The serial number 80000000000 that triggers the primary key conflict already has a corresponding transaction record in the business table. This confirms that the serial number was used in the primary center.
  • Analyze how the standby center application obtained its serial numbers. The application log shows that the standby center application obtained serial numbers once, when it first started, and never fetched new ones from the database afterwards. The application's in-memory values show that the standby center is currently using the serial number range [80000000000, 80000009999]. This is clearly an expired range.

Conclusion: The standby center application uses an expired transaction serial number ID, which causes a "primary key conflict" when writing to the database.
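
The comparison behind this conclusion is a simple range check, sketched below with the values from the case. The function name `stale_ids` is hypothetical; the logic assumes, as the analysis does, that every serial number below the primary center's next available value has already been used.

```python
def stale_ids(cached: range, primary_next_free: int) -> range:
    """Return the part of a cached batch that the primary center has already used.

    Every serial number below `primary_next_free` counts as used, so any cached
    number under that bound can collide with an existing primary key.
    """
    stop = min(cached.stop, primary_next_free)
    return range(cached.start, max(cached.start, stop))


# Values from the case: the standby application holds [80000000000, 80000009999]
# in memory, while the primary center's available range starts at 90000000000.
cached_on_standby = range(80000000000, 80000010000)
expired = stale_ids(cached_on_standby, 90000000000)
```

Here `expired` covers the entire cached batch, so every ID the standby application hands out lies in the range that the primary center has already used and is at risk of a primary key conflict.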

3. Analyze how the problem was introduced:

  • Analyze how the application obtains serial numbers: When the application starts for the first time, it obtains 10,000 available serial numbers from the database and updates both the value in the database and the value in application memory.
  • Analyze the data synchronization mechanism between the primary and standby centers: For the data table xx_table, which manages the globally unique serial number, the data synchronization tool DTS keeps the data synchronized between the two centers in real time, and locks prevent inconsistencies. In theory, the primary and standby centers will not obtain the same serial number.
  • Check whether the contents of xx_table are consistent on the primary and standby centers: The available range of serial numbers is [90000000000, 18446744073709551615] on the primary center and [80000010000, 18446744073709551615] on the standby center. The two differ, which shows that the data table is not synchronized.
  • Check the data synchronization tool DTS: It works normally, and no errors or faults are found.
  • Check the filter table list: The data table xx_table that manages the globally unique serial number should be synchronized across clouds, but it is configured as a filter table, so its data is not synchronized.
  • Check how the filter table list was sorted out: In the preparation stage before the disaster recovery drill, after operation and maintenance personnel deployed the application in the standby center, the business personnel's functional transaction verification failed. The cause was that the application, after obtaining serial numbers at startup, failed to write to the database with a no-write-permission error, so the initialization of the transaction function failed. In the active-standby architecture, by default the primary center application has read and write permissions on RDS MySQL and the standby center has read-only permission. Because the standby center needed write permission at startup, the business personnel added xx_table, which manages the globally unique serial number, to the list of unsynchronized filter tables, and as a result the table was no longer synchronized from the primary center to the standby center.

Conclusion: The data table xx_table that manages the globally unique serial number was mistakenly added to the list of filter tables that are not synchronized across clouds.

(3) Emergency measures

  • Manually correct the valid range of serial numbers in xx_table on the standby center to the correct value, [90000000000, 18446744073709551615].
  • Restart the application on the standby center so that it obtains serial numbers again.

(4) Improvement measures

  • Synchronize data: The data table xx_table that manages globally unique serial numbers needs to be synchronized. Remove xx_table from the filter table list so that the valid serial number ranges of the primary and standby centers stay consistent.
  • Application modification: While the standby center has read-only permission on RDS MySQL, the application tolerates the failure to update the serial number and still initializes successfully. After the disaster recovery switchover, once the standby center has RDS MySQL read and write permissions, a service request triggers the acquisition of the latest serial numbers on demand.
  • Verify the effect:
    • After data synchronization between the primary center and the standby center is complete, disconnect the synchronization link and manually set the standby center database to read-only.
    • Redeploy the modified application. In read-only mode, verify that the application starts successfully and that business requests fail (as expected).
    • Manually set the standby center database to read-write, confirm that service requests succeed, and check that the application has obtained a valid serial number range again.
    • Reconfigure the data synchronization link between the primary center and the standby center.
  • Disaster recovery drill: Conduct the drill again to verify the full business scenario.
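
The application modification can be sketched as a sequence client that tolerates a read-only database at startup and fetches a batch on demand. This is a simplified, single-threaded illustration rather than the project's code: `RangeStore` stands in for xx_table, and `ReadOnlyError` and `LazySequence` are hypothetical names.

```python
class ReadOnlyError(Exception):
    """Raised by the hypothetical store when the database rejects a write."""


class RangeStore:
    """Stand-in for xx_table: holds the next free serial number."""

    def __init__(self, next_free: int, read_only: bool = False):
        self.next_free = next_free
        self.read_only = read_only

    def reserve(self, count: int) -> range:
        if self.read_only:
            raise ReadOnlyError("database is read-only")
        start = self.next_free
        self.next_free += count
        return range(start, start + count)


class LazySequence:
    """Starts without a batch if the database is read-only and fetches one on demand."""

    def __init__(self, store: RangeStore, batch_size: int = 10_000):
        self._store = store
        self._batch_size = batch_size
        self._batch = iter(())
        try:
            self._refill()
        except ReadOnlyError:
            pass  # tolerated at startup: the standby center has no write permission yet

    def _refill(self) -> None:
        self._batch = iter(self._store.reserve(self._batch_size))

    def next_id(self) -> int:
        try:
            return next(self._batch)
        except StopIteration:
            self._refill()  # raises ReadOnlyError while the database is still read-only
            return next(self._batch)
```

This mirrors the verification steps above: with `read_only=True` the object is created successfully but `next_id()` raises `ReadOnlyError`; once the store is writable, the first request reserves a fresh batch from the synchronized counter.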

Before improvement

After improvement

6. Summary

  • Disaster recovery drills are the starting point for discovering systemic problems, not the end point. Regular drills are required to preserve the disaster recovery capabilities of the system.
  • Cloud platform disaster recovery does not mean application disaster recovery. Application-level disaster recovery is a systems engineering effort.
  • Drills test engineering capabilities: on the technical side these include cloud platforms, applications, and networks; on the process side they include fault judgment, disaster recovery decision-making, switchover operations, and emergency plans.

This article is original content of Alibaba Cloud and may not be reproduced without permission.
