Rock Solid: TiDB Point-in-Time Recovery (PiTR) Feature Optimization | 6.5 New Feature Analysis

This article introduces the point-in-time recovery (PiTR) feature of TiDB, which lets users restore a database to a specific point in time and thereby avoid losing important data. It first explains the basic concept and working principle of PiTR, then discusses how TiDB has optimized PiTR in terms of technical indicators, stability, and performance. Finally, it looks ahead to future improvements to TiDB PiTR and to further possibilities for backup and recovery.

Introduction to Point-in-Time Recovery (PiTR) Technology

For database products, point-in-time recovery is an essential basic capability. It lets users restore the database to a specific point in time as needed, which protects customers' databases against accidental damage and operator error. For example, if data is accidentally deleted or damaged after a certain point in time, you can use PiTR to restore the database to its state before that point and avoid losing important data.

In TiDB, every data change generates a corresponding distributed change log, which records the details of the change, including the transaction ID, the timestamp, and the content of the change.

When the user enables PiTR, TiDB periodically saves the distributed change logs to external storage (for example, AWS S3, Azure Blob Storage, or NFS). If data is accidentally deleted or damaged after a certain point in time, you can use the BR tool to restore a previous database backup and then apply the change logs saved on external storage up to the user-specified point in time, which achieves point-in-time recovery.

1.png

The diagram above shows the architecture of the PiTR feature. When the user starts a log backup, the BR tool registers a backup task with PD. A TiDB node is selected as the log backup coordinator and interacts with PD regularly to calculate the global backup checkpoint ts. Meanwhile, each TiKV node periodically reports its local backup task status to PD and sends its data change logs to the designated external storage.

For recovery, when the user issues a point-in-time recovery command, the BR tool reads the backup metadata and notifies all TiKV nodes to start the recovery work. The restore worker on each TiKV node reads the change logs recorded before the specified point in time and applies them to the cluster, which yields the TiDB cluster as it was at that point in time.

How the PiTR feature works

Next, let's take a closer look at how the log backup and recovery process works.

The flow chart below illustrates the main working mechanism of log backup:

2.png

The main interaction process is as follows:

1. BR receives the backup command br log start

BR parses the start time and the backup storage address of the log backup task, and registers the task with PD.

2. TiKV regularly monitors new and updated log backup tasks

The log backup observer on each TiKV node watches PD for log backup tasks that are created or updated, and then backs up the change logs on that node that fall within the backup time range.

3. The TiKV node backs up the KV change log and reports the local backup progress to TiDB

The observer service on the TiKV node continuously backs up the KV change log, combines it with the global-checkpoint-ts queried from PD to generate backup metadata, and regularly uploads the log backup data and metadata to the storage. At the same time, the observer service prevents MVCC data that has not yet been backed up from being reclaimed.

4. The TiDB node calculates and persists the global backup progress.

The TiDB coordinator node polls all TiKV nodes to obtain the backup progress of each Region, calculates the overall log backup progress from these results, and then reports it to PD.
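Step 4 can be illustrated with a minimal sketch. It assumes that the overall progress is the smallest checkpoint reported by any Region, since only changes up to that timestamp are known to be backed up everywhere. The article does not spell out the formula, and the function name and sample values below are hypothetical, not TiDB code.

```python
from typing import Dict


def global_checkpoint_ts(region_checkpoints: Dict[int, int]) -> int:
    """Return the global checkpoint, assumed to be the minimum Region checkpoint.

    region_checkpoints maps a Region ID to the timestamp up to which that
    Region's change log has been backed up (hypothetical structure).
    """
    if not region_checkpoints:
        raise ValueError("no Region has reported backup progress yet")
    return min(region_checkpoints.values())


# Hypothetical progress reported for three Regions.
progress = {1: 1050, 2: 1020, 3: 1100}
print(global_checkpoint_ts(progress))  # 1020: the slowest Region bounds the result
```

Under this assumption, a single lagging Region holds back the global checkpoint, which is why the coordinator needs the progress of every Region rather than a per-node average.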

For the recovery process, the flow chart below illustrates the working mechanism:

3.png

When the user issues the "br restore" command, the BR tool validates the storage addresses of the full backup and the log backup, the point in time to restore to, the database objects to restore, and so on, and starts the restoration only after confirming that this information is valid. BR first restores the full backup. It then reads the existing log backup data, works out which log backup data needs to be restored, and queries PD for the Regions and KV ranges involved. Based on that information, it creates restore-log requests and sends them to the corresponding TiKV nodes. After receiving a restore request, a TiKV node starts a restore worker, downloads the corresponding backup data from the backup storage to local disk, and applies the required data changes to the corresponding Region. When recovery is complete, the result is returned to the BR tool.
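The step in which BR works out which log backup data needs to be restored can be sketched as a timestamp-range filter. This is only an illustration under assumptions: the LogFile record, its fields, and the selection rule (keep files that overlap the interval after the full backup and up to the restore target) are hypothetical and are not BR's actual metadata format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogFile:
    """Hypothetical metadata for one log backup file."""
    name: str
    min_ts: int  # smallest commit timestamp contained in the file
    max_ts: int  # largest commit timestamp contained in the file


def files_to_restore(files: List[LogFile], full_backup_ts: int,
                     restored_ts: int) -> List[LogFile]:
    """Select files whose range overlaps (full_backup_ts, restored_ts]."""
    if restored_ts < full_backup_ts:
        raise ValueError("restore target is earlier than the full backup")
    return [
        f for f in files
        if f.max_ts > full_backup_ts and f.min_ts <= restored_ts
    ]


files = [
    LogFile("a.log", 0, 100),
    LogFile("b.log", 101, 200),
    LogFile("c.log", 201, 300),
    LogFile("d.log", 301, 400),
]
selected = files_to_restore(files, full_backup_ts=150, restored_ts=250)
print([f.name for f in selected])  # ['b.log', 'c.log']
```

Files that only partly overlap the interval are still selected here. Individual changes outside the interval would then have to be skipped when the file is applied.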

TiDB's optimization of PiTR

As the working mechanism above shows, both log backup and recovery are relatively complex processes. Since the release of PiTR, TiDB has therefore kept optimizing this feature and continuously improving its technical indicators, stability, and performance.

For example, in the initial version, log backup generated a large number of small files, which caused many problems for users. In the latest version, log backup files are aggregated into files of at least 128 MB each, which solves this problem well.
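The aggregation idea can be sketched as a greedy batching pass over file sizes. This is a simplified illustration, not TiKV's implementation: the function name is hypothetical, and folding the trailing remainder into the previous batch is an assumption made here so that every batch except a lone undersized one meets the 128 MB minimum.

```python
from typing import List

MIN_BATCH_BYTES = 128 * 1024 * 1024  # the 128 MB lower bound cited in the article


def batch_files(sizes: List[int],
                min_batch: int = MIN_BATCH_BYTES) -> List[List[int]]:
    """Group small file sizes (bytes) into batches of at least min_batch bytes."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes:
        current.append(size)
        current_size += size
        if current_size >= min_batch:
            # The batch is large enough to be written as one aggregated file.
            batches.append(current)
            current, current_size = [], 0
    if current:
        if batches:
            # Fold the undersized remainder into the last full batch.
            batches[-1].extend(current)
        else:
            # Not enough data overall; emit what there is.
            batches.append(current)
    return batches


MB = 1024 * 1024
print([len(b) for b in batch_files([50 * MB] * 7)])  # [3, 4]
```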

For a large-scale TiDB cluster, a full backup often takes a long time to run. Without support for resuming from a breakpoint, an abnormal condition that interrupts the backup task partway through is extremely frustrating for users. In version 6.5.0, we added the ability to resume an interrupted backup and optimized backup performance. Currently, the data backup throughput of a single TiKV node can reach 100 MB/s, and the performance impact of log backup on the source cluster can be kept within 5%. These optimizations have greatly improved both the user experience and the success rate of large-scale cluster backups.

Because users usually regard backup and recovery as the last line of defense for data security, many of them also care about the RPO and RTO of PiTR. We have made many optimizations to the stability of PiTR as well, including:

  • By optimizing the communication mechanism between BR, PD, and TiKV, PiTR can keep the RPO below 5 minutes in most TiDB cluster failure scenarios and in TiKV rolling restart scenarios.
  • By optimizing recovery performance, PiTR can reach 30 GB/h in the log apply stage, which reduces the RTO.

For more backup and recovery performance indicators, please refer to the "TiDB Backup and Recovery Overview" document.

Future Plans

Going forward, we will keep optimizing the PiTR feature and continuously improve its stability and performance. We will also explore more possibilities for backup and recovery, with the goal of making TiDB's backup and recovery features a stable, reliable, high-performance solution.

Origin blog.csdn.net/TiDB_PingCAP/article/details/129319719