How to Improve MySQL Backup Validation Performance by 10X

JuiceFS is well suited for MySQL physical backup; please refer to our official documentation for details. Recently, a customer reported during testing that the data preparation (xtrabackup --prepare) step of backup verification was very slow. Using the performance analysis tools provided by JuiceFS, we quickly found the performance bottlenecks, and by continuously adjusting the XtraBackup parameters and the JuiceFS mount options, we reduced the time to 1/10 of the original within an hour. This article shares our performance analysis and optimization process as a reference for analyzing and optimizing IO performance.

Data preparation

We use the SysBench tool to generate a single-table database of about 11GiB, with the table partitioned into 10 partitions. To simulate a normal database read/write scenario, the database is accessed through SysBench at a load of 50 requests per second; under this load, the database writes 8~10MiB/s to its data disk. We back up the database to JuiceFS with the following command.

# xtrabackup --backup --target-dir=/jfs/base/
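For reference, the SysBench data set and background load described above can be generated with commands roughly like the following. This is only a sketch: the table size, connection options, and request rate are placeholders, and the 10-partition table layout is applied to the table definition separately.

# sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-user=root --mysql-db=sbtest --tables=1 --table-size=50000000 prepare   # generate one large table (table size is a placeholder, ~11GiB here)
# sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-user=root --mysql-db=sbtest --tables=1 --table-size=50000000 --rate=50 --time=0 run   # apply about 50 requests per second while the backup runs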

To ensure that each data preparation run starts from exactly the same data, we use the JuiceFS snapshot feature to create a snapshot /jfs/base_snapshot/ of the /jfs/base directory. Before each run, the data left by the previous data preparation run is deleted and a new snapshot is generated.
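A rough sketch of this reset step is shown below. The snapshot subcommand here is an assumption based on the JuiceFS commercial edition's CLI; verify the exact command against the documentation of the edition you are running.

# rm -rf /jfs/base_snapshot   # drop the results of the previous prepare run
# ./juicefs snapshot /jfs/base /jfs/base_snapshot   # assumed snapshot subcommand; check your JuiceFS edition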

Use default parameters

# ./juicefs mount volume-demoz /jfs

# time xtrabackup --prepare --apply-log-only --target-dir=/jfs/base_snapshot

The total execution time is 62 seconds.

JuiceFS supports exporting its operation log (oplog) and visualizing it. Before executing the xtrabackup --prepare operation, we open a new terminal, connect to the server, and run the following on the command line

# cat /jfs/.oplog > oplog.txt

to start collecting the oplog, and then perform the xtrabackup --prepare operation. After the operation is complete, download oplog.txt and upload it to the oplog analysis page provided by JuiceFS: https://juicefs.com/oplog/.
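In practice the capture can also be wrapped around the prepare step in a single shell session, roughly as follows (just one way to start and stop the collection):

# cat /jfs/.oplog > oplog.txt &   # start collecting the oplog in the background
# time xtrabackup --prepare --apply-log-only --target-dir=/jfs/base_snapshot
# kill %1   # stop the collection once the prepare has finished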

The visualized oplog is shown in the figure below.

Here is a brief introduction to the elements of this figure. Each oplog entry contains a timestamp, thread ID, file system operation (read, write, fsync, flush, etc.), operation duration, and so on. The numbers on the left are thread IDs, the horizontal axis is time, and different types of operations are drawn in different colors.

Zooming in on part of the image, the different colors for different operation types make them clear at a glance.

Excluding a few threads unrelated to this operation, the data preparation process uses 4 threads for reading and 5 threads for writing, with the reads and writes overlapping in time.

Increase XtraBackup's memory buffer

According to the official XtraBackup documentation, data preparation performs crash recovery on the backup data set using the embedded InnoDB.

Use the --use-memory option to increase the memory buffer size of the embedded InnoDB; the default is 100MB, and we increased it to 4GB.

# time xtrabackup --prepare --use-memory=4G --apply-log-only --target-dir=/jfs/base_snapshot

The execution time dropped to 33 seconds.

Now the reads and writes no longer overlap: the data is read into memory, processed, and then written to the file system.

Increase the number of XtraBackup read threads

Increasing the buffer cut the time in half, but the read phase is still time-consuming. We see that each read thread is nearly saturated, so we try adding more read threads.

# time xtrabackup --prepare --use-memory=4G --innodb-file-io-threads=16 --innodb-read-io-threads=16 --apply-log-only --target-dir=/jfs/base_snapshot

The execution time dropped to 23 seconds.

With the number of read threads increased to 16 (the default is 4), the read phase dropped to about 7 seconds.

JuiceFS enables asynchronous writes

In the previous step we greatly reduced the read time, so the time spent in the write phase now stands out. Analyzing the oplog shows that fsync cannot be parallelized across the write operations, so increasing the number of write threads does not improve write efficiency; we also verified this in practice by increasing the number of write threads, so we won't go into the details here. Analyzing the parameters (offset, write size) of the writes to the same file (same file descriptor), we found a large number of random writes. In this situation we can enable the --writeback option when mounting JuiceFS, so that data is written to the local disk first and then asynchronously uploaded to the object storage.

# ./juicefs mount --writeback volume-demoz /jfs
# time xtrabackup --prepare --use-memory=4G --innodb-file-io-threads=16 --innodb-read-io-threads=16 --apply-log-only --target-dir=/jfs/base_snapshot

The time dropped to 11.8 seconds.

The write process has dropped to around 1.5 seconds.

We see that the read threads' read operations are still relatively dense, so we keep increasing the number of read threads. The maximum number of InnoDB read threads is 64, and we set it directly to 64.

# time xtrabackup --prepare --use-memory=4G --innodb-file-io-threads=64 --innodb-read-io-threads=64 --apply-log-only --target-dir=/jfs/base_snapshot

The execution time is 11.2 seconds, which is basically the same as before.

The read operations of the read threads are now relatively sparse. There are presumably dependencies among the data read by the threads, which prevents full parallelism, so increasing the number of threads can no longer compress the read phase.

Increase the disk cache of JuiceFS

In the previous step we improved the read phase by increasing the number of read threads; the only way left to shorten it further is to reduce the latency of reading the data.

JuiceFS provides read-ahead and caching to accelerate reads. Next, we try to reduce read latency by enlarging the local cache of JuiceFS.

We change the local cache of JuiceFS from a high-efficiency cloud disk to an SSD cloud disk, and increase the cache size from 1GB to 10GB.

# ./juicefs mount --writeback volume-demoz --cache-size=10000 --cache-dir=/data/jfsCache /jfs

# time xtrabackup --prepare --use-memory=4G --innodb-file-io-threads=64 --innodb-read-io-threads=64 --apply-log-only --target-dir=/jfs/base_snapshot

The execution time dropped to 6.9 seconds.

Improving the cache disk performance and enlarging the cache space further reduces the time spent on reads.

Let's summarize at this point. By analyzing the oplog, we kept finding points that could be optimized and reduced the entire data preparation process step by step from 62 seconds to 6.9 seconds. The effect is shown more intuitively in the following figure.

Increase the amount of database data

The above optimizations were made against a relatively small data set of about 11GB, and continuous parameter tuning gave a good result. As a comparison, we generate a single-table database of about 115GB with 10 partitions in the same way, and perform the backup while SysBench applies 50 requests per second.

# time xtrabackup --prepare --use-memory=4G --innodb-file-io-threads=64 --innodb-read-io-threads=64 --apply-log-only --target-dir=/jfs/base_snapshot

This process took 74 seconds.

We see that reading and writing are still separate.

When the amount of data grows by roughly 10 times, the data preparation time also grows by roughly 10 times. This is because the backup (xtrabackup --backup) takes about 10 times as long, and the xtrabackup_logfile generated during the backup grows by roughly 10 times as well. Data preparation replays xtrabackup_logfile to merge all of the data updates into the data files, so the time to apply a single log record stays basically the same even though the data set is 10 times larger. This can also be seen in the figure above: after the data size increases, the preparation process is still split into two distinct phases, reading data and then writing data, which indicates that the 4GB buffer is still sufficient and the whole process can still be done in memory before being flushed to the file system.
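One simple way to confirm that the redo log grows with the background load is to compare the size of xtrabackup_logfile in the backup directory for the two data sets, for example (the path is the one used in this article):

# ls -lh /jfs/base/xtrabackup_logfile   # redo log captured during the backup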

Summary

We used SysBench, a relatively simple tool, to construct the initial data and to keep updating the database under a given load, simulating how the database operates while being backed up. We then used the JuiceFS oplog to observe the read and write characteristics of XtraBackup as it accesses the backup data during data preparation, and adjusted the parameters of XtraBackup and JuiceFS to continuously improve the efficiency of the data preparation process.

In real production scenarios the situation is much more complicated than our SysBench simulation, and the linear relationship above does not necessarily hold strictly, but the approach is generic: analyze the oplog to quickly find the points that can be optimized, then keep tuning the caching and concurrency parameters of XtraBackup and JuiceFS.

The entire tuning process took about one hour. The oplog analysis tool played a major role in it, helping us quickly locate the system's performance bottlenecks so that we could adjust parameters in a targeted way. We hope this oplog analysis feature can also help you quickly locate and analyze the performance problems you encounter.

If it is helpful, please follow our project Juicedata/JuiceFS ! (0ᴗ0✿)
