How Transparent Compression Technology Alleviates Write Amplification in Databases

  The authors are Zheng Ning, Wang Huan and Xu Shukun of the ScaleFlux system team.

1

Write amplification in databases

When using a database (or any other application), we usually pay attention to system metrics such as CPU usage, memory usage, and IO bandwidth consumption. These metrics help us assess how much of the system's resources the application occupies, and point to directions for further optimization.

Common databases such as MySQL, MariaDB, PostgreSQL, and RocksDB can be divided into two categories by design, as shown in Figure 1 [1]:

  • B/B+ tree based structures - records in the database are managed and organized through the nodes of a B/B+ tree. As records are added or removed, nodes in the tree are split and merged accordingly.

  • Log-structured merge (LSM) tree based structures - records are stored across multiple levels, and levels are merged to accommodate the growth and shrinkage of the data set while maintaining the shape of the tree.

Figure 1. (a) B+ tree structure and (b) LSM tree structure

When a database is running, the write bandwidth it consumes is usually much larger than what the TPS (transactions per second) visible to the upper layer would suggest. For example, with a record size of 1KB and a TPS of 1K, the ideal write bandwidth is 1KB * 1K/s = 1MB/s, but the actual bandwidth consumed can be tens of MB/s or even higher. This is because the database (and the software stack beneath it), in order to guarantee data safety and consistency as well as simplicity and efficiency of design, inevitably introduces write amplification.
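To make the gap concrete, the difference between the logical write rate and the bandwidth observed on the device can be expressed as a write amplification factor. A minimal sketch, using purely illustrative numbers (the observed bandwidth here is an assumption, not a measurement):

```python
# Back-of-the-envelope write amplification estimate.
# All numbers are illustrative assumptions, not measurements.

MB = 1024 * 1024

record_size = 1024                    # 1KB per record
tps = 1000                            # transactions per second
logical_bw = record_size * tps        # ideal write bandwidth: ~1MB/s

observed_bw = 40 * MB                 # assumed bandwidth actually seen on the device

waf = observed_bw / logical_bw
print(f"logical: {logical_bw / MB:.1f} MB/s, observed: {observed_bw / MB:.0f} MB/s, "
      f"write amplification ~ {waf:.0f}x")
```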

The most direct consequence of write amplification is a significant increase in write bandwidth consumption. Another is accelerated wear of the SSD. Constrained by the physics of NAND program/erase, an SSD always writes new data to a new location, and the total number of program/erase cycles of the NAND is limited. The more data is written, the sooner the NAND fills up, the more frequently erasure and garbage collection are triggered, and the faster the drive's lifetime is consumed.

Write amplification is especially unfavorable for QLC (4 bits per NAND cell), which is about to enter mass production, and for PLC (5 bits per NAND cell), which is already on the roadmap, because as the number of bits per NAND cell increases, the number of available program/erase cycles drops sharply. As shown in Figure 2, for Intel's D5-P4326 QLC series, the sequential-write P/E cycle rating is about 1600, while the random-write rating is only a little over 300 [2].

Figure 2. D5-P4326 QLC series reliability parameters
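A simple endurance estimate shows why this matters. Assuming perfect wear leveling, the NAND can absorb roughly capacity × P/E cycles of writes, and write amplification eats directly into the host-visible share of that budget (the write amplification factor below is an assumed figure, used only for illustration):

```python
# Rough SSD endurance estimate. Total NAND write budget = capacity * P/E cycles;
# write amplification divides the portion of that budget available to the host.
# The write amplification factor below is an assumption for illustration.

capacity_tb = 3.2      # drive capacity in TB
pe_cycles = 1600       # e.g. the sequential-write P/E rating cited above for QLC
waf = 4.0              # assumed end-to-end write amplification factor

nand_budget_tb = capacity_tb * pe_cycles      # ~5120 TB of raw NAND writes
host_budget_tb = nand_budget_tb / waf         # what the host can actually write

print(f"raw NAND budget: ~{nand_budget_tb:.0f} TB, "
      f"host-visible endurance at WAF={waf:g}: ~{host_budget_tb:.0f} TB")
```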

This article discusses the opportunity that transparent compression brings to the write amplification problem in databases. We first analyze the main sources of write amplification in some mainstream databases, then introduce the concept and implementation of transparent compression, and finally evaluate how much transparent compression reduces write amplification in practice.

2

Sources of write amplification

We chose MySQL (B+ tree based) and RocksDB (LSM tree based) as the subjects of study. MySQL, the most widely used relational database, has a huge installed base in China; RocksDB, a representative NoSQL engine, is being used in more and more scenarios. Studying these two databases lets us both evaluate the impact of transparent compression on write amplification in different types of databases and compare how write amplification behaves under different tree structures.

For MySQL, data writes fall into four main parts, as shown in Figure 3:

  • Binlog - the logical log of the MySQL server, which records all changes made to the MySQL database.

  • Redolog - the physical log of the InnoDB storage engine, which records the values after data modification (regardless of whether the transaction has committed).

  • Doublewrite buffer - a "write ahead" mechanism introduced to prevent partial page writes.

  • Data pages - the main part of the database where records are actually stored.

Figure 3. The main components of data written in MySQL

Of these four parts, only the data pages are the storage we actually want to persist; the other parts are introduced for data integrity and safety, and they clearly add write amplification. Even the data pages themselves suffer from write amplification: in the default configuration a MySQL data page is 16KB while a record may be only a few hundred bytes, and updating a single record requires rewriting the entire 16KB page, so the degree of amplification is easy to imagine.
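The page-level effect alone can be put into numbers. A toy calculation, assuming each update dirties one 16KB page that is later flushed in full (real engines amortize this by batching several updates into one flush):

```python
# Toy model of page-level write amplification in a B+ tree engine.
# Assumes one full 16KB page flush per updated record; real workloads
# usually batch several updates per page flush, so this is an upper bound.

page_size = 16 * 1024     # default InnoDB page size
record_size = 300         # a few hundred bytes per record (assumed)

page_waf = page_size / record_size
print(f"rewriting a {page_size}B page for a {record_size}B update "
      f"-> ~{page_waf:.0f}x amplification, before any logs are counted")
```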

In addition to the data generated by the application itself, MySQL generally runs on top of a file system. When data has to be updated synchronously, the file system's own journal and metadata are updated as well. This traffic is invisible to the application layer, but it often accounts for a considerable share of the write traffic (for example, updating a 4KB page in a file may trigger a 12KB journal update), which amplifies writes further.

In RocksDB, apart from the write amplification introduced by the WAL (which also involves file-system journal and metadata writes), internal writes are mostly large sequential blocks, so write amplification is relatively small as long as no compaction occurs. However, due to the structure of the LSM tree, once the amount of data in a level reaches a threshold, a background compaction merges and sorts part of the data in two adjacent levels and rewrites it to the lower level (as shown in Figure 4 [3]), which introduces another layer of write amplification. As the amount of data grows, compactions become more frequent and the degree of write amplification increases accordingly.

Figure 4. Compaction in MyRocks
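A rough rule of thumb for leveled compaction (a simplification, not RocksDB's exact accounting) is that each byte is written once to the WAL, once when the memtable is flushed, and then rewritten roughly in proportion to the size ratio for each level it trickles through:

```python
# Rough estimate of leveled-compaction write amplification in an LSM tree.
# Simplified model: 1x for the WAL, 1x for the memtable flush to L0, plus,
# for each deeper level, a rewrite proportional to the level size ratio.
# This is a ballpark approximation, not RocksDB's exact accounting.

size_ratio = 10     # ratio between adjacent level sizes (a common default)
levels = 6          # levels the data eventually trickles down through

wal_and_flush = 1 + 1
compaction = levels * size_ratio / 2    # assume ~half the ratio is rewritten per level

print(f"estimated write amplification ~ {wal_and_flush + compaction:.0f}x")
```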

Before evaluating the write amplification of these two databases, let's first take a look at transparent compression.

3

Transparent compression

When we talk about compression, we generally think of CPU-based software compression, as shown in Figure 5(a). After data is compressed, the amount of data shrinks, so less storage space is needed; when the data is read back, it is first decompressed by the CPU before the application processes it further. Different compression algorithms trade off speed against compression ratio: lightweight algorithms such as LZ4 and Snappy compress and decompress quickly with low CPU usage but achieve a worse ratio, while heavier algorithms such as Zlib and Bzip2 are slower and consume more CPU but compress better.
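The trade-off is easy to observe with Python's built-in zlib on synthetic data (LZ4, Snappy and Bzip2 need third-party bindings, so only zlib levels are compared here; the absolute numbers will vary with machine and data):

```python
import time
import zlib

# Build a few MB of somewhat compressible, structured synthetic data.
data = b"".join(
    b"id=%06d,name=user_%06d,balance=%08d;" % (i, i, i * 7)
    for i in range(200_000)
)

for level in (1, 6, 9):                 # fast, default, best compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"zlib level {level}: ratio {ratio:.2f}:1, {elapsed * 1000:.0f} ms")
```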

However, regardless of how complex the algorithm is, CPU-based software compression consumes the system's CPU resources, and the application must integrate the corresponding compression/decompression module. With transparent compression, the CPU overhead is avoided, and compression and decompression are completely transparent to applications (the application is unaware that data is being compressed and decompressed), as shown in Figure 5(b).

Figure 5. (a) CPU-based software compression and (b) transparent compression

With transparent compression, when the application writes data to the drive, the drive itself compresses the data and then writes the compressed data to NAND; when the application reads data, the drive first reads the compressed data from NAND, decompresses it on the drive, and returns it to the application. Unlike software compression, transparently compressed data is not constrained to 4KB sector alignment when written to NAND, and its physical space is allocated on demand. This requires the transparently compressing drive to implement a variable-length mapping between logical block addresses (LBA) and physical block addresses (PBA) (a conventional drive uses a fixed-length mapping, i.e. a 4KB LBA corresponds to a 4KB PBA), as shown in Figure 6 [4].

Figure 6. Variable length mapping of LBA and PBA

The variable-length mapping table poses new challenges for drive design: on the one hand it must record, in real time, the correspondence between the logical and physical space before and after compression; on the other hand it must provide an efficient read/write access mechanism so the drive can still deliver full performance. But it is precisely this in-drive variable-length mapping that lets upper-layer applications use compression and decompression transparently, without consuming extra CPU, and with compression/decompression capacity that scales linearly.
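Conceptually, the mapping can be pictured as each 4KB LBA pointing at a byte offset and compressed length on NAND instead of a fixed 4KB physical block. The sketch below only illustrates the idea; a real FTL keeps this table in device memory with far more compact encodings and must also handle overwrites, garbage collection, and crash recovery:

```python
from dataclasses import dataclass

@dataclass
class PbaExtent:
    nand_offset: int      # byte offset on NAND where the compressed page starts
    length: int           # compressed size in bytes, no 4KB alignment required

class VarLenMappingTable:
    """Conceptual LBA -> PBA table for a drive with transparent compression."""

    def __init__(self):
        self.table = {}       # lba -> PbaExtent
        self.write_ptr = 0    # next free byte on NAND (append-only for simplicity)

    def write(self, lba, compressed_len):
        # Physical space is allocated on demand, sized to the compressed data.
        self.table[lba] = PbaExtent(self.write_ptr, compressed_len)
        self.write_ptr += compressed_len

    def lookup(self, lba):
        return self.table[lba]

# A 4KB logical page that compresses to 1.5KB only consumes 1.5KB of NAND.
ftl = VarLenMappingTable()
ftl.write(lba=0, compressed_len=1536)
ftl.write(lba=1, compressed_len=900)
print(ftl.lookup(1))          # PbaExtent(nand_offset=1536, length=900)
```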

4

Evaluating write amplification


We chose ScaleFlux's CSD 2000 (3.2TB) as the SSD under evaluation. The CSD 2000 is an SSD with built-in transparent compression; its compression strength is comparable to Gzip level 6 and its compression throughput exceeds 2GB/s [5]. We used Sysbench 1.1.0 as the benchmark, with three write workloads - insert, update-non-index, and update-index - to evaluate the write amplification of MySQL 5.7 and MyRocks (MySQL 8.0 + RocksDB 6.9) and the effect of transparent compression on reducing it.

The test machine has a 24-core 2.3GHz E5-2630 CPU, 128GB of memory, an Ext4 file system, and 24 benchmark threads. Sysbench first loads 200GB of data into the database and then runs the insert, update-non-index, and update-index workloads on top of it, recording in each case the amount of data written on the host side and the amount actually written to NAND (on an ordinary SSD these two are basically the same, but with transparent compression the NAND write volume is usually smaller than the host write volume). In the MySQL test, software compression is disabled, the page size is 16KB, and the buffer pool is 32GB; RocksDB inside MyRocks uses LZ4 software compression with 7 levels in total and otherwise default settings. In addition, every transaction is committed and synced in the test to guarantee data safety and consistency.
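For reference, the write amplification figures compared below are simply ratios of bytes written at each layer to the useful row data. A hypothetical calculation with placeholder counter values (not the actual benchmark readings) looks like this:

```python
# Hypothetical write amplification calculation. The host write counter would
# come from tools such as iostat, and the NAND write counter from the SSD's
# media-write statistics; the values below are placeholders, not measurements.

rows_written = 10_000_000
row_size = 188                              # bytes of useful data per row
useful_bytes = rows_written * row_size

host_bytes_written = 350 * 1024**3          # placeholder host-side write volume
nand_bytes_written = 160 * 1024**3          # placeholder NAND write volume after compression

host_waf = host_bytes_written / useful_bytes
nand_waf = nand_bytes_written / useful_bytes
print(f"host write amplification ~ {host_waf:.0f}x, "
      f"NAND write amplification after compression ~ {nand_waf:.0f}x")
```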

Figure 7 shows the log/journal write volume and data write volume of MySQL and MyRocks under different write workloads with a 188B row size (both normalized to the total write volume on an ordinary SSD). As the figure shows, log/journal writes account for a considerable share, especially in MyRocks where they basically exceed 70%. In general, log/journal data compresses well; logs written by the file system in particular often achieve a compression ratio of 5:1 or better. Therefore, with transparent compression, the amount of data written drops significantly (by more than 50% in our tests).

Figure 7. Normalized data write volume of MySQL and MyRocks on different SSDs with a 188B row size

Figure 8 compares the absolute write amplification of MySQL and MyRocks in different scenarios with a 188B row size. In the update scenarios, write amplification is still very serious: MySQL exceeds 200 and RocksDB is around 80. With transparent compression, the overall write amplification of both MySQL and MyRocks is cut by more than half, which means the lifetime of the same SSD can be more than doubled.

Figure 8. Absolute write amplification of MySQL and MyRocks on different SSDs with a 188B row size

We also measured the normalized write volume and the absolute write amplification with a 512B row size, as shown in Figure 9 and Figure 10. As the row size grows, the share of log/journal writes decreases, but it is still substantial (25% for MySQL and 60% for MyRocks); and the absolute write amplification in the update scenarios drops, from about 200 to about 80 for MySQL and from about 80 to about 50 for MyRocks. As in the 188B case, with transparent compression the overall write volume/write amplification of either database is reduced by more than half.

Figure 9. Normalized data write volume of MySQL and MyRocks on different SSDs with a 512B row size

Figure 10. Absolute write amplification of MySQL and MyRocks on different SSDs with a 512B row size

5

Summary

Most database applications suffer from serious write amplification, which makes the lifetime of SSDs (especially the QLC SSDs that have just come to market) a major concern. Transparent compression effectively reduces the amount of data written to NAND and offers a new approach to mitigating write amplification in databases.

6

References

  1. https://blog.csdn.net/dbanote/article/details/8897599

  2. https://ark.intel.com/content/www/cn/zh/ark/products/186675/intel-ssd-d5-p4326-series-15-36tb-e1-l-pcie-3-1-x4-3d2-qlc.html

  3. https://www.jianshu.com/p/5c846e205f5f

  4. http://www.postgres.cn/v2/news/viewone/1/654

  5. https://scaleflux.com/product.html
