Computational storage: transparent compression, the database I/O model, and SSD lifespan

Source: ScaleFlux WeChat public account

The speed of thought

What Andy giveth, Bill taketh away.

Every time Intel delivers more powerful processors, Microsoft's software promptly soaks up the extra computing power.

Andy and Bill's Law, coined in the 1990s, still holds, and with the exponential growth of data it plays out with ever greater intensity in data storage and processing. "Business will change more in the next ten years than it has in the past fifty." So wrote Bill Gates in his 1999 book Business @ the Speed of Thought. It is hard to list every key change one by one, but the storage field is clearly tracking this prediction. Take the widely discussed hires under Huawei's "Genius Youth" program: Zhang Ji works on intelligent optimization for disks and databases, Yao Ting on new storage media and key-value storage systems, and Zuo Pengfei on non-volatile memory systems. All three work squarely on storage, which suggests that change in this field is still very much under way.

More efficient productivity inevitably displaces the old productivity and the old production relations. From the partitioning technology of mainframes and minicomputers, to x86 virtualization, to today's popular containers and container orchestration, every step has aimed to raise application deployment density, improve the utilization of compute resources, and ultimately lower the cost of ownership. Storage is no exception. Looking back at the history of flash:

SSDs keep landing at scale in critical enterprise applications. From MLC to TLC to QLC, capacity keeps growing and cost per bit keeps falling, but because of how SSD technology works at the physical level, the endurance problem becomes ever more prominent.

Write amplification and lifespan

SSDs cannot overwrite old data in place the way memory or mechanical hard drives can. Data can only be written to "clean" pages, and pages are only reclaimed by erasing an entire block. When the SSD's free space runs low and data becomes heavily fragmented, the drive must read out a whole block and rewrite the still-valid data into a freshly erased block before the space can be reused. This process is called garbage collection (GC), and it is the source of write amplification. Write amplification is quantified by the Write Amplification Factor (WAF). JEDEC (the Solid State Technology Association, the body behind the SSD standards) defines the write amplification factor as follows:
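In essence, WAF is the ratio of the data the NAND actually absorbs to the data the host asked to write:

$$\mathrm{WAF} = \frac{\text{data written to the NAND flash}}{\text{data written by the host}}$$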


The figure shows JEDEC's worked example of how WAF is calculated. The block contains 64 pages (256 sectors). Suppose the data in Page1~Page3 (8 sectors) needs to be updated; the procedure is:

  1. Copy Page0~Page63 into DRAM;

  2. Update Page1~Page3 in DRAM;

  3. Erase the block and write the DRAM contents back;

At this point the write amplification factor is 256/8 = 32: the drive physically writes 32 times the data the host requested, which translates directly into extra program/erase cycles.
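As a quick check of that arithmetic, the small sketch below (purely illustrative; the counts are the ones from the JEDEC example above) computes the same factor:

```python
def rewrite_waf(sectors_per_block: int, sectors_updated: int) -> float:
    """WAF for a naive read-modify-write: the whole block gets physically
    rewritten even though the host only updated a few sectors."""
    return sectors_per_block / sectors_updated

# JEDEC example above: a 64-page block (256 sectors), 8 sectors updated
print(rewrite_waf(256, 8))  # -> 32.0
```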

NAND flash cells can endure only a limited number of program/erase (P/E) cycles, so SSD lifespan is usually rated in TBW (terabytes written) or DWPD (drive writes per day). Write amplification multiplies the number of P/E cycles the NAND actually sees and therefore shortens the usable life. The relation is as follows:
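In its commonly quoted form (the exact constants vary by vendor), the relation between rated write life, NAND endurance and write amplification is:

$$\mathrm{TBW} \approx \frac{\text{capacity} \times \text{P/E cycles}}{\mathrm{WAF}}, \qquad \mathrm{DWPD} = \frac{\mathrm{TBW}}{365 \times \text{warranty years} \times \text{capacity}}$$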

At the same time, as the cumulative write volume grows, the probability of bad blocks rises. From a reliability standpoint, the more important metric is UBER (Uncorrectable Bit Error Rate), which measures the number of data errors that remain after the error-correction mechanism has done its work, divided by the total number of bits read.
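Written out as a formula:

$$\mathrm{UBER} = \frac{\text{data errors remaining after error correction}}{\text{total bits read}}$$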

UBER describes the probability of a read error, so the lower the value, the better. The following figure shows how UBER changes as the amount of data written grows:

  • 0~600: UBER stays at 0;

  • 600~800: as writes continue to accumulate, a small number of bad blocks form, and UBER climbs slowly from 0 to 0.003;

  • 800~1000: more cells fail, and write amplification stacks extra P/E cycles onto the remaining NAND, making new bad blocks ever more likely; even a small increase in write volume now makes UBER, and with it the read error rate, rise sharply.

For to everyone who has, more will be given; but from him who has not, even what he has will be taken away.

In short, for SSDs that carry critical enterprise workloads, reliability has to be considered alongside write volume. As P/E cycles pile up, the read errors they eventually cause are just as unacceptable to enterprise applications. TBW and UBER should never be discussed in isolation, much as Recovery Time Objective (RTO) and Recovery Point Objective (RPO) always appear together in the database world: bragging only about how fast the service recovers, or insisting only on zero data loss, is meaningless on its own. The industry keeps working to raise the endurance of NAND and to improve GC algorithms; at the same time, the transparent compression offered by computational storage opens a new direction for improving SSD lifespan and stability.

Compression and lifespan

Qualitatively, compression reduces the amount of data written and therefore extends the write endurance of the NAND. In engineering practice, though, more detailed data is needed. Any assessment of the lifetime benefit of compression must start from the premise that the business is not affected (see: Computational Storage: Data Compression and Database Computing Pushdown). Many factors determine write endurance: NAND quality, data model, temperature, humidity, and a dose of luck. The write pattern is surely among the most important, which brings us back to the enterprise scenarios that SSDs actually serve. JESD219 (the JEDEC document that specifies the workload used in SSD endurance testing) characterizes enterprise SSD workloads and defines a rigorous load-simulation test on that basis. With JESD219 as the de facto standard, it is convenient to verify further how much data compression reduces write amplification.

JESD219 workload

The workload has the following characteristics (a small sketch after the list mimics them):

  • Data popularity: accesses to the data set are concentrated; 5% of the data receives 50% of the accesses, and 20% of the data receives 80% of the accesses;

  • I/O size: mostly small I/O; 67% of requests are 4KB, while very large and very small I/O are comparatively rare. The figure below shows the share of each I/O size;

  • Test duration: based on this workload model, write amplification is measured for data of different compression ratios, different over-provisioning sizes, and different capacities. To stay close to real business scenarios it is essential that the test runs long enough, so the I/O pattern above is kept running continuously for 10,000 minutes;

  • Recording: the write amplification factor is sampled once per minute.
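To make the shape of this workload concrete, here is a minimal Python sketch that draws I/O sizes and target offsets with roughly these proportions. It is an illustration only, not the actual test script: the 4KB/67% figure and the 5%→50% / 20%→80% popularity split come from the list above, while the remaining size buckets and all names are illustrative assumptions.

```python
import random

# Popularity zones: (share of accesses, [start, end) of the LBA space as fractions).
# 5% of the data gets 50% of the accesses; 20% of the data gets 80% cumulatively,
# so the next 15% of the space receives 30% and the remaining 80% receives 20%.
ZONES = [
    (0.50, (0.00, 0.05)),
    (0.30, (0.05, 0.20)),
    (0.20, (0.20, 1.00)),
]

# I/O size mix: 4KB dominates at 67%; the other buckets are illustrative only.
IO_SIZES = [
    (0.10, 512),
    (0.67, 4096),
    (0.13, 8192),
    (0.07, 16384),
    (0.03, 32768),
]

def pick(weighted):
    """Pick a value from a list of (probability, value) pairs."""
    r, acc = random.random(), 0.0
    for p, v in weighted:
        acc += p
        if r < acc:
            return v
    return weighted[-1][1]

def next_write(device_bytes):
    """Return (offset, size) for one random write following the mix above."""
    lo, hi = pick(ZONES)
    size = pick(IO_SIZES)
    offset = random.randrange(int(lo * device_bytes), int(hi * device_bytes) - size)
    return offset - offset % 512, size  # sector-align the offset

if __name__ == "__main__":
    for _ in range(5):
        print(next_write(3_200_000_000_000))  # e.g. a 3.2TB drive
```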

Here is an excerpt of the heavily modified script that generates the load:

The record produced at the end of the run is shown in the figure below:

The figure shows the JESD219-based test: the same load model run against the common enterprise capacities of 3.2TB, 3.84TB and 6.4TB, with the data compression ratio stepping up through 1:1, 1.2:1, 2:1, 2.13:1, 3.7:1 and 9:1, each run lasting 10,000 minutes while write amplification is recorded. The test results show:

  1. Every configuration reaches a steady state after running for a while, at which point the write amplification levels off;

  2. For the same 3.2TB SSD, as the data compression ratio moves through 1:1, 1.2:1, 2.13:1 and 9:1, write amplification falls sharply; the 3.84TB and 6.4TB SSDs behave similarly;

In other words, as the compression ratio grows, SSDs of every capacity keep reaping the benefit of reduced write amplification.

Picking a typical capacity configuration and comparing the specific numbers, as shown in the figure below:

Enterprise SSDs typically reserve 28% of over-provisioning (3.2TB usable). At a 1:1 compression ratio the write amplification factor is 1.79; at 1.2:1 it drops to 1.48 (just 20% compression already cuts write amplification by 17%); at 2.1:1 the write amplification factor is 0.58, a 67% reduction.
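Those percentages follow directly from the sampled WAF values:

$$\text{WAF reduction} = 1 - \frac{\mathrm{WAF}_{\text{compressed}}}{\mathrm{WAF}_{1:1}}, \qquad \text{e.g. } 1 - \frac{1.48}{1.79} \approx 17\%$$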

Consumer-grade SSDs typically reserve only 7% of over-provisioning (3.84TB usable). At 1:1 the write amplification factor is 3.67; at 1.2:1 it drops to 1.95, so 20% compression cuts write amplification by nearly half, bringing it close to the 1.79 of an uncompressed (1:1) enterprise SSD; at 2.1:1 the write amplification factor is 0.62, an 83% reduction.

Looking back at TBW and UBER, one can summarize:

Compression improves not only performance but also lifespan and stability.

Beyond compression itself, many other aspects matter when combined with enterprise services: whether it is transparent to the business, whether it is zero-copy and adds no extra overhead, how well it scales, and so on; see Computational Storage: Data Compression and Database Computing Pushdown. The following figure, based on the transparent compression of computational storage, is provided for reference.

JESD219 is already quite thorough, but getting closer to the actual application load requires going a step further. Take a database scenario: running MySQL with sysbench generating mixed read/write pressure (oltp_read_write, 2TB data set) and tracing the I/O with eBPF, the observed I/O model still differs noticeably from JESD219 (a minimal tracing sketch follows the list below):

  1. I/O size: 73% of I/O falls in the 16~31KB range, because the InnoDB data page defaults to 16KB; 15% of I/O is below 1KB, because the redo log defaults to 512-byte I/O operations;

  2. Data popularity: the sysbench test issues mostly scattered random I/O, spread roughly evenly across the entire 2TB data set;
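For reference, a block-layer I/O size histogram like the one behind point 1 can be captured with a few lines of bcc/eBPF. The sketch below is a generic version in the spirit of bcc's classic "bitesize" example, not the exact tooling used in this test:

```python
#!/usr/bin/env python3
# Log2 histogram of block-layer I/O request sizes (needs root and the bcc package).
from time import sleep
from bcc import BPF

prog = """
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(block, block_rq_issue) {
    dist.increment(bpf_log2l(args->bytes));
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing block I/O sizes... hit Ctrl-C to print the histogram")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("bytes")
```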

Of course, the JESD219 script can be further tweaked to simulate the application load even more closely; we will not expand on that here.

Tenfold change

There's plenty of room at the bottom.

In December 1959, the physicist Richard P. Feynman gave a talk entitled "There's Plenty of Room at the Bottom" at the annual meeting of the American Physical Society, a talk that went on to profoundly influence the development of physics. Computational storage, sitting at the "bottom" of the application stack, will likewise have a growing impact on application architecture.

References

  • Computational Storage: Data Compression and Database Computing Pushdown: https://mp.weixin.qq.com/s/VFgBtn1dyHW0VUsHxKW6BA

  • On the Impact of Garbage Collection on Flash-Based SSD Endurance: https://www.usenix.org/system/files/conference/inflow16/inflow16-paper-verschoren.pdf

  • Write Amplification Analysis in Flash-Based Solid State Drives: https://dl.acm.org/doi/abs/10.1145/1534530.1534544

  • QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage: https://www.usenix.org/system/files/atc19-hu.pdf

  • Improving Performance and Lifetime of Solid-State Drives Using Hardware-Accelerated Compression: https://ieeexplore.ieee.org/document/6131148

  • Zoned Namespaces (ZNS) SSDs: https://zonedstorage.io/introduction/zns/

  • JESD219: https://www.jedec.org/sites/default/files/docs/JESD219.pdf

Authors:

Jin Ge@ScaleFlux, 熊中哲@ScaleFlux

For more videos, including training sessions, interviews, and product introductions, please follow the video channel of our WeChat official account.

ScaleFlux video information official website:
http://scaleflux.com/videos.html
ScaleFlux Youku channel:
https://i.youku.com/scaleflux?spm=a2hzp.8244740.0.0


Enjoy MySQL :)


