Paper study: Austere Flash Caching with Deduplication and Compression

Paper title: Austere Flash Caching with Deduplication and Compression

Source: USENIX ATC 2020

Link: https://www.usenix.org/conference/atc20/presentation/wang-qiuping

Key concepts

deduplication

Data deduplication removes duplicate data at block granularity in a coarse-grained but lightweight manner. Data is divided into chunks (KB scale) and the content of each chunk is hashed; if two chunks have the same (different) hash value, called a fingerprint (FP), they are treated as redundant (unique) data. Only one physical copy of each redundant chunk is kept on the SSD, which achieves data reduction. In the logical space, however, the redundant chunks are not deleted: they all point to the same physical address. In addition, the mapping between each chunk and its FP is stored for duplicate checking and chunk lookup. Chunks may be fixed-size or variable-size; this paper focuses on deduplication with fixed-size chunks.
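As a minimal illustration (not the paper's code), a fixed-size-chunk deduplicator might look like this in Python, assuming SHA-1 fingerprints and a plain in-memory dict as the chunk store:

```python
import hashlib

CHUNK_SIZE = 32 * 1024  # illustrative fixed chunk size (KB scale)

def dedup(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks; keep one physical copy per
    unique fingerprint; return the logical view as a list of FPs."""
    fps = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).digest()  # fingerprint (FP) of the content
        if fp not in store:                # unique chunk: store one copy
            store[fp] = chunk
        fps.append(fp)                     # duplicates just reference the FP
    return fps
```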

compression

Compression aims at fine-grained, byte-level data reduction by transforming data into a more compact form. It is usually performed after deduplication and is generally applied only to the unique chunks. Compressed chunks are typically variable-size. The paper uses sequential compression algorithms (e.g., the Lempel-Ziv family) that operate on the bytes of each chunk.
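A small sketch of that step, using Python's zlib (a DEFLATE/LZ77-family codec) as a stand-in for whichever sequential compressor is actually used:

```python
import zlib

def compress_unique(store: dict) -> dict:
    """Map each fingerprint to its variable-size compressed chunk.
    Only unique chunks (one entry per FP) are compressed."""
    return {fp: zlib.compress(chunk) for fp, chunk in store.items()}
```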

Deduplication and compression are two complementary data reduction techniques.

flash

Flash memory combines the advantages of ROM and RAM: it is electrically erasable and programmable like EEPROM, yet offers fast reads (the advantage of NVRAM), and it does not lose data on power failure. The biggest difference between solid-state drives (SSDs) and traditional mechanical hard drives is that SSDs no longer store data on spinning platters but on storage chips. SSD storage chips fall into two main types: those that use flash memory as the storage medium and those that use DRAM. Flash-based SSDs are currently the most common.

flashcache

Flashcache is an open-source project from Facebook's engineering team, originally built to accelerate InnoDB, the MySQL database engine. It is a hybrid storage solution that combines the advantages of mechanical hard drives and solid-state drives: the high capacity and good sequential access of HDDs, plus the fast random access and low latency of SSDs, at relatively low cost. Flashcache uses Linux's device-mapper mechanism to map the flash block device and the ordinary hard disk together, exposing them to the OS as a single ordinary disk, so it is simple to use. It adds a caching layer between the file system and the device driver to cache hot data: the SSD serves as the cache, hot data from the traditional hard disk is kept on the SSD, and the SSD's excellent read performance then speeds up the system. Reference: https://my.oschina.net/u/658505/blog/544599

The figure above shows a general flashcache architecture with deduplication and compression, which guarantees that the chunks stored on the SSD are unique and compressed. A traditional flashcache has neither function and therefore maintains only one index: LBA -> CA. A back-of-the-envelope calculation shows that a flashcache with deduplication and compression must maintain two indexes (the LBA-index and the FP-index), and its memory cost is 16x that of a traditional flashcache (4 GB vs. 256 MB). It also incurs extra CPU overhead (computing FPs, compressing data, looking up index entries, etc.).

LBA-index: LBA -> FP

The logical address index maps the logical block address (LBA) of data stored on the HDD to the FP of the corresponding chunk (many-to-one: multiple logical addresses may point to the same data).

FP-index: FP -> (CA, length)

The fingerprint index maps each chunk's FP to the chunk's physical cache address (CA) on the SSD after compression, along with the compressed size (one-to-one).
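Putting the two indexes together, the read path conceptually performs two lookups. A toy sketch, with plain dicts and a byte buffer standing in for the real index structures and the SSD:

```python
ssd = bytearray(4 * 1024 * 1024)   # toy stand-in for SSD cache space
lba_index = {}                     # LBA -> FP            (many-to-one)
fp_index = {}                      # FP -> (CA, length)   (one-to-one)

def cache_insert(lba, fp, ca, compressed: bytes):
    lba_index[lba] = fp
    fp_index[fp] = (ca, len(compressed))
    ssd[ca:ca + len(compressed)] = compressed

def cache_read(lba):
    fp = lba_index.get(lba)
    if fp is None:
        return None                       # miss: fall back to the HDD
    ca, length = fp_index[fp]
    return bytes(ssd[ca:ca + length])     # a real system decompresses here
```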

Write-through mode

When data is updated, it is written to the cache and the back-end storage at the same time. The advantage of this mode is simplicity; the disadvantage is that every modification must also be written to storage, so writes are slow.

Write-back mode

When data is updated, only the cache is written; the modified data reaches back-end storage only when it is evicted from the cache. The advantage is fast writes, since storage is not touched on every update; the disadvantage is that if the system loses power before updated data is written to storage, that data is lost.
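A toy illustration of the two policies (class and names are mine, not from the paper):

```python
class Cache:
    def __init__(self, backend: dict):
        self.data = {}           # cached blocks
        self.dirty = set()       # blocks not yet on storage (write-back)
        self.backend = backend   # dict standing in for HDD storage

    def write_through(self, lba, block):
        self.data[lba] = block
        self.backend[lba] = block   # synchronous storage write: safe but slow

    def write_back(self, lba, block):
        self.data[lba] = block
        self.dirty.add(lba)         # defer the storage write: fast, but lost on power failure

    def evict(self, lba):
        if lba in self.dirty:       # flush dirty data only on eviction
            self.backend[lba] = self.data[lba]
            self.dirty.discard(lba)
        self.data.pop(lba, None)
```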

Problem addressed

SSDs are now widely used as a caching layer between RAM and HDDs (i.e., flashcache technology, which improves overall I/O performance by keeping hot data on the SSD), but their lifetime and capacity are limited. To address this, the paper proposes AustereCache, a flash-caching framework with efficient deduplication and compression that extends SSD lifetime and stretches effective capacity as much as possible, while also curbing the huge memory overhead that deduplication/compression index management normally brings. Its memory cost is 69.9-97.0% lower than that of CacheDedup, the state-of-the-art deduplicated flashcache framework, with comparable performance.

Core Technology

AustereCache emphasizes austere cache data management and uses several techniques for efficient data organization and cache replacement. It has three core technical points:

Bucketization

As shown in the figure below, to eliminate the memory overhead of storing full address mappings in the LBA-index and FP-index, AustereCache hashes index entries into equal-sized partitions called buckets, each divided into multiple slots. Each slot stores only a prefix of the hashed LBA or FP (e.g., the first 16 bits) to save memory. A chunk's position on the SSD is determined by its bucket location: the SSD is divided into a metadata region and a chunk data region, both of which are also organized into buckets of multiple slots that correspond one-to-one with the FP-index.
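A rough sketch of the idea, with made-up bucket/slot counts and a 16-bit prefix; the paper's actual layout and hash function are not reproduced here:

```python
import hashlib

NUM_BUCKETS = 1024        # illustrative
SLOTS_PER_BUCKET = 32     # illustrative
PREFIX_BITS = 16          # short tag kept in memory per slot

def _h(key: bytes) -> int:
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def locate(key: bytes):
    """Map an LBA or FP to (bucket, prefix); only the prefix is stored."""
    h = _h(key)
    bucket = h % NUM_BUCKETS
    prefix = (h >> 32) & ((1 << PREFIX_BITS) - 1)
    return bucket, prefix

# The (bucket, slot) position of an FP-index entry directly determines
# where the chunk lives on the SSD, so no full address is kept in memory.
```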

Fixed-size compressed data management

To avoid tracking chunk lengths in the FP-index, AustereCache divides each variable-size compressed chunk into smaller fixed-size subchunks and manages those without recording the compressed length in memory. As shown in the figure below, a compressed chunk occupies multiple consecutive slots in the FP-index, one per subchunk; the compressed length is recorded on the SSD rather than in the FP-index, shrinking the FP-index and reducing memory overhead. In this way the same bucket-slot mechanism manages every chunk (a variable-size compressed chunk simply becomes several fixed-size subchunks) while saving memory.
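A minimal sketch of the subchunk split, assuming an illustrative 8 KB subchunk size:

```python
SUBCHUNK = 8 * 1024  # illustrative fixed subchunk size

def to_subchunks(compressed: bytes) -> list:
    """Split a variable-size compressed chunk into fixed-size subchunks,
    padding the last one; each subchunk occupies one consecutive slot.
    The exact compressed length is stored on the SSD, not in memory."""
    n = -(-len(compressed) // SUBCHUNK)            # ceil: slots needed
    padded = compressed.ljust(n * SUBCHUNK, b"\0")
    return [padded[i * SUBCHUNK:(i + 1) * SUBCHUNK] for i in range(n)]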

Bucket-based cache replacement

To increase the likelihood of cache hits, AustereCache combines recency with deduplication awareness based on reference counting (i.e., the number of duplicate copies referencing each unique chunk) for effective SSD cache replacement. However, storing exact reference counts would itself incur non-negligible memory overhead, so AustereCache uses a fixed-size compact sketch data structure that estimates reference counts within bounded error in limited memory.
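A compact sketch of this kind is typically a Count-Min Sketch; here is a minimal version with illustrative width/depth parameters (the paper's exact structure and sizes may differ):

```python
import hashlib

class CountMinSketch:
    """Fixed-size structure for approximate counts; estimates can only
    overcount, never undercount, and memory is independent of key count."""
    def __init__(self, width=4096, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, key: bytes):
        for row in range(self.depth):
            digest = hashlib.sha1(bytes([row]) + key).digest()
            yield row, int.from_bytes(digest[:4], "big") % self.width

    def add(self, key: bytes, delta=1):
        for row, col in self._cols(key):
            self.table[row][col] = max(0, self.table[row][col] + delta)

    def estimate(self, key: bytes) -> int:
        return min(self.table[row][col] for row, col in self._cols(key))
```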

For the LBA-index, the replacement policy is LRU within each bucket. Whenever an LBA is inserted or accessed, its entry moves to the front (offset 0) and all other entries shift back by one slot; the last entry (maximum offset) is the one replaced.
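A few lines capturing that in-bucket LRU behavior (illustrative, list-backed):

```python
def touch(bucket: list, entry, capacity: int) -> list:
    """Move an inserted/accessed entry to offset 0; shift the rest back;
    return any entries pushed past the maximum offset (evicted)."""
    if entry in bucket:
        bucket.remove(entry)
    bucket.insert(0, entry)        # most recent at offset 0
    evicted = bucket[capacity:]    # overflow falls off the end
    del bucket[capacity:]
    return evicted
```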

For the FP-index, the replacement policy combines deduplication with recency (essentially the locality principle), using an additional structure, the reference count, to indicate how many LBAs point to each chunk; when an FP-index bucket is full, the entry with the smallest count is replaced, which favors keeping highly deduplicated data. In addition, each FP-index bucket is divided into two areas, recent and old. An entry in the recent area gains count+2 each time it is accessed or newly inserted; when an entry moves from recent to old, or is replaced while in old, its count decreases by 1; and when an entry in old is accessed and promoted back into the recent area, its count increases by 1. These rules account for both deduplication and locality.
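The count-update rules above, expressed as a sketch (a plain dict stands in for the compact sketch structure):

```python
def on_access(entry, counts: dict, in_recent: bool):
    # +2 for an insert/access in the recent area; +1 when an old entry
    # is promoted back into the recent area
    counts[entry] = counts.get(entry, 0) + (2 if in_recent else 1)

def on_demote_or_evict(entry, counts: dict):
    # -1 when moving from recent to old, or when replaced while old
    counts[entry] = max(0, counts.get(entry, 0) - 1)

def victim(old_area: list, counts: dict):
    # replace the old-area entry with the smallest estimated count
    return min(old_area, key=lambda e: counts.get(e, 0))
```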

Performance evaluation

Datasets

① Traces (I/O access sequences) collected by FIU from three kinds of servers: web virtual machines, a file server, and a mail server.

② A custom trace: a trace generator designed for throughput measurement, which produces a trace from two parameters: a. the I/O deduplication ratio (roughly, the redundancy of the data requested by I/Os); b. the read/write ratio. A toy version is sketched below.
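A toy generator under those two knobs; everything beyond the two parameters (LBA space, content-id scheme, seed) is my assumption:

```python
import random

def gen_trace(n_ops, dedup_ratio, read_ratio, n_unique=1000, seed=42):
    """dedup_ratio: fraction of requests that repeat existing content;
    read_ratio: fraction of operations that are reads."""
    rng = random.Random(seed)
    trace = []
    for i in range(n_ops):
        op = "R" if rng.random() < read_ratio else "W"
        # duplicate content with probability dedup_ratio, else fresh content
        if rng.random() < dedup_ratio:
            content = rng.randrange(n_unique)
        else:
            content = i + n_unique
        trace.append((op, rng.randrange(4096), content))  # (op, LBA, content id)
    return trace
```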

Experiments

  1. Multiple experiment groups on dataset ① show that AustereCache achieves a much higher overall memory saving ratio, read hit ratio, and write reduction ratio than the current state-of-the-art CacheDedup.

  2. Further experiments on dataset ① evaluate AustereCache's sensitivity to its parameters, including the impact of chunk and subchunk size settings and of LBA-index size on memory overhead. The results show that larger chunks lower the memory overhead, while a larger LBA-index raises it.

  3. Experiments on overall I/O throughput with dataset ② show that the more redundant the requested data, the fewer SSD accesses are needed, which improves overall throughput. They also show that the higher the proportion of read requests, the smaller the throughput gain.

  4. CPU overhead is also measured on dataset ②, showing that the main computational cost is fingerprinting chunk data. Follow-up experiments show that enabling multithreading reduces the CPU bottleneck to some extent, and throughput rises with the number of threads.

Summary

Strengths

The paper elaborates the three key techniques of AustereCache and demonstrates, through very thorough experiments, its superiority in managing the index data structures: memory overhead drops substantially while the read hit ratio and write reduction ratio stay on par with the current state-of-the-art architecture.

Problems I personally see

However, the paper does not pursue the chunk-size setting further. Larger chunks lower memory overhead but may cause extra read and write traffic, because an oversized chunk is a poor fit for workloads that operate on small blocks.

The paper also does not go on to quantify how much reducing SSD writes actually extends SSD lifetime; it only argues qualitatively that lifetime is extended.

The paper focuses on saving memory overhead, but may actually increase time overhead. For example, the LBA-index replacement policy uses LRU, yet no comparison is given against other replacement algorithms such as FIFO or CLOCK. Moreover, every insertion, deletion, or replacement of an entry shifts all the remaining entries, which seems likely to add significant time overhead and leaves room for further optimization.

Also, the per-entry counts in the FP-index are obtained by estimation, so they carry some error and feel imprecise, and the reference-counting structure itself consumes noticeable memory; storing exact counts directly in the FP-index, however, would cost even more.


A final question: why should the entry with the lowest count in the FP-index be the easiest to replace, and how does this reflect deduplication? A lower count does mean less duplication, and from the locality standpoint replacing the lowest-count entry first makes sense, but it still feels somewhat contradictory as a "deduplication" policy.


Original post: blog.csdn.net/qq_40742077/article/details/108898095