Implementation of HBase Compression

The basic properties in HBase are set at the level of a column family, as follows (see the configuration sketch below):

Data encoding/compression: Compress / Decompress
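
As a minimal sketch, here is how a column family can be created with Snappy compression using the HBase 2.x Java client API (the table name "demo_table" and the family name "cf" are illustrative assumptions, not from the original article):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Compression is a column-family property: it is set on the family
            // descriptor, not on the table as a whole.
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("demo_table")) // hypothetical table name
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("cf"))     // hypothetical family name
                            .setCompressionType(Compression.Algorithm.SNAPPY)
                            .build())
                    .build());
        }
    }
}
```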

Data compression is another feature provided by HBase. Before writing data blocks to HDFS, HBase compresses them and then places them on disk, thereby reducing disk space usage. When reading, a block is first loaded from HDFS, then decompressed, then cached in the BlockCache, and finally returned to the user. The write path and read path are as follows:

[Figure: write path (blocks are compressed before being flushed to HDFS) and read path (blocks are loaded from HDFS, decompressed, and cached in BlockCache)]

(1) Resource usage: The most direct and important benefit of compression is reduced disk usage. In theory, Snappy can achieve a compression ratio of 5:1, although depending on the data the actual ratio may fall short of that ideal. (With the Snappy compression we use in our production environment, performance feels good and the compression ratio is very pleasing, reaching almost 6:1.)
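
To get a feel for how data-dependent the ratio is, here is a hedged sketch that compresses a sample payload with Hadoop's SnappyCodec (the Hadoop codec traditionally backing HBase's Snappy support) and prints the achieved ratio; it assumes the native Snappy library is available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SnappyRatioCheck {
    public static void main(String[] args) throws Exception {
        // Build a sample payload; repetitive data compresses far better than random data,
        // which is why real-world ratios vary so much.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("rowkey-").append(i % 100).append(",col:value,");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Instantiate the Hadoop Snappy codec (requires the native Snappy library).
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (CompressionOutputStream cos = codec.createOutputStream(bos)) {
            cos.write(raw);
        }

        System.out.printf("raw=%d bytes, compressed=%d bytes, ratio=%.1f:1%n",
                raw.length, bos.size(), (double) raw.length / bos.size());
    }
}
```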

Compression and decompression undoubtedly require a lot of computation and consume considerable CPU. Following the read path above, a block is decompressed before its data is cached, so the blocks held in the BlockCache are uncompressed; compared with the uncompressed case, memory usage is therefore essentially unchanged.

(2) Read and write performance: Writes first append KV data to an in-memory buffer (the MemStore) and are flushed to disk in batches; compression happens during the flush, so it affects the flush operation but has little impact on write performance itself. If data is read from HDFS, it must be decompressed first, so read performance theoretically decreases; if data is read from the cache, the block there is already decompressed, so performance is unaffected. In practice most reads are hot reads served from the cache, so compression does not have much impact on reads. (A client-side sketch of this write path follows.)
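
The small sketch below shows that write path from the client's point of view, reusing the hypothetical demo_table/cf schema from the earlier example: the Put lands in the MemStore with no compression cost, and compression is only paid when a flush writes an HFile.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FlushDemo {
    public static void main(String[] args) throws Exception {
        TableName name = TableName.valueOf("demo_table"); // hypothetical table name
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(name);
             Admin admin = conn.getAdmin()) {
            // The write lands in the MemStore uncompressed; no compression cost yet.
            table.put(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value")));
            // Compression is applied when the MemStore is flushed into an HFile on HDFS.
            admin.flush(name);
        }
    }
}
```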

It can be seen that compression trades CPU resources for disk space, without much impact on read and write performance. HBase currently provides three commonly used compression algorithms: GZIP, LZO, and Snappy. The table below is the official comparison of compression ratio and encode/decode speed:

[Table: official comparison of compression ratio and encode/decode speed for GZIP, LZO, and Snappy]

In general, Snappy has the lowest compression ratio but the fastest encode/decode speed and the smallest CPU consumption; Snappy is currently the generally recommended choice.
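
For an existing table, one possible migration sketch using the HBase 2.x Admin API (table and family names are again assumptions) is to modify the column family and then trigger a major compaction, since existing HFiles keep their old compression until they are rewritten:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class SwitchToSnappy {
    public static void main(String[] args) throws Exception {
        TableName table = TableName.valueOf("my_table"); // hypothetical table name
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf"))            // hypothetical family name
                    .setCompressionType(Compression.Algorithm.SNAPPY)
                    .build();
            admin.modifyColumnFamily(table, cf);
            // Existing HFiles remain readable with their old compression; a major
            // compaction rewrites them so the new setting takes effect on disk.
            admin.majorCompact(table);
        }
    }
}
```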
