HBase optimization | reasonable use of encoding and compression

Why discuss HBase encoding and compression

  • Encoding + compression can shrink the disk space occupied by data several times over, saving considerable storage cost

  • Encoding + compression generally raises system throughput, letting the system do more work

  • By default, tables are created with neither encoding nor compression enabled, which is unfriendly to beginners


Understanding HBase encoding

For example, suppose we have a logistics table called "express" that records the tracking details of logistics orders. As the table below shows, the rowkey has two parts separated by #: the logistics order number on the left and the update time of the logistics information on the right. The table has two columns: the logistics status and the logistics description.

rowkey | Status | Description
10324224#2019-04-21 10:51 | Shipped | The package is waiting to be collected
10324224#2019-04-21 19:46 | Picked up | [Jiaxing City] xxx of Pinghu Nanqiao has collected the package
10324224#2019-04-21 19:46 | In transit | [Jiaxing City] The package has left Pinghu Nanqiao and is on its way to the Jiaxing transit department
10324224#2019-04-21 20:41 | In transit | [Jiaxing City] The package has arrived at the Jiaxing transit department
10324224#2019-04-21 20:42 | In transit | [Jiaxing City] The package has left the Dajiaxing transit department and is on its way to the Hangzhou transit department
10324224#2019-04-21 22:50 | In transit | [Jiaxing City] The package has arrived at the Hangzhou transit department
... | ... | ...

Encoding the rowkey gives the following:

rowkey | rowkey prefix length | Status | Details
10324224#2019-04-21 10:51 | 0 | Shipped | The package is waiting to be collected
19:46 | 20 | Picked up | [Jiaxing City] xxx of Pinghu Nanqiao has collected the package
19:46 | 20 | In transit | [Jiaxing City] The package has left Pinghu Nanqiao and is on its way to the Jiaxing transit department
20:41 | 20 | In transit | [Jiaxing City] The package has arrived at the Jiaxing transit department
20:42 | 20 | In transit | [Jiaxing City] The package has left the Dajiaxing transit department and is on its way to the Hangzhou transit department
22:50 | 20 | In transit | [Jiaxing City] The package has arrived at the Hangzhou transit department
... | ... | ... | ...

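As you can see, only the first row stores the complete rowkey; every other row is diffed against the first row and keeps only the part that differs. Besides the rowkey, HBase can also encode the version number, the update type, and so on. The reliable encodings HBase currently provides include "Prefix, Diff, Fast Diff, and Index Encoding (an encoding developed by the Alibaba HBase team that optimizes random-read scenarios)". For more on encoding, see "Compression and Data Block Encoding In HBase" and "HBase数据压缩编码探索" (Exploring HBase Data Compression and Encoding):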

http://archive.cloudera.com/cdh5/cdh/5/hbase-0.98.6-cdh5.3.1/book/compression.html?spm=a2c4e.11153940.blogcont702370.12.5a075c3bjlQkZ8

https://www.cnblogs.com/hbase-community/p/8915498.html?spm=a2c4e.11153940.blogcont702370.13.5a075c3bjlQkZ8
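The prefix-length idea in the encoded table can be sketched in a few lines of Python. This is only an illustration of the concept, not HBase's actual encoder: the function names `prefix_encode` and `prefix_decode` are hypothetical, and the sketch compares every key with the first key character by character, as the table does.

```python
def prefix_encode(rowkeys):
    """Encode each rowkey as (shared_prefix_length, remaining_suffix).

    Every key is compared with the first key, which is stored in full.
    """
    if not rowkeys:
        return []
    first = rowkeys[0]
    encoded = [(0, first)]
    for key in rowkeys[1:]:
        n = 0
        limit = min(len(first), len(key))
        # Count how many leading characters the key shares with the first key.
        while n < limit and first[n] == key[n]:
            n += 1
        encoded.append((n, key[n:]))
    return encoded


def prefix_decode(encoded):
    """Rebuild the full rowkeys from (prefix_length, suffix) pairs."""
    if not encoded:
        return []
    first = encoded[0][1]
    return [first[:n] + suffix for n, suffix in encoded]


keys = [
    "10324224#2019-04-21 10:51",
    "10324224#2019-04-21 20:41",
    "10324224#2019-04-21 20:42",
    "10324224#2019-04-21 22:50",
]
for prefix_len, suffix in prefix_encode(keys):
    print(prefix_len, repr(suffix))
```

Note that a strict character-by-character comparison reports a prefix length of 21 for the 19:46 rows, because "10:51" and "19:46" share the leading "1"; the table above simplifies this to the date boundary.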


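Understanding HBase compression

HBase compression applies to the values in a row and does not cover the RowKey, version, and so on. Because HBase's metadata carries no data types, its compression algorithms are all general-purpose, and they typically achieve a 4-10x compression ratio. The table below shows the benefit of various compression algorithms in different scenarios:

[Image]

HBase also offers other compression algorithms such as "Snappy, GZIP". For more on compression, see "Compression and Data Block Encoding In HBase" and "HBase数据压缩编码探索" (Exploring HBase Data Compression and Encoding):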

http://archive.cloudera.com/cdh5/cdh/5/hbase-0.98.6-cdh5.3.1/book/compression.html?spm=a2c4e.11153940.blogcont702370.12.5a075c3bjlQkZ8

https://www.cnblogs.com/hbase-community/p/8915498.html?spm=a2c4e.11153940.blogcont702370.13.5a075c3bjlQkZ8


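How to choose HBase encoding and compression

Different scenarios call for different optimization priorities, and the choice is harder in some than in others.

  • Few reads, many writes, large storage volume

This scenario usually stores relatively cold data: response-time requirements are low and storage cost matters. The storage medium is typically a cheap HDD or an ultra cloud disk, both of which are far weaker than SSD in IO throughput and IOPS. Here we choose a compression algorithm with a high compression ratio, "ZSTD or GZIP", and Diff for encoding. A combination such as "Diff+ZSTD" not only saves storage cost but also reduces the IO pressure from writes and Compaction (because less data is written to disk), which raises write throughput

Note: compression is not a cure-all. Sometimes the data is already compressed before it is written, or the compression ratio is poor; in those cases enabling compression just wastes CPU and backfires. So we need runtime data to judge the compression ratio. Open the HBase Web UI and find the detail page of table 't1':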

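[Image]


Expand the Table Schema section and look at the table's current encoding (DATA_BLOCK_ENCODING) and compression (COMPRESSION) settings

[Image]


Then scroll down to Table Regions: table 't1' has two regions distributed across two Region Server nodes. Pick one Region Server and click into it

[Image]

Select the first Region Server and click into it

[Image]

In the Region Server details, find the Regions section and select the Storefile Metrics tab, which lists the file details of every region on this Region Server. Find the region of table 't1', the row in the red box in the image above: the files are 305MB before encoding and compression and 49MB after, a compression ratio of 1:6.2.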
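The ratio can be checked with simple arithmetic. The helper below is a hypothetical convenience function, not part of HBase; it takes the two sizes read from the Storefile Metrics tab.

```python
def compression_ratio(uncompressed_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the files are after encoding + compression."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_mb / compressed_mb


# Sizes from the example above: 305MB before, 49MB after.
ratio = compression_ratio(305, 49)
space_saved = 1 - 49 / 305

print(f"compression ratio 1:{ratio:.1f}, space saved {space_saved:.0%}")
```

A ratio close to 1:1 would suggest that the data is already compressed or compresses poorly, which is the case the note above warns about.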

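  • Read-heavy scenarios

This scenario is usually online-facing, with many requests and some latency requirements. Optimizing reads is more complex than optimizing writes and depends heavily on the workload, so although we can apply some strategies and make predictions up front, real data is still needed to guide the tuning, case by case. In general, reads go through a cache, the "BlockCache", and its hit rate strongly affects read performance. By default, data in the BlockCache can stay "encoded", but data read from disk is decompressed before it is stored in the BlockCache. Version 2.0 added Compressed Block Cache (at the all-table level), which lets data in the Block Cache stay compressed as well.

Advantages:

  • Cache hit rates of 20%~80% are common, so a substantial share of requests must go to disk IO, and encoding and compression help reduce the cost of that IO;

  • Both encoding and compression let the Block Cache hold more data, which raises the hit rate. The actual gain depends on the scenario, though: random-read scenarios gain less than scenarios dominated by sequential reads;

  • Because the hit rate rises, cache evictions fall accordingly, which benefits GC.

Disadvantages:

  • Encoding and compression add CPU overhead, which affects read and write response times when CPU resources are tight

In summary, it is generally recommended to enable both encoding and compression; the DIFF+ZSTD combination as the default configuration suits the vast majority of scenarios. Use the HBase Web UI and monitoring to watch the compression ratio, CPU load, IO load, cache hit rate, read/write response times, and other metrics to judge whether the encoding and compression configuration needs further tuning.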


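How to change encoding and compression

Encoding and compression are set at the column family level, and we can change them online through the shell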

alter 't1', {NAME => 'info', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'ZSTD'}

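Note: changing a table's encoding and compression does not take effect immediately. A Major Compaction (for example, major_compact 't1' in the HBase shell) has to rewrite the data files; only the newly generated files use the new encoding and compression format.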




Origin blog.51cto.com/15060465/2676910