HBase Basics (7): A Complete Guide to HBase Performance Optimization

1. High availability

In HBase, the HMaster monitors the lifecycle of every HRegionServer and balances load across RegionServers. If the HMaster goes down, the whole HBase cluster falls into an unhealthy state, and it cannot keep working properly for long. HBase therefore supports a high-availability configuration with backup HMasters.

1. Shut down the HBase cluster (skip this step if it is not running)

bin/stop-hbase.sh

2. Create a backup-masters file in the conf directory

touch conf/backup-masters

3. Configure the highly available HMaster node in the backup-masters file

echo hadoop103 > conf/backup-masters

4. scp the entire conf directory to the other nodes

scp -r conf/ hadoop103:/opt/module/hbase/
scp -r conf/ hadoop104:/opt/module/hbase/

5. Open http://hadoop102:16010 in a browser to test and view the masters
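
To verify that failover actually works, you can kill the active HMaster process and confirm that a backup takes over. A minimal sketch, assuming the active master runs on hadoop102 as in this tutorial (the pid is a placeholder):

jps                    # on hadoop102, find the HMaster process id
kill -9 <HMaster pid>  # simulate a master crash
# refresh http://hadoop103:16010 (or run `status` in the hbase shell)
# and confirm that hadoop103 is now the active master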


2. Pre-partitioning

Every Region maintains a StartRow and an EndRow. If an incoming row's RowKey falls within the range a Region maintains, the data is handed to that Region. Following this principle, we can roughly plan in advance the partitions the data will land in, which improves HBase performance.

1. Manually configure pre-partitioning

hbase> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']

2. Generate pre-partitions from a hexadecimal sequence

create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

3. Pre-partition according to rules set in a file. Create a splits.txt file whose content is as follows:

aaaa
bbbb
cccc
dddd

Then execute:

create 'staff3','partition3',SPLITS_FILE => 'splits.txt' 

4. Create pre-partitions using the Java API

// Custom algorithm: produce a series of hash values stored in a 2-D byte array
byte[][] splitKeys = /* some hash function over sampled keys */;
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance for the table
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
// Create the pre-split HBase table from the descriptor and the split-key array
hAdmin.createTable(tableDesc, splitKeys);
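
The "some hash function" placeholder above is left to the reader. A minimal sketch of one possible generator, which cuts a 4-digit key-prefix space into evenly spaced split points (the helper name genSplitKeys is an illustrative assumption, not part of the original):

import java.text.DecimalFormat;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: build (numRegions - 1) evenly spaced split keys,
// e.g. for 5 regions: "2000", "4000", "6000", "8000"
public static byte[][] genSplitKeys(int numRegions) {
    byte[][] splitKeys = new byte[numRegions - 1][];
    DecimalFormat df = new DecimalFormat("0000");  // zero-pad to 4 digits
    for (int i = 1; i < numRegions; i++) {
        splitKeys[i - 1] = Bytes.toBytes(df.format(i * 10000L / numRegions));
    }
    return splitKeys;
}

hAdmin.createTable(tableDesc, genSplitKeys(5)) would then create the table with five regions.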

3. RowKey

The RowKey is the unique identifier of a row, so which partition a row is stored in depends on which pre-split range its RowKey falls into. The main goal of RowKey design is to distribute data evenly across all Regions and thereby prevent data skew to some extent. Below are the commonly used RowKey design approaches.

1. Generate random numbers, hashes, and hash values

For example:
The original rowKey 1001 becomes, after SHA1:
dd01903921ea24941c26a48f2cec24e0bb0e8cc7
The original rowKey 3001 becomes, after SHA1:
49042c54de64a1e9bf0b33e00245660ef92dc7bd
The original rowKey 5001 becomes, after SHA1:
7b61dec07e02c188790670af43e717f0f46e8913
Before doing this, we usually sample the dataset to decide which hashed rowKeys to use as the boundary values for each partition.
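
A minimal sketch of this hashing step in Java, using the JDK's MessageDigest (the method name sha1RowKey is illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Turn an original rowkey such as "1001" into its SHA1 hex digest
public static String sha1RowKey(String rowKey) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    byte[] digest = md.digest(rowKey.getBytes(StandardCharsets.UTF_8));
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
        sb.append(String.format("%02x", b & 0xff));  // two hex chars per byte
    }
    return sb.toString();
}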

2. String reverse

20170524000001 becomes 10000042507102
20170524000002 becomes 20000042507102

To a certain extent, this also spreads out data that is inserted sequentially.
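
In Java this is a one-line StringBuilder reverse:

// Reverse a timestamp-style rowkey so the fast-changing digits come first
String reversed = new StringBuilder("20170524000001").reverse().toString();
// reversed is now "10000042507102"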

3. String concatenation

20170524000001_a12e
20170524000001_93i7
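
A sketch of producing such keys, assuming a random four-hex-character suffix like the examples above:

import java.util.concurrent.ThreadLocalRandom;

// Append a random 4-hex-char salt, e.g. "20170524000001_a12e"
String salted = "20170524000001" + "_"
        + String.format("%04x", ThreadLocalRandom.current().nextInt(0x10000));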

4. Memory optimization

HBase incurs heavy memory overhead at runtime; after all, Tables can be cached in memory. In general, about 70% of the available memory is allocated to HBase's Java heap. A very large heap is not recommended, however, because if a GC pause runs too long, the RegionServer stays unavailable for that whole time. Generally 16~48 GB of memory is enough. Note also that if the framework's memory footprint is so high that the system itself runs short of memory, the framework will be dragged down along with the system services.
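
The heap is typically set in conf/hbase-env.sh. A sketch, assuming a 16 GB heap (the value is an example, not a one-size-fits-all recommendation):

# conf/hbase-env.sh
export HBASE_HEAPSIZE=16384   # in MB on older releases; newer ones also accept 16G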

5. Basic optimization

1. Allow appending content to HDFS files

hdfs-site.xml, hbase-site.xml

Property: dfs.support.append
Explanation: Enables HDFS append sync, which pairs well with HBase data synchronization and persistence. Default value: true.
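
In the configuration file this takes the standard Hadoop XML form; every property below is set the same way:

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>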

2. Optimize the maximum number of open files allowed by DataNode

hdfs-site.xml

Property: dfs.datanode.max.transfer.threads
Explanation: HBase usually operates on a large number of files at the same time. Set this to 4096 or higher, depending on the cluster's size and data workload. Default value: 4096.

3. Optimize the wait time for high-latency data operations

hdfs-site.xml

Property: dfs.image.transfer.timeout
Explanation: If a particular operation has very high latency and the socket needs to wait longer, it is recommended to set this to a larger value (default 60000 ms) to ensure the socket does not get timed out.

4. Optimize data writing efficiency

mapred-site.xml

Properties:
mapreduce.map.output.compress
mapreduce.map.output.compress.codec
Explanation: Enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.

5. Set the number of RPC listeners

hbase-site.xml

Property: hbase.regionserver.handler.count
Explanation: Default value: 30. Specifies the number of RPC listeners, which can be adjusted to the number of client requests; increase it when there are many read and write requests.

6. Optimize HStore file size

hbase-site.xml

Property: hbase.hregion.max.filesize
Explanation: Default value: 10737418240 (10 GB). If you need to run HBase MapReduce jobs, consider lowering this value, because one region corresponds to one map task: if a single region is too large, its map task takes too long to run. The value means that once an HFile reaches this size, the Region is split in two.

7. Optimize HBase client cache

hbase-site.xml

Property: hbase.client.write.buffer
Explanation: Specifies the HBase client write-buffer size. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. In general, set a buffer size that achieves the goal of reducing the RPC count.
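
A sketch of using the buffer from the Java client (the table name staff1 and the 4 MB size are illustrative; the code assumes it runs inside a method that throws IOException):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.client.write.buffer", 4 * 1024 * 1024L);  // 4 MB client buffer
try (Connection conn = ConnectionFactory.createConnection(conf);
     BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("staff1"))) {
    // Puts passed to mutator.mutate(...) are buffered client-side and
    // flushed in batches, which is what reduces the RPC count
}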

8. Specify the number of rows scan.next fetches from HBase

hbase-site.xml

Property: hbase.client.scanner.caching
Explanation: Specifies the default number of rows fetched by the scan.next method. The larger the value, the more memory is consumed.
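
The same setting can be overridden per scan in the Java client; a sketch (500 rows is an illustrative value):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
// Each scanner RPC now fetches up to 500 rows: fewer round trips,
// at the cost of more client-side memory
scan.setCaching(500);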

9. Flush, compaction, and split mechanisms. When a MemStore reaches its threshold, its data is flushed into a StoreFile; the compaction mechanism merges the flushed small files into one large StoreFile; split means that once a Region reaches its threshold, the oversized Region is divided into two.

Properties involved:

hbase.hregion.memstore.flush.size:134217728

That is, 128 MB is the default MemStore flush threshold.

This parameter causes all MemStores of an HRegion to be flushed when the total size of the MemStores in that single HRegion exceeds the specified value. The RegionServer handles flushes asynchronously by adding requests to a queue, mimicking a producer-consumer model. This creates a problem: when the queue cannot be consumed in time and a large backlog of requests builds up, memory may spike suddenly, in the worst case triggering an OOM.

hbase.regionserver.global.memstore.upperLimit:0.4 
hbase.regionserver.global.memstore.lowerLimit:0.38

When the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files. Flushing proceeds in descending order of MemStore size until the memory used by MemStores drops below the lowerLimit.
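
In hbase-site.xml these thresholds take the usual form; a sketch restating the defaults discussed above:

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>  <!-- 128 MB per-MemStore flush threshold -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>  <!-- flush when MemStores reach 40% of heap -->
</property>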
