HBase performance tuning (2)

Thanks for the likes and attention, a little progress every day! Come on!


Copyright statement: This article is an original article by the CSDN blogger "Driving a tractor home" and is licensed under CC 4.0 BY-SA. Please include the original source link and this statement when reprinting.

HBase performance tuning (2) - Driving a tractor home's blog - CSDN Blog

HBase performance tuning (1) - Driving a tractor home's blog - CSDN Blog

Table of contents

1. General optimization

2. Linux optimization

3. HBase optimization

1. Modify the zookeeper configuration: zookeeper.session.timeout

2. Modify the HBase configuration: hbase.regionserver.handler.count

3. Modify the HBase configuration: hbase.hregion.max.filesize

4. Modify the HBase configuration: hbase.regionserver.global.memstore.upperLimit

5. Modify the HBase configuration: hfile.block.cache.size

6. Modify the HBase configuration: hbase.hstore.blockingStoreFiles

7. Modify the HBase configuration: hbase.hregion.memstore.block.multiplier

8. Modify the HBase configuration: hbase.hregion.memstore.mslab.enabled

9. Optimize the maximum number of open files allowed by DataNode

10. Optimize the waiting time of data operations with high latency

11. Optimize data writing efficiency (whether the map output is compressed)

12. Optimize DataNode storage

13. Optimize hbase client cache


1. General optimization


1. Put the NameNode metadata backup on SSD, and back up the metadata on the NameNode regularly, every hour or every day. If the data is extremely important, back it up every 5-10 minutes. The backup can be done by copying the metadata directory with a scheduled task.

2. Specify multiple metadata directories for the NameNode with dfs.name.dir or dfs.namenode.name.dir, for example one on a local disk and one on a network disk. This provides metadata redundancy and robustness against failures.

3. Set dfs.namenode.name.dir.restore to true to allow the NameNode to try to restore a previously failed dfs.namenode.name.dir directory. The attempt is made when a checkpoint is created. If multiple disks are configured, it is recommended to enable this option (see the hdfs-site.xml sketch after this list).

4. The NameNode must be configured with a RAID1 (mirrored disk) structure.

5. Keep enough space in the NameNode log directory; these logs help you locate problems.

6. Because Hadoop is an I/O-intensive framework, try to improve storage speed and throughput (similar to bit width).
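
For reference, a minimal hdfs-site.xml sketch for points 2 and 3 above. The directory paths are placeholders for illustration only; substitute your own local and network mount points.

<!-- Two metadata directories: one on a local disk, one on a network disk (paths are illustrative) -->
<property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/local/namenode,/mnt/remote/namenode</value>
</property>

<!-- Try to restore a previously failed metadata directory when a checkpoint is created -->
<property>
        <name>dfs.namenode.name.dir.restore</name>
        <value>true</value>
</property>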


2. Linux optimization


1. Enable the file system read-ahead cache to improve read speed

sudo blockdev --setra 32768 /dev/sda
(Tip: "ra" is short for readahead)

2. Minimize process swapping (swappiness)

# Temporarily disable
sudo sysctl -w vm.swappiness=0

# Permanently disable: append the following line to /etc/sysctl.conf
vm.swappiness = 0

Check whether it has taken effect with cat /etc/sysctl.conf or sysctl vm.swappiness.

3. Raise the ulimit limits; the default values are relatively small

ulimit -n    # view the maximum number of open files allowed
ulimit -u    # view the maximum number of processes allowed

To modify:
sudo vi /etc/security/limits.conf    # change the open-file limit
Append at the end:
*                soft    nofile          1024000
*                hard    nofile          1024000
Hive             -       nofile          1024000
hive             -       nproc           1024000

sudo vi /etc/security/limits.d/20-nproc.conf    # change the per-user process limit
Change to:
#*          soft    nproc     4096
#root       soft    nproc     unlimited
*          soft    nproc     40960
root       soft    nproc     unlimited

3. HBase optimization


1. Modify the zookeeper configuration: zookeeper.session.timeout

zookeeper.session.timeout = 180000 ms (default: 3 minutes)

Description: the session timeout between the RegionServer and ZooKeeper. When the timeout expires, the RegionServer is removed from the RS cluster list by ZooKeeper. After the HMaster receives the removal notice, it rebalances the regions this server was responsible for and lets other surviving RegionServers take them over.

Tuning method :

This timeout determines whether a RegionServer can fail over in time. Setting it to 1 minute or lower reduces the failover time that would otherwise be extended by waiting for the timeout.

However, for some online applications, the time from a RegionServer going down to its recovery is very short (for example, a brief network interruption or a crash that is quickly restarted), and lowering the timeout then does more harm than good. Once the RegionServer is officially removed from the RS cluster, the HMaster starts balancing (letting other RegionServers recover data according to the WAL logs recorded by the failed machine). When the failed RS is brought back by manual intervention, this balancing action is pointless; it makes the load uneven and puts extra burden on the other RSs, especially in scenarios with a fixed allocation of regions.
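
As a sketch only, lowering the timeout in hbase-site.xml could look like the following; the value 60000 ms (1 minute) simply follows the suggestion above and is not a recommendation for every cluster.

<!-- Session timeout between the RegionServer and ZooKeeper, in milliseconds -->
<property>
        <name>zookeeper.session.timeout</name>
        <value>60000</value>
</property>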

2. Modify the HBase configuration: hbase.regionserver.handler.count

hbase.regionserver.handler.count = 10 (default value)
Description: the number of I/O threads the RegionServer uses to handle requests.

Tuning method:

This tuning parameter is closely related to memory.

Fewer I/O threads are suitable for BIG PUT scenarios in which a single request consumes a lot of memory (a large single PUT, or a scan configured with a large cache, both count as BIG PUTs), or for scenarios where RegionServer memory is tight.

More I/O threads are suitable for scenarios where a single request consumes little memory and the TPS requirement is very high. When setting this value, use memory monitoring as the main reference.

Note: If the RegionServer has few regions and a large number of requests fall on one region, the read-write lock caused by the flush triggered when the memstore fills up quickly will affect global TPS, so a higher I/O thread count is not always better.
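
A hedged hbase-site.xml example: the value 30 is purely illustrative for a high-TPS workload with small requests; tune it while monitoring RegionServer memory.

<!-- Number of RPC handler (IO) threads on the RegionServer; 30 is an illustrative value -->
<property>
        <name>hbase.regionserver.handler.count</name>
        <value>30</value>
</property>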

3. Modify the HBase configuration: hbase.hregion.max.filesize

hbase.hregion.max.filesize = 10G (default value)
Description: the maximum storage size of a single Region on the current RegionServer; when a Region exceeds this value, it is automatically split into smaller regions.

Note: The default value is 10737418240 (10 GB). If you need to run MapReduce jobs over HBase, you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task runs for too long. In other words, when an HFile of a region grows to this size, the region is split into two regions.

Tuning method:

Small regions are friendly to split and compaction, because splitting or compacting the storefiles of a small region is fast and uses little memory. The disadvantage is that splits and compactions become very frequent. Especially when there are many small regions, constant splitting and compaction makes the cluster response time fluctuate greatly; too many regions are also troublesome to manage and can even trigger HBase bugs. Generally, regions below 512 MB are considered small.

Large regions are not suited to frequent split and compaction, because a compaction or split causes a long pause, which has a large impact on the application's read and write performance. In addition, large regions mean larger storefiles, and compacting a large region is also a challenge for memory.

Of course, large regions have their uses. If the access volume of your application is low at certain times, performing compaction and split only during those periods not only lets them complete successfully, but also guarantees stable read and write performance most of the time.

Since split and compaction affect performance so much, how to reduce this impact?

Compaction cannot be avoided, but split can be changed from automatic to manual.

By increasing this parameter to a value that is hard to reach, such as 100 GB, automatic split can be disabled indirectly (the RegionServer will not split regions that have not reached 100 GB). Combined with the RegionSplitter tool, split manually whenever a split is needed.

Manual split is much more flexible and stable than automatic split, and the management cost does not increase much. It is recommended for online real-time systems, where smooth read and write performance matters.

In terms of memory, a small region gives more flexibility in setting the memstore size, whereas with a large region the memstore can be neither too large nor too small: if it is too large, the application's I/O wait increases during flushes; if it is too small, read performance suffers because there are too many store files.

In Ambari-HDP, the default maximum StoreFile size for HBase is 10 GB.

Too many small Regions will frequently trigger minor Compaction.
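
Following the advice above, a sketch that indirectly disables automatic split by raising the limit to 100 GB (the value is in bytes); this assumes you will then split manually when needed, for example with the RegionSplitter tool.

<!-- Maximum size of a single Region; 100 GB effectively disables automatic split -->
<property>
        <name>hbase.hregion.max.filesize</name>
        <value>107374182400</value>
</property>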

4. Modify the HBase configuration: hbase.regionserver.global.memstore.upperLimit

hbase.regionserver.global.memstore.upperLimit / lowerLimit
Default value: 0.4


upperLimit description: the related parameter hbase.hregion.memstore.flush.size works at the Region level: when the total size of all memstores in a single Region exceeds that value, all memstores in the Region are flushed. The RegionServer handles flushes asynchronously by adding requests to a queue, following a producer-consumer pattern. The problem is that when the queue cannot be consumed in time and a large backlog of requests builds up, memory may rise sharply and, in the worst case, trigger an OOM (the program requests more memory than the JVM can provide, and the process dies).

The purpose of this parameter is to prevent excessive memory usage: when the total memory occupied by the memstores of all regions on a RegionServer reaches 40% of the heap, HBase forcibly blocks all updates and flushes these regions to release the memory occupied by all memstores.

lowerLimit description: lowerLimit does not flush all memstores when the memory occupied by all regions' memstores reaches 40% of the heap. Instead, it finds the region whose memstore uses the most memory and flushes it individually; write updates are still blocked at this point. lowerLimit can be regarded as a remedial measure taken before all regions are force-flushed and performance degrades. In the logs it appears as "** Flush thread woke up with memory above low water."

Tuning method:

This is a Heap memory protection parameter, and the default value can be used in most scenarios.

Adjusting this parameter affects both reads and writes. If write pressure often exceeds this threshold, reduce the read cache hfile.block.cache.size so the threshold can be raised, or leave the read cache alone if there is plenty of heap headroom.
If the threshold is not exceeded even under high write pressure, it is recommended to lower the threshold appropriately and run a load test to make sure it is not triggered too often; then, if there is still plenty of heap headroom, increase hfile.block.cache.size to improve read performance.
Another possibility is that hbase.hregion.memstore.flush.size stays unchanged but the RS hosts too many regions; keep in mind that the number of regions directly affects the amount of memory occupied.
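
A minimal sketch using the parameter names from this article (they apply to older HBase versions; newer releases renamed them to hbase.regionserver.global.memstore.size and its lower-limit counterpart). The lowerLimit value of 0.38 is illustrative, chosen slightly below the upperLimit.

<!-- Upper bound on the total memstore memory as a fraction of the heap -->
<property>
        <name>hbase.regionserver.global.memstore.upperLimit</name>
        <value>0.4</value>
</property>

<!-- Low water mark: flush the largest memstores until usage falls below this fraction (0.38 is illustrative) -->
<property>
        <name>hbase.regionserver.global.memstore.lowerLimit</name>
        <value>0.38</value>
</property>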

5. Modify the HBase configuration: hfile.block.cache.size

<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>

Description: the percentage of the heap used as the storefile read cache. This value directly affects read performance.

Tuning method:

Of course, for reads the bigger the better. If writes are much fewer than reads, setting it to 0.4-0.5 is fine; if reads and writes are balanced, set it to about 0.3; if writes outnumber reads, simply keep the default. When setting this value, also refer to hbase.regionserver.global.memstore.upperLimit, the maximum percentage of the heap that memstores may occupy: one of the two parameters affects reads and the other affects writes. If the two values add up to more than 0.8-0.9, there is a risk of OOM.

In HBase, the RegionServer memory is divided into two parts: one part serves as the Memstore, mainly used for writes; the other serves as the BlockCache, mainly used for reads.

Write requests are written to the Memstore first, and the RegionServer provides a Memstore for each Region. When a Memstore reaches 64 MB, it starts flushing to disk. When the total size of all Memstores exceeds the limit (heapsize * hbase.regionserver.global.memstore.upperLimit * 0.9), the flush process is forcibly started, beginning with the largest Memstore, until the total falls below the limit.

A read request first checks the Memstore; if the data is not found there, it checks the BlockCache; if it is still not found, it reads from disk and the result is placed into the BlockCache. Because the BlockCache uses an LRU strategy (a page-replacement algorithm that evicts blocks that are in memory but least recently used), once the BlockCache reaches its upper limit (heapsize * hfile.block.cache.size * 0.85), the eviction mechanism kicks in and evicts the oldest batch of data.

There is one BlockCache and N Memstores on a RegionServer, and the sum of their sizes must not be greater than or equal to heapsize * 0.8 (heapsize is configured in hbase-env.sh), otherwise HBase cannot start. The defaults are 0.2 for the BlockCache and 0.4 for the Memstore. For systems that focus on read response time, you can set the BlockCache larger, for example BlockCache = 0.4 and Memstore = 0.39, to increase the cache hit rate.
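
A sketch of the read-heavy configuration mentioned above (BlockCache = 0.4, Memstore = 0.39); the two fractions add up to just under the 0.8 limit, and they should still be validated against your own heap size and workload.

<!-- Read-heavy tuning: larger BlockCache, slightly smaller memstore share -->
<property>
        <name>hfile.block.cache.size</name>
        <value>0.4</value>
</property>

<property>
        <name>hbase.regionserver.global.memstore.upperLimit</name>
        <value>0.39</value>
</property>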

6. Modify the HBase configuration: hbase.hstore.blockingStoreFiles

hbase.hstore.blockingStoreFiles = 10 (default value)

Description: during a flush, when a Store (column family) in a region has more than 10 storefiles, all write requests to that region are blocked and a compaction is triggered to reduce the number of storefiles.
Tuning method: blocking write requests seriously affects the response time of the current RegionServer, while too many storefiles also hurt read performance. From a practical point of view, to obtain smoother response times the value can be set very large. If you can tolerate large peaks and troughs in response time, keep the default or adjust it according to your own scenario.
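
An illustrative hbase-site.xml sketch for the "smoother response time" option described above; 100 is an arbitrary large value rather than a recommendation, and it trades read performance for fewer write stalls.

<!-- Number of storefiles per Store above which writes are blocked; 100 is an illustrative large value -->
<property>
        <name>hbase.hstore.blockingStoreFiles</name>
        <value>100</value>
</property>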


7. Modify the HBase configuration: hbase.hregion.memstore.block.multiplier

hbase.hregion.memstore.block.multiplier = 4 (default value)

Explanation: when the memstore of a region occupies more than four times hbase.hregion.memstore.flush.size, write operations are blocked, and a flush is performed to release memory. Although we set a total memstore size for the region, say 128 MB, imagine that at 127.9 MB a 400 MB put arrives: the memstore size suddenly jumps to several times the expected hbase.hregion.memstore.flush.size. The role of this parameter is that when the memstore grows beyond 4 times hbase.hregion.memstore.flush.size, write requests are blocked to keep the risk from growing further.

Tuning method:

The default value of this parameter is quite reliable. If you estimate that your normal application scenario (excluding exceptions) will not have write bursts, or that the write volume is controllable, keep the default. If under normal circumstances your write request volume often grows to several times the normal level, you should increase this multiplier and adjust other related values, such as hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit/lowerLimit (but the sum of those two values should not exceed 0.8-0.9), to reserve more memory and prevent the HBase server from running out of memory.
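
A hedged sketch for a bursty-write workload: the multiplier of 8 is illustrative, and the 128 MB flush size matches the example in the explanation above (the value is in bytes).

<!-- Per-region memstore flush threshold (128 MB, as in the example above) -->
<property>
        <name>hbase.hregion.memstore.flush.size</name>
        <value>134217728</value>
</property>

<!-- Block writes when a region's memstore exceeds flush.size * multiplier; 8 is an illustrative increase over the default 4 -->
<property>
        <name>hbase.hregion.memstore.block.multiplier</name>
        <value>8</value>
</property>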

8. Modify the HBase configuration: hbase.hregion.memstore.mslab.enabled

hbase.hregion.memstore.mslab.enabled = true (default value)

Description: Reduce Full GC caused by memory fragmentation and improve overall performance.

Tuning method:

Arena Allocation is a GC optimization technique that can effectively reduce Full GCs caused by memory fragmentation and thereby improve overall system performance. HBase applies it in the MSLAB (MemStore-Local Allocation Buffer).

Enable MSLAB:

// Enable MSLAB
hbase.hregion.memstore.mslab.enabled=true

// Size of the memory allocation unit (chunk); larger chunks give better memory contiguity
// but lower average memory utilization
hbase.hregion.memstore.mslab.chunksize=2m

// Objects allocated through MSLAB must not exceed 256 KB; larger objects are allocated directly on the heap
hbase.hregion.memstore.mslab.max.allocation=256K

9. Optimize the maximum number of open files allowed by DataNode

Property: dfs.datanode.max.transfer.threads

File: hdfs-site.xml

Explanation: HBase generally operates on a large number of files at the same time. Set this to 4096 or higher according to the size of the cluster and the data workload. Default: 4096

<!-- Maximum number of threads the DataNode uses for file transfer -->
<property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>16384</value>
</property>


10. Optimize the waiting time of data operations with high latency

Property: dfs.image.transfer.timeout

File: hdfs-site.xml

Explanation: if a certain data operation has very high latency and the socket needs to wait longer, it is recommended to increase this value (60000 milliseconds by default) to ensure that the socket does not time out.
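
A minimal hdfs-site.xml sketch; 600000 ms (10 minutes) is only an illustrative larger value, assuming the image transfer in your cluster really does need more headroom than the 60000 ms default.

<!-- Socket timeout for image transfer, in milliseconds; 600000 is an illustrative increase over the 60000 default -->
<property>
        <name>dfs.image.transfer.timeout</name>
        <value>600000</value>
</property>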

11. Optimize data writing efficiency (whether the map output is compressed)

Properties:

mapreduce.map.output.compress
mapreduce.map.output.compress.codec

File: mapred-site.xml

Explanation: enabling these two properties can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec.

This controls whether the map output is compressed. Compression consumes more CPU but reduces transfer time; without compression, more network bandwidth is needed. It is used together with mapreduce.map.output.compress.codec, which defaults to org.apache.hadoop.io.compress.DefaultCodec; you can choose the compression codec according to your needs.
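
A sketch of the two mapred-site.xml properties described above; Gzip is the codec named in the text, though other codecs (for example Snappy) are common choices when CPU is the bottleneck.

<!-- Compress map output to reduce shuffle transfer time (at the cost of extra CPU) -->
<property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
</property>

<property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>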

12. Optimize DataNode storage

Property: dfs.datanode.failed.volumes.tolerated

File: hdfs-site.xml

Explanation: the default is 0, which means that when one disk in the DataNode fails, the DataNode is considered to have shut down. If it is changed to 1, then when one disk fails, the data is replicated to other healthy DataNodes and the current DataNode keeps working.

<!-- Number of volume failures the DataNode tolerates before it stops serving; 0 means the DataNode stops on any volume failure -->
<property>
        <name>dfs.datanode.failed.volumes.tolerated</name>
        <value>0</value>
        <final>true</final>
</property>

13. Optimize hbase client cache

Property: hbase.client.write.buffer

File: hbase-site.xml

Explanation: the default value is 2097152 bytes (2 MB). It specifies the size of the HBase client write buffer. Increasing this value reduces the number of RPC calls but consumes more memory, and vice versa. Generally, a certain buffer size is needed to reduce the number of RPCs.
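
A hedged hbase-site.xml sketch; 8388608 bytes (8 MB) is purely illustrative of trading extra client memory for fewer RPC round trips, and the buffer only comes into play when the client batches writes (for example through a BufferedMutator).

<!-- Client write buffer size in bytes; 8 MB is an illustrative increase over the 2 MB default -->
<property>
        <name>hbase.client.write.buffer</name>
        <value>8388608</value>
</property>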


Reference: https://blog.csdn.net/yueyedeai/article/details/14648111

Link: HBase performance tuning (1) - Driving a tractor home's blog - CSDN Blog


Original: blog.csdn.net/qq_35995514/article/details/131422956