1. Garbage collection optimization
Users can set garbage collection options by adding them to HBASE_OPTS or HBASE_REGIONSERVER_OPTS in the hbase-env.sh file. The latter affects only the region server process and is the recommended place for these settings.
Increase the size of the young (new) generation to reduce how often it is collected:
-XX:MaxNewSize=8g -XX:NewSize=8g
Modify garbage collection policy
-XX:+UseParNewGC
This sets the young generation to use the Parallel New collector, which stops the running Java process while it empties the young generation heap. Because the young generation is small compared to the old generation, this pause is short, typically no more than a few hundred milliseconds.
-XX:+UseConcMarkSweepGC
If the stop-the-world strategy described above for the young generation were also used for the old generation, the region server could pause for several seconds or even minutes. If the pause exceeds the ZooKeeper session timeout, the master considers the server to have crashed and abandons it.
The Concurrent Mark-Sweep (CMS) collector mitigates this. It tries to do as much of its work as possible concurrently, without stopping the running Java process. This increases the load on the CPU but avoids the pause needed to rewrite a fragmented old generation, unless a promotion failure occurs, which forces the collector to suspend the running Java process and defragment memory.
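To check which collectors a JVM is actually running and how much time they have consumed, the standard java.lang.management API can be queried. The following is a minimal sketch, not part of HBase; the class name GcReport is made up for illustration, and it reports on the JVM it runs in, so a region server would have to be inspected through JMX or its GC logs instead.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name): reports the collectors active in this JVM. */
final class GcReport {

    /** Returns one line per collector: its name, collection count, and accumulated time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not expose the value.
            lines.add(String.format("%s: count=%d, timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Which collector names appear depends on the JVM version and on the -XX flags it was started with.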
2. Local memstore allocation buffer
As memstores keep allocating and releasing memory, they leave holes in the old generation heap. When fragmentation becomes so severe that no contiguous block is large enough for a new allocation, the JRE falls back to a stop-the-world garbage collector, which rewrites the entire heap and compacts the surviving objects.
MSLABs (memstore-local allocation buffers) are fixed-size buffers that hold KeyValue instances of varying sizes. When a buffer cannot fit a newly added KeyValue, it is considered full and a new buffer of the same fixed size is created. When these buffers are eventually reclaimed, they leave fixed-size holes in the heap, which later fixed-size allocations can reuse, so the JRE does not need a stop-the-world compaction to reclaim heap memory.
MSLAB also has side effects: it wastes more heap space, and copying data into the buffers is extra work, which makes it slightly slower than using KeyValue instances directly.
MSLAB is controlled by hbase.hregion.memstore.mslab.enabled in hbase-site.xml. The default value is true.
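The buffer mechanism described above can be sketched with a toy allocator. This is not HBase's implementation. The class ToyMslab, its method names, and the choice to reject values larger than a chunk are all assumptions made for illustration. The point is only that every heap allocation has the same size, whatever the size of the values copied in.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a memstore-local allocation buffer (hypothetical, not HBase code). */
final class ToyMslab {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int offsetInCurrent;

    ToyMslab(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    /** Copies data into the current chunk and returns the index of the chunk used. */
    int copyIn(byte[] data) {
        if (data.length > chunkSize) {
            // Simplification of this sketch: oversized values are rejected.
            throw new IllegalArgumentException("value larger than a chunk");
        }
        if (chunks.isEmpty() || offsetInCurrent + data.length > chunkSize) {
            // The current chunk is treated as full; the leftover space is the wasted heap.
            chunks.add(new byte[chunkSize]); // every allocation has the same fixed size
            offsetInCurrent = 0;
        }
        byte[] current = chunks.get(chunks.size() - 1);
        System.arraycopy(data, 0, current, offsetInCurrent, data.length); // the extra copy
        offsetInCurrent += data.length;
        return chunks.size() - 1;
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

With a chunk size of 10, two 6-byte values land in two different chunks, leaving 4 bytes unused in the first one. That unused space is the heap waste mentioned above, and the arraycopy call is the additional copying cost.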
3. Compression
Unless you are storing content that is already compressed, such as JPEG images, compression usually improves performance, because the CPU time spent compressing and decompressing is less than the time needed to read and write the larger uncompressed data from disk.
Algorithm | Compression ratio (%) | Compress (MB/s) | Decompress (MB/s) |
GZIP | 13.4 | 21 | 118 |
LZO | 20.5 | 135 | 410 |
Zippy/Snappy | 22.2 | 172 | 409 |
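By default, HBase does not compress store files. To check a table's current setting, run describe 'tablename' in the HBase shell and look at the COMPRESSION attribute of each column family.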
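The caveat about already compressed content can be observed with the JDK's own GZIP implementation. The sketch below is illustrative only: CompressionDemo and its helpers are made-up names, the repetitive buffer merely stands in for typical row data, and the random buffer stands in for data that is already compressed. It does not use or measure HBase's codecs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: compares GZIP output size for repetitive and random input. */
final class CompressionDemo {

    /** Returns the number of bytes GZIP produces for the given input. */
    static int gzipSize(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size(); // read after the stream is closed, so the trailer is included
    }

    /** Highly repetitive bytes, standing in for typical row data. */
    static byte[] repetitive(int length) {
        byte[] pattern = "row-0001,cf:col,value;".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[length];
        for (int i = 0; i < length; i++) {
            data[i] = pattern[i % pattern.length];
        }
        return data;
    }

    /** Pseudo-random bytes, standing in for content that is already compressed. */
    static byte[] random(int length) {
        byte[] data = new byte[length];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("repetitive: " + gzipSize(repetitive(n)) + " bytes from " + n);
        System.out.println("random:     " + gzipSize(random(n)) + " bytes from " + n);
    }
}
```

The repetitive input shrinks to a small fraction of its size, while the random input does not shrink at all, so the CPU time spent compressing it is wasted.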
4. Optimize splits and compactions
HBase usually handles region splitting automatically. Once a region reaches a configured size threshold, it is split in two, and the two new regions accept new data and continue to grow. When all of a user's regions grow at roughly the same rate, they tend to reach the threshold, and split, at about the same time. Each split is followed by a compaction of the region's store files, which rewrites the split data, so the simultaneous splits cause a spike in I/O known as a "split/compaction storm".
Instead of relying on automatic splitting, you can turn it off and run the split and major_compact commands manually.
To prevent automatic splitting, set hbase.hregion.max.filesize to a very large value, such as 100GB. Then write a client that calls split() and majorCompact(), run the corresponding shell commands interactively, or schedule them with cron.
Another option is to pre-split the table when creating it:
create 't1', 'f1', SPLITS => ['10', '20', '30', '40']
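Split points like the ones in the shell command above can also be computed rather than typed by hand. The following sketch assumes row keys are zero-padded decimal numbers spread uniformly over a known range. SplitPoints and evenSplits are made-up names, and the code only produces the key strings. It does not call the HBase API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical): evenly spaced split keys for numeric row keys. */
final class SplitPoints {

    /**
     * Returns regions - 1 split keys that divide [0, keySpace) into equal parts,
     * zero-padded to the given width so they sort correctly as strings.
     */
    static List<String> evenSplits(int keySpace, int regions, int width) {
        if (regions < 2 || keySpace < regions || width < 1) {
            throw new IllegalArgumentException("need regions >= 2, keySpace >= regions, width >= 1");
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            long boundary = (long) keySpace * i / regions; // long avoids int overflow
            splits.add(String.format("%0" + width + "d", boundary));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five regions over keys 00..49 gives the split points used in the shell example.
        System.out.println(evenSplits(50, 5, 2));
    }
}
```

Zero-padding matters because HBase compares row keys as bytes: without it, '9' would sort after '10'.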
5. Load Balancing
The master has a built-in feature called the balancer, which by default runs every 5 minutes (set via hbase.balancer.period). Once the balancer starts, it tries to distribute regions evenly across all region servers. The user can change the on or off state of the balancer through the balance_switch command of the shell.
In addition to relying on the balancer to do the work automatically, users can also use the move command to explicitly move a region to another region server.
6. Merge regions
When a user deletes a large amount of data and wants to reduce the number of regions managed by each server, the merge_region command can be used to merge adjacent regions.
7. Client API Best Practices
Disable auto flush
table.setAutoFlush(false);
Use scanner caching
scan.setCaching(1000);
Limit the scan range
Restrict the scan to the column family you actually need wherever possible.
Close ResultScanner
Always close the ResultScanner in the finally block of a try/catch/finally statement.
Control block caching
scan.setCacheBlocks(): keep block caching enabled for frequently accessed rows. For large one-off scans, setCacheBlocks(false) avoids pushing those rows out of the block cache.
Optimize row key retrieval
When you only need the row keys, use Scan's setFilter() method to add a FilterList with MUST_PASS_ALL that contains two filters, FirstKeyOnlyFilter and KeyOnlyFilter. With this combination, only the first KeyValue of each row, with its value stripped, is returned to the client.
Turn off WAL on put
When the data does not need strong durability guarantees, call Put's setWriteToWAL(false) to skip the WAL. Edits that have not yet been flushed to disk are lost if the region server fails.
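Why the auto-flush tip at the start of this section helps can be shown with a toy model of a client-side write buffer. This is a sketch under simplifying assumptions, not the HBase client: ToyWriteBuffer is a made-up class, it flushes after a fixed number of puts (the real client buffers by size in bytes), and a counter stands in for the RPC to the region server.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a client-side write buffer (hypothetical, not the HBase client). */
final class ToyWriteBuffer {
    private final int flushThreshold;
    private final List<String> pending = new ArrayList<>();
    private int roundTrips;

    /** A threshold of 1 behaves like auto flush: every put is sent immediately. */
    ToyWriteBuffer(int flushThreshold) {
        if (flushThreshold < 1) {
            throw new IllegalArgumentException("flushThreshold must be >= 1");
        }
        this.flushThreshold = flushThreshold;
    }

    void put(String row) {
        pending.add(row);
        if (pending.size() >= flushThreshold) {
            flush();
        }
    }

    /** Sends everything buffered so far in one simulated round trip. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        roundTrips++; // stands in for one RPC carrying the whole batch
        pending.clear();
    }

    int roundTrips() {
        return roundTrips;
    }

    /** Counts the round trips needed to write the given number of puts. */
    static int roundTripsFor(int puts, int flushThreshold) {
        ToyWriteBuffer buffer = new ToyWriteBuffer(flushThreshold);
        for (int i = 0; i < puts; i++) {
            buffer.put("row-" + i);
        }
        buffer.flush(); // final explicit flush, as a client must do before closing
        return buffer.roundTrips();
    }
}
```

In this model, 1000 puts cost 1000 round trips when every put is flushed immediately and 10 when they are sent in batches of 100. The trade-off is that buffered puts stay on the client until the next flush, so they are lost if the client crashes first.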