A record of a brute-force fix for an HBase RIT problem in production

1. Phenomenon:

Last night the cluster was overloaded and ran short of memory, which caused an HBase RegionServer to go down.
Right after that, the CDH HBase Master (active) node turned red with the message: HBase Regions In Transition Over Threshold.
At that point I knew I had run into HBase RIT again.

2. Common solutions:

2.1 Restart HBase: after trying twice, the HBase Master (active) node was still red

Although we could still connect to HBase, queries (via the DBeaver tool + Phoenix) were very slow, and
an error was thrown: Cache of region boundaries are out of date.
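Besides the CDH alert, the regions stuck in transition can also be listed from the command line; a minimal sketch, assuming the hbase shell client is available on the node:

# Print the cluster status, including the current list of regions in transition
echo "status 'detailed'" | hbase shell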

2.2 On the master node, switch to the user that runs the hbase process

su - hbase
hbase hbck -fixAssignments
The -fixAssignments option is meant to repair errors in region assignments, but I watched the number of RIT regions keep growing. I also tried hbase hbck and hbase hbck -repair, terminating those commands promptly; even after a long wait, the problem shown in the logs was still not resolved.
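In hindsight, a safer sequence is to run hbck in read-only mode first and only then pick fix options; a rough sketch, assuming the hbck1 syntax that ships with this HBase 1.x / CDH generation:

su - hbase
# Read-only consistency report; lists problem regions per table without changing anything
hbase hbck -details
# Apply targeted fixes only after reviewing the report, e.g.:
# hbase hbck -fixAssignments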

2.3 Check the master node log; the following line keeps appearing
2018-08-21 09:50:47,924 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoop49%2C60020%2C1534734073978-splitting%2Fhadoop49%252C60020%252C1534734073978.null0.1534762936638=last_update = 1534816154977 last_version = 22 cur_worker_name = hadoop47,60020,1534815723497 status = in_progress incarnation = 2 resubmits = 2 batch = installed = 1 done = 0 error = 0}

Because this log line refreshes very quickly and is only at INFO level, I didn't pay attention to it at first.
Later, by going through the HBase Master web UI, I found that:

  • a. The regions stuck in RIT are all on the hadoop49 machine

  • b. The master log also keeps showing the WAL-splitting task for the hadoop49 machine,

    always stuck in the in_progress state
    (I watched for almost 10 minutes; the INFO line kept refreshing with this same state; see the sketch after this list)
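Two quick ways to confirm the stuck split task from the command line are sketched below; the master log path is an assumption and depends on how your CDH installation names its log files:

# Filter the master log for the stuck WAL-splitting task (log path/name is a guess, adjust to your setup)
grep SplitLogManager /var/log/hbase/*MASTER*.log* | tail
# List the outstanding split tasks recorded in ZooKeeper (znode path taken from the log line above)
hbase zkcli ls /hbase/splitWAL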


3. Brute-force resolution:

3.1 Use the hdfs command directly to locate the stuck splitting WAL, then rm it (the file only goes to the HDFS trash)
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs dfs -ls hdfs://nameservice1/hbase/WALs/*splitting
Found 1 items
-rw-r--r--   3 hbase hbase   21132987 2018-08-20 19:02 hdfs://nameservice1/hbase/WALs/hadoop49,60020,1534734073978-splitting/hadoop49%2C60020%2C1534734073978.null0.1534762936638
hadoop36:hdfs:/var/lib/hadoop-hdfs:>
hadoop36:hdfs:/var/lib/hadoop-hdfs:>
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs dfs -rm hdfs://nameservice1/hbase/WALs/hadoop49,60020,1534734073978-splitting/hadoop49%2C60020%2C1534734073978.null0.1534762936638
18/08/21 12:46:15 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/hbase/WALs/hadoop49,60020,1534734073978-splitting/hadoop49%2C60020%2C1534734073978.null0.1534762936638' to trash at: hdfs://nameservice1/user/hdfs/.Trash/Current/hbase/WALs/hadoop49,60020,1534734073978-splitting/hadoop49%2C60020%2C1534734073978.null0.1534762936638
hadoop36:hdfs:/var/lib/hadoop-hdfs:>
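Since -rm only moves the file into the HDFS trash (as the output above shows), the deletion can still be undone before the trash is purged; a sketch using the exact paths printed above:

# Move the WAL back out of the trash if the deletion turns out to be a mistake
hdfs dfs -mv \
  hdfs://nameservice1/user/hdfs/.Trash/Current/hbase/WALs/hadoop49,60020,1534734073978-splitting/hadoop49%2C60020%2C1534734073978.null0.1534762936638 \
  hdfs://nameservice1/hbase/WALs/hadoop49,60020,1534734073978-splitting/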
3.2 Restart HBase and wait a while; everything returns to normal, and HBase is serving external requests again.
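Before declaring the cluster healthy, it is worth checking that no -splitting directories are left behind and that hbck sees no inconsistencies; a small sketch along those lines:

# There should be no leftover *-splitting directories under the WAL root
hdfs dfs -ls hdfs://nameservice1/hbase/WALs/ | grep splitting
# A read-only hbck run should report 0 inconsistencies at the end of its summary
hbase hbck | tail -n 5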
3.3 Because we deleted the HLog (WAL) file, some data was inevitably lost, so we used our MCP real-time middleware and its web interface to set up a custom data-replay job covering last night's failure window (19:00~21:00) to restore the data.





Origin blog.51cto.com/15060465/2679394