A record of an HBase failure analysis and troubleshooting process

HBase is a widely used component in the big data stack. A single table can hold tens of billions of rows while still providing millisecond-level read and write latency, and throughput can be scaled out almost linearly to very high TPS. It excels in large-scale storage and query scenarios, yet in real production it can still bring unexpected trouble.

This year, a project the author supported suffered repeated outages on a large cluster running HBase, with serious impact on the business. After each outage the cluster took a long time to start up, prolonging the interruption. The most critical problem was that the outages kept recurring and the cause was unknown, so another one was highly likely. With the root cause not yet located, the failure hung over us like the sword of Damocles, and no one knew when it would fall. The following sections follow the main line of this troubleshooting and analysis process.

1. Fault clues, and an opportunity appears

The severe outage of the HBase cluster manifested as the management node, HMaster, exiting abnormally, with more than half of the RegionServers going down as well, interrupting business production. The operations team restarted the cluster following the usual procedure and then began to analyze and locate the fault. Because the log level was set to DEBUG, the logs rotated very quickly, and since they were not backed up in time after the restart, the logs from the time of the failure could not be recovered. This made finding the cause much harder. Just as everyone was stuck without the key logs and unsure how to proceed with analysis and recovery, in less than 24 hours the cluster went down again! The symptoms were identical to the first time: HMaster exited abnormally, and a large number of RegionServers went down with it.

Although the repeated failures caused serious customer dissatisfaction, they also gave us a great opportunity to uncover the root cause.

2. Troubleshooting in a thick fog

Having learned from the previous lesson, we dumped and preserved the relevant HBase logs immediately after each failure and started analysis and localization right away. The symptom of this failure was unusual: the HBase logs showed a ZooKeeper connection exception, specifically a session expiration.

At this point all attention focused on the ZooKeeper anomaly. Why did "Session expired" appear? Was something wrong with ZooKeeper itself?

HBase failures caused by ZooKeeper connection timeouts had occurred in this cluster before, and tuning measures had been applied at that time to resolve them.

So it seemed fairly certain that the fault was still on the ZooKeeper side, and we analyzed the ZK logs as quickly as possible. However, throughout the entire failure window ZK itself never showed any abnormality or downtime, which added great uncertainty to the diagnosis. With ZooKeeper apparently healthy, even though the HBase logs showed an abnormal connection to ZK, we still could not pin the problem on any specific parameter or configuration. In other words, no truly convincing root cause had been found, the case could not be closed, the system could go down again at any time, and the crisis remained.

The only option was to go back over the ZooKeeper logs for possible causes of the exception. From the ZK logs we noticed that certain hosts had an exceptionally large number of connections, on the order of millions per day. From experience, that is not normal. Was it caused by excessive access? We quickly set up comprehensive monitoring of ZK's metrics to look for connection anomalies and took two actions:

  • Investigate which programs on the hosts with abnormal connection counts were creating the connections

  • Simulate high-load connections to ZK in the lab to test its stability

However, neither lead panned out. The hosts had no special programs deployed; they were simply compute nodes of the Hadoop cluster. The internal test also pushed a million-plus connection load against ZK, and ZK showed no abnormality.
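For reference, a minimal sketch of such a high-load connection test, assuming a plain ZooKeeper Java client; the connect string, session timeout, and connection count are placeholders, not the project's actual test harness:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative stress test: open many ZooKeeper sessions and hold them.
public class ZkConnectionStressTest {
    public static void main(String[] args) throws Exception {
        String connectString = "zk1:2181,zk2:2181,zk3:2181"; // placeholder hosts
        int connections = 10_000;                            // scale up as needed
        List<ZooKeeper> clients = new ArrayList<>(connections);

        for (int i = 0; i < connections; i++) {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(connectString, 30_000, event -> {
                // Count the session as established once it reaches SyncConnected.
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            clients.add(zk);
        }
        System.out.println("Established " + clients.size() + " sessions");

        // Tear down: close every session so the ensemble is left clean.
        for (ZooKeeper zk : clients) {
            zk.close();
        }
    }
}
```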

3. Parting the clouds: the truth comes out

After a full day of hard digging, we finally saw the light!

Analyzing the logs revealed something abnormal about the sessionid of the ZooKeeper connection. HMaster maintains its connection to ZooKeeper through a session, and the log showed the sessionId as 0xff8201f4b7b63a73.

Checking the logs of all the ZK servers turned up another piece of information: the same sessionid did not belong to the HMaster host at all; it had been issued to 10.26.9.35. In other words, there was a sessionid conflict. The conflict overwrote the metadata of the original session, so the long-lived connection between the HMaster and ZK expired abnormally, the session could not be renewed, and HMaster exited.

The next step was then clear: find out why ZK's sessionids conflict.

The first step was to look at the ZooKeeper source code that generates the sessionid.
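The relevant logic sits in ZooKeeper's SessionTrackerImpl. The sketch below reconstructs the pre-fix logic from the upstream source rather than quoting the exact version deployed here:

```java
// Pre-fix sketch of session-id initialization in SessionTrackerImpl
// (reconstructed; details may differ slightly from the deployed version).
public static long initializeNextSession(long id) {
    long nextSid = 0;
    // Current time, shifted left by 24 bits and then back right by 8
    // with a SIGNED shift -- the sign bit gets propagated into the result.
    nextSid = (System.currentTimeMillis() << 24) >> 8;
    // The server id (myid) is OR-ed into the top byte so that each
    // ZK instance is supposed to start from its own session-id segment.
    nextSid = nextSid | (id << 56);
    return nextSid;
}
```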

It is clear from this code that the initial sessionid is generated from the current time combined with the ZK instance id. The question now is whether different instance ids can still end up producing the same starting id segment.

Testing this code confirmed the problem: no matter which id is passed in, the result is the same, so different ZK instances are guaranteed to generate the same initial sessionid. In other words, a bug.
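A minimal reproduction, with the time passed in as a parameter and fixed to a post-2022 value so the result is deterministic (the class name and timestamp are hypothetical, for illustration only):

```java
// Minimal reproduction of the pre-fix behaviour with a fixed timestamp.
public class SessionIdCollisionDemo {
    // Same expression as the pre-fix ZooKeeper code, with the time
    // passed in as a parameter so the result is reproducible.
    static long buggyInitializeNextSession(long currentMillis, long id) {
        long nextSid = (currentMillis << 24) >> 8; // signed shift
        return nextSid | (id << 56);
    }

    public static void main(String[] args) {
        long ms = 1669600000000L; // a timestamp in late November 2022
        for (long id = 1; id <= 3; id++) {
            // Prints the same negative value for every id: the sign
            // extension has already filled the top byte with 1s, so
            // OR-ing the server id into it changes nothing.
            System.out.printf("id=%d -> 0x%016x%n",
                    id, buggyInitializeNextSession(ms, id));
        }
    }
}
```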

Careful analysis of this code showed that the left shift overflows into the sign bit, and the signed right shift then propagates that sign. We immediately checked later versions and found that the issue had already been identified and fixed upstream.

The change is indeed tiny: >> 8 becomes >>> 8.
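Sketched out, the fixed version differs only in that one operator (again a reconstruction rather than a verbatim quote of the patched release):

```java
// Post-fix sketch: the unsigned right shift (>>>) fills the vacated
// high bits with zeros instead of the sign bit, so the server id
// OR-ed into the top byte is preserved and session-id segments no
// longer collide across instances.
public static long initializeNextSession(long id) {
    long nextSid = 0;
    nextSid = (System.currentTimeMillis() << 24) >>> 8;
    nextSid = nextSid | (id << 56);
    return nextSid;
}
```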

Shift operators are rarely used in day-to-day code, and they are indeed error-prone:

Java provides two right-shift operators: ">>" and ">>>". ">>" is the signed (arithmetic) right shift and ">>>" is the unsigned (logical) right shift. Both shift the binary representation of the operand to the right by the specified number of bits. The difference lies in the bits shifted in at the high end: with ">>", zeros are shifted in when the operand is positive and ones when it is negative, whereas ">>>" always shifts in zeros regardless of sign.
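A quick illustration of the difference on a negative long (the value is chosen purely for demonstration):

```java
// Demonstrates signed (>>) vs unsigned (>>>) right shift on a negative long.
public class ShiftDemo {
    public static void main(String[] args) {
        long v = 0x8000000000000000L;            // only the sign bit set
        System.out.printf("%016x%n", v >> 8);    // ff80000000000000 (sign-extended)
        System.out.printf("%016x%n", v >>> 8);   // 0080000000000000 (zero-filled)
    }
}
```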

Back in the ZooKeeper code, the problem is exactly this: once the current time is shifted left by 24 bits it overflows into the sign bit and becomes negative, and the signed right shift by 8 then fills the top byte with ones, wiping out the instance id that is OR-ed in afterwards. The truth was out.

The corresponding ZooKeeper issue reads:

ZOOKEEPER-1622  session ids will be negative in the year 2022

Sure enough, from 2022 onward the sessionid base goes negative, all ZK instances generate the same initial sessionid, and conflicts become inevitable.

4. Fault resolution

Since the fault is caused by a ZooKeeper bug, it is bound to recur unless the root cause is addressed, so ZK must ultimately be patched to resolve it completely. Until the patch could be sufficiently verified, the only way to reduce the risk of triggering the bug was to avoid frequent connections from applications to ZK, require the applications to be rectified accordingly, and carry out routine system and business inspections before the update.
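As an illustration of the application-side rectification, a common pattern is to share a single HBase Connection (and therefore a single ZooKeeper session) per process instead of creating one per request. The class below is a hypothetical sketch, not the project's actual code:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

// Illustrative helper: keep one shared Connection (and one ZooKeeper
// session) per JVM instead of opening a new connection per request.
public final class SharedHBaseConnection {
    private static volatile Connection connection;

    private SharedHBaseConnection() {}

    public static Connection get() throws IOException {
        if (connection == null) {
            synchronized (SharedHBaseConnection.class) {
                if (connection == null) {
                    Configuration conf = HBaseConfiguration.create();
                    connection = ConnectionFactory.createConnection(conf);
                }
            }
        }
        return connection;
    }

    // Table objects are lightweight and should be opened and closed per
    // use, while the Connection itself stays shared.
    public static Result getRow(String table, byte[] rowKey) throws IOException {
        try (Table t = get().getTable(TableName.valueOf(table))) {
            return t.get(new Get(rowKey));
        }
    }
}
```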

After observing operation over a full business cycle, the system has been running stably.

5. Experience summary

This fault location was full of twists and turns, and it took a great deal of effort to finally pin down the problem. Because it was caused by a component bug that could only be triggered from 2022 onward, nothing like it had been seen in many years of operating and maintaining Hadoop clusters.

The logs analyzed during the investigation included the HBase Master and RegionServer instance logs, the ZooKeeper instance logs from all five hosts, and the YARN scheduler run records.

This analysis also shows that a fault often has multiple contributing causes, which reflects the complexity of the big data ecosystem. Compatibility and coordination between components are critical, and they demand R&D and operations teams with strong command of the ecosystem and its source code, rather than simply stacking open source components and using them as-is.

A few takeaways:

01 After a failure occurs, preserve all logs from the scene as soon as possible

02 Dig for valuable information in the exception logs at the point of failure

03 Correlate all log files for a comprehensive analysis. Big data clusters are complex, and information from every component needs to be analyzed together to locate the real problem in such an environment.

