Notes on a bug-handling experience

    This time, the components involved in the problem are a bit more complicated, so here is a general introduction first. The system provides vehicle-information queries. Our team is responsible for the big-data module, while another group is responsible for the checkpoints (traffic capture points): they extract the data from each checkpoint and insert it into HBase. As records are written into HBase, a coprocessor forwards them to an ActiveMQ (AMQ) message queue, and a consumer then reads the queue to build a Solr index. That is the architecture in outline; a rough sketch of the coprocessor write path follows.
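    The article does not show the coprocessor itself, so purely as an illustration of this write path, here is a minimal sketch of a RegionObserver (HBase 1.x API) that publishes each written row key to ActiveMQ. The class name, broker URL, and queue name are all assumptions for illustration, not the actual code:

import java.io.IOException;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical coprocessor: after each Put, publish the row key to AMQ
// so a downstream consumer can index the record into Solr.
public class AmqForwardObserver extends BaseRegionObserver {
    private javax.jms.Connection conn;
    private Session session;
    private MessageProducer producer;

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        try {
            // Broker URL and queue name are assumptions for illustration.
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://amq-host:61616");
            conn = factory.createConnection();
            conn.start();
            session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            producer = session.createProducer(session.createQueue("hbase.index"));
        } catch (JMSException e) {
            throw new IOException("failed to connect to AMQ", e);
        }
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        try {
            // Send only the row key; the consumer re-reads the row before indexing.
            producer.send(session.createTextMessage(Bytes.toString(put.getRow())));
        } catch (JMSException e) {
            throw new IOException("failed to publish to AMQ", e);
        }
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        try {
            if (conn != null) conn.close();
        } catch (JMSException e) {
            throw new IOException(e);
        }
    }
}

    Sending only the row key keeps the messages small and lets the consumer fetch the full row itself; that is one plausible design, not necessarily the one actually used here.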

    Against this background, a data problem appeared around New Year's Day: highway-checkpoint data from before New Year's Day could be queried through Solr, but data from after New Year's Day could not. When we first found the problem, we confidently assumed the data extraction must be at fault, but investigation showed that the highway-checkpoint data had indeed been extracted. So we checked component by component, starting with HBase. We could not tell exactly which records were missing; we only knew the checkpoint IDs, and which of those IDs belonged to highway checkpoints, which made the query somewhat inconvenient. The easy approach I thought of first was to create an external table in Hive and map it onto the HBase table I needed to query; that way I could map only the few fields I actually cared about.

    The mapping statement is as follows:

CREATE EXTERNAL TABLE xx (
  rowkey string,
  aa string,
  bb string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
-- Keep the mapping on one line: whitespace inside the mapping string
-- becomes part of the column family/qualifier names and breaks the mapping.
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:aa,cf:bb")
TBLPROPERTIES ("hbase.table.name" = "table");

    With this external table in place, we could easily pull out the data we needed through Hive, from any Hive client, as sketched below.
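    For completeness, one way to run such a query programmatically is through the Hive JDBC driver. This is a minimal sketch; the HiveServer2 host and port, the user, and the predicate (here assuming column aa holds the checkpoint ID) are all assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical client: query the Hive external table mapped onto HBase.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT rowkey, aa, bb FROM xx WHERE aa = '12345'")) {
            while (rs.next()) {
                System.out.println(rs.getString("rowkey") + " " + rs.getString("bb"));
            }
        }
    }
}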

    The highway-checkpoint data was indeed present in HBase. Next we turned to the program that consumes the data from AMQ, which we had written ourselves; it seemed unlikely that the problem was on the consuming side, and a check of the consumer's logs showed no exceptions. To confirm that the consuming node really was not the problem, I added a condition to the code on the spot: whenever a highway-checkpoint record was encountered, print a log line, as sketched below. After restarting the consumer, the verification result was that no highway-checkpoint data was arriving at all.
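    The consumer code is not included in the article; as a rough sketch of the kind of one-off diagnostic described above (the queue name, message format, and the isHighwayCheckpoint helper are all assumptions), it might look like this:

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

// Hypothetical AMQ consumer with the one-off diagnostic described above.
public class IndexConsumer {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://amq-host:61616");
        Connection conn = factory.createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
                session.createConsumer(session.createQueue("hbase.index"));
        consumer.setMessageListener(new MessageListener() {
            @Override
            public void onMessage(Message message) {
                try {
                    String body = ((TextMessage) message).getText();
                    // Diagnostic added on the spot: print a log line whenever
                    // a highway-checkpoint record comes through the queue.
                    if (isHighwayCheckpoint(body)) {
                        System.out.println("highway checkpoint record: " + body);
                    }
                    // ... normal path: parse the record and index it into Solr ...
                } catch (JMSException e) {
                    e.printStackTrace();
                }
            }
        });
        conn.start();
        Thread.currentThread().join(); // keep the consumer alive
    }

    // Assumption: highway checkpoints are recognizable by an ID prefix.
    private static boolean isHighwayCheckpoint(String body) {
        return body.startsWith("HW");
    }
}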

    At this point the problem had been narrowed down to the coprocessor and AMQ. The colleague in charge of the coprocessor was away at the time, so the code could not be verified right away, and I had to bite the bullet and start with the coprocessor's logs. There the problem really did show up: on one of our RegionServer (RS) nodes, the log contained several of our own coprocessor log lines, roughly saying that the coprocessor had not been successfully mounted on the region. With the problem located, how to fix it? Searching further in the log turned up a ClassNotFoundException: the AMQ jar was missing. Because the coprocessor sends data to AMQ, it needs the AMQ-related jars on the classpath to make the connection. We added the AMQ jars and restarted the RegionServer on that node, and the highway-checkpoint data could once again be found in Solr.
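    The missing-jar failure only surfaced in the RegionServer logs, but the client side can at least confirm what is registered on the table. A minimal sketch using the HBase 1.x Admin API; the table name "table" here simply matches the hbase.table.name used in the Hive mapping above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// List the coprocessors registered on a table's descriptor.
public class CoprocessorCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            for (String cp : admin.getTableDescriptor(TableName.valueOf("table"))
                                  .getCoprocessors()) {
                System.out.println("registered coprocessor: " + cp);
            }
        }
    }
}

    Note that this only shows what is registered on the table descriptor; whether the class actually loaded on each region still has to be verified in the RegionServer logs, which is exactly where the ClassNotFoundException showed up here.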

    With that, the problem was solved. This investigation was, for one thing, a useful accumulation of experience; for another, it taught me the importance of standardized procedures. Everything happened because a node had been added shortly before New Year's Day, and that was exactly the node with the problem. We do have documentation on adding nodes, but it is fragmented; once a step is missed, you dig a hole for yourself. A standardized, complete process is indispensable.


