ResourceManager stuck: cluster jobs cannot be submitted

1. Exception description

The business team reported that Hive jobs could not be submitted. A simple test was then run in beeline, which got stuck as shown below:

1) First query: the application id of this query is application_1600160921573_0026

select count(*) from co_ft_in_out_bus_dtl;

2) Second query: this query never generated an application id even after waiting for a long time.

select count(*) from xft_sheet1;

The beeline output was stuck at the following step:

Submitting tokens for job: job_1600165270401_0014
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/[email protected], renewer=yarn, realUser=, issueDate=1600165500389, maxDate=1600770300389, sequenceNumber=20818157, masterKeyId=641)

3) At this time, pyspark execution was also slow, but putting a large file to HDFS took roughly the same time as when the cluster was healthy, indicating that HDFS itself had not slowed down.
4) The ResourceManager GC charts were checked.

The ResourceManager GC time was very long when the problem occurred, reaching 9s.

The GC time on September 10 was also found to be very long, reaching as high as 80s, although no abnormality was observed in the cluster at that time.

5) The ResourceManager JVM heap usage at the time was checked and found to be low.

The ResourceManager JVM usage has been relatively constant over the past 7 days and never reached the configured 4GB ResourceManager JVM maximum.


2. Anomaly analysis

1. To restore the business as quickly as possible, we tried restarting the ResourceManager several times, but the problem persisted. We therefore checked the ResourceManager log, starting from application_1600160921573_0026 submitted in the earlier test. In the ResourceManager log, the submitted job can be seen going through recovery over and over again:

2020-09-15 18:09:46,009 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:21:31,592 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:33:44,648 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:45:31,393 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:55:21,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1600162721525_0026 with final state: FAILED

2. At the same time, the following Zookeeper-related errors appear in the Active ResourceManager (cmsnn002) log. They show that the Active ResourceManager transitioned to the Standby state because its connection to Zookeeper was lost:

2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected

2020-09-15 16:36:00,883 WARN org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning the resource manager to standby.
2020-09-15 16:36:00,921 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032

3. Unfortunately, around 16:36 that day the other ResourceManager (cmsnn001) also appears to have been unable to enter the Active state (the relevant logs are missing, so the exact reason cannot be confirmed, but it may also have lost its Zookeeper connection). As a result, ResourceManager cmsnn002 eventually became Active again, but only after roughly 10 minutes, during which the ResourceManager was effectively down and no tasks could be submitted.

2020-09-15 16:47:09,713 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/yarnRM/ActiveBreadCrumb to indicate that the local node is the most recent active...
2020-09-15 16:47:09,718 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2020-09-15 16:47:22,879 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 10176 applications

4. Since the ResourceManager GC time had also spiked abnormally on September 10th, we went back to the ResourceManager logs from that day to extract more information. Comparing the logs from September 15th (when the problem occurred) with those from September 10th (when GC time spiked), we found the same exception in both: the ResourceManager tried to establish a connection with Zookeeper and, even after 1000 retries (1-1000), still could not connect, so it cycled endlessly through each Zookeeper node.
5. We then checked the Zookeeper log to find out why the ResourceManager could never connect to Zookeeper. At the time of the problem, the Zookeeper log contains the following error:

2020-09-15 16:25:43,431 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x17187f84f281336 due to java.io.IOException: Len error 14307055

6. Searching for this error, we found a Knowledge Base article on Cloudera's official website describing a similar Zookeeper problem. It states that the problem is related to the configured size of Zookeeper's Jute Max Buffer parameter.

https://my.cloudera.com/knowledge/Active-ResourceManager-Crashes-while-Executing-a-ZooKeeper?id=75670
7. Following the Knowledge Base article, we checked the current cluster's Zookeeper Jute Max Buffer value: it is 4MB, which is also the default. The Len error 14307055 (≈14MB) in the Zookeeper log indicates that the data Zookeeper is being asked to handle is already much larger than the default 4MB, which increases Zookeeper's load. At some point, once the Active ResourceManager disconnected from Zookeeper, it could never reconnect.
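
To double-check that the state-store data really is much larger than the 4MB limit, the RM state store can be measured directly in Zookeeper. The sketch below is only an illustration under assumptions: it uses the default ZKRMStateStore path /rmstore/ZKRMStateRoot/RMAppRoot and a placeholder connection string, and the JVM running it may itself need -Djute.maxbuffer raised above 4MB before it can read the oversized responses.

import java.util.List;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class RmStoreSizeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum address; replace with the cluster's real Zookeeper quorum.
        ZooKeeper zk = new ZooKeeper("zk01:2181,zk02:2181,zk03:2181", 30000, event -> { });
        try {
            // Default layout: one child znode per application under RMAppRoot.
            String appRoot = "/rmstore/ZKRMStateRoot/RMAppRoot";
            List<String> apps = zk.getChildren(appRoot, false);
            long nameBytes = 0, dataBytes = 0, maxDataBytes = 0;
            for (String app : apps) {
                nameBytes += app.length();
                Stat stat = zk.exists(appRoot + "/" + app, false);
                if (stat != null) {
                    dataBytes += stat.getDataLength();   // size of this app's own state data
                    maxDataBytes = Math.max(maxDataBytes, stat.getDataLength());
                }
            }
            // Either a single oversized app znode, or a reply that has to carry a huge
            // number of child names at once, can push a packet past jute.maxbuffer.
            System.out.println("applications            : " + apps.size());
            System.out.println("sum of child name bytes : " + nameBytes);
            System.out.println("sum of app data bytes   : " + dataBytes);
            System.out.println("largest app znode       : " + maxDataBytes + " bytes");
        } finally {
            zk.close();
        }
    }
}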


3. Exception resolution

1. Follow the steps below to increase the value of Zookeeper's Jute Max Buffer from the default 4MB to 32MB.

1) Open Cloudera Manager > Zookeeper > Configuration;

2) Search for Jute Max Buffer and modify the value of jute.maxbuffer from 4MB to 32MB;

3) Save the settings and restart the Zookeeper service on a rolling basis.
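
As a side note (an assumption on our part for clusters not managed by Cloudera Manager): jute.maxbuffer is ultimately just a JVM system property expressed in bytes, so outside CM it would be passed as -Djute.maxbuffer=33554432 to the Zookeeper server JVMs (and, if clients also hit the limit, to the client JVMs). The tiny sketch below only shows how a JVM resolves the property and what 32MB looks like in bytes; the 4MB fallback mirrors the cluster default described above.

public class JuteMaxBufferCheck {
    public static void main(String[] args) {
        // Reads -Djute.maxbuffer if set; otherwise falls back to 4MB for illustration.
        int effective = Integer.getInteger("jute.maxbuffer", 4 * 1024 * 1024);
        System.out.println("jute.maxbuffer in effect: " + effective + " bytes");
        System.out.println("32MB in bytes           : " + (32 * 1024 * 1024));
    }
}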

2. Testing again in beeline, all tasks could be submitted normally, the previously pending applications gradually resumed running, and the number of pending containers gradually decreased. At this point the problem was resolved.

4. Summary of the problem

1. The root cause of this failure is related to the cluster load at the time: Zookeeper's jute.maxbuffer limit was exceeded, causing an error on the Zookeeper server, which then closed the connection with the ResourceManager [1]. The Len error is not raised by the ResourceManager; it is raised on the Zookeeper server side because the amount of data Zookeeper needs to return to the ResourceManager exceeds the default 4MB limit. The exact amount of data depends on the state of the corresponding znodes.

2020-09-15 16:25:43,431 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x17187f84f281336 due to java.io.IOException: Len error 14307055

2. In general, this problem can be related to the number of running jobs, the number of attempts per job, growth in application load, and cluster expansion. Checking both the time of this incident and September 10th, YARN had more than 70,000 pending containers [2], a heavy load for YARN and Zookeeper at the time, and this is also what caused the ResourceManager to be disconnected from Zookeeper.

On September 15th, there were more than 70,000 YARN pending containers.

On September 10th, there were also more than 70,000 YARN pending containers.

A chart of YARN pending containers over the past 30 days was also reviewed.
3. Further investigation found that the Len error triggered in this failure is actually related to a Zookeeper bug (ZOOKEEPER-706) [1]. The data that YARN writes to Zookeeper includes not only the application's attribute information but also runtime information such as each application's delegation token and the diagnostic info of each attempt, so a heavily loaded cluster can easily write a large amount of data to Zookeeper. YARN has made important fixes in this area; see YARN-3469, YARN-2962, YARN-7262, and YARN-6967 [2] for details. Our cluster is CDH 5.15.1, and our investigation shows that this ResourceManager problem is not fully fixed in CDH5: although CDH5 includes YARN-3469, solving the problem also requires at least YARN-2962, YARN-7262, and YARN-6967, which optimize the znode structure so that a large number of flat znodes does not cause Len errors on the Zookeeper side. All of these are in CDH6 and later versions, but will not be backported to CDH5. In addition, although the community has improved YARN, new YARN features keep increasing the amount of data stored in Zookeeper, so even after upgrading to CDH6 the problem of YARN writing large amounts of data to Zookeeper may still appear.

【1】

https://my.cloudera.com/knowledge/ERROR-javaioIOException-Len-errorquot-in-Zookeeper-causing?id=275334
https://issues.apache.org/jira/browse/ZOOKEEPER-706

【2】

https://issues.apache.org/jira/browse/YARN-3469
https://issues.apache.org/jira/browse/YARN-2962
https://issues.apache.org/jira/browse/YARN-7262
http://mail-archives.apache.org/mod_mbox/hadoop-yarn-issues/201708.mbox/%3CJIRA.13093146.1502193666000.123395.1502242800181@Atlassian.JIRA%3E

5. Fault summary

1. The ResourceManager connects to Zookeeper mainly to update its state store, which holds the persistent state of the YARN cluster, including all running applications and their attempts. This state store lives under Zookeeper's /rmstore directory, and the data can be accessed and read with zookeeper-client. When the ResourceManager fails over, the ResourceManager that becomes Active reads the data under /rmstore and restores the running state of the entire YARN cluster.
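
Complementing the size check sketched in the analysis above, the state store can also be browsed read-only with the standard Zookeeper Java client (the same data that zookeeper-client shows). A minimal sketch, assuming the default state-store root /rmstore/ZKRMStateRoot and a placeholder connection string:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class ListRmStore {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; use the cluster's Zookeeper quorum.
        ZooKeeper zk = new ZooKeeper("zk01:2181,zk02:2181,zk03:2181", 30000, event -> { });
        try {
            String root = "/rmstore/ZKRMStateRoot";  // default ZKRMStateStore root
            for (String child : zk.getChildren(root, false)) {
                List<String> grandChildren = zk.getChildren(root + "/" + child, false);
                // RMAppRoot holds one znode per application; a very large child count here
                // is exactly the kind of data volume that strains Zookeeper.
                System.out.println(child + " : " + grandChildren.size() + " children");
            }
        } finally {
            zk.close();
        }
    }
}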

2. When the problem occurred, the two ResourceManagers kept trying to fail over between Active and Standby, but neither succeeded, because neither ResourceManager ever got a response from any Zookeeper.

1) Specifically, the Active ResourceManager (cmsnn002) kept trying to connect to one Zookeeper after another. From the ResourceManager log line org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 999, we can see that it tried the first Zookeeper 1000 times without getting a response, then started again from retry 1 against the second Zookeeper, and after another 1000 attempts still got no response. We have five Zookeepers, and the ResourceManager cycled through 1000 retries on each of the five in turn without ever getting a response. During this process we tried a rolling restart of the ResourceManagers, but the cluster still did not recover; in effect, the Active ResourceManager was stuck in an endless loop of connecting to Zookeeper and could not do any work.

2) By default, the ResourceManager retries a Zookeeper read 1000 times against each Zookeeper; as soon as the data is read successfully once, the active/standby switch can complete. The parameter that controls the number of retries is yarn.resourcemanager.zk-num-retries, and the parameter that controls the interval between retries is yarn.resourcemanager.zk-retry-interval-ms. Both can be modified through CM's ResourceManager Advanced Configuration Snippet (Safety Valve) for yarn-site.xml. However, changing these two values does not help with this fault: as long as the ZOOKEEPER-706 issue is not addressed, the ResourceManager may still be unable to read the data even when it is connected to Zookeeper.
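
As a purely illustrative back-of-the-envelope calculation (our own sketch, not taken from the logs): the time the ResourceManager can spend spinning on Zookeeper grows roughly as retries × interval × number of servers, which is why these two parameters matter. The interval value below is a placeholder; the effective interval the ResourceManager uses may also be derived from the Zookeeper session timeout, so treat the result as an order of magnitude only.

public class ZkRetryBudget {
    public static void main(String[] args) {
        int numRetries = 1000;   // yarn.resourcemanager.zk-num-retries (default per the text above)
        long intervalMs = 1000;  // yarn.resourcemanager.zk-retry-interval-ms (placeholder value)
        int zkServers = 5;       // this cluster runs five Zookeepers

        long perServerMs = numRetries * intervalMs;
        long worstCaseMs = perServerMs * zkServers;
        System.out.println("Retry budget per Zookeeper  : ~" + (perServerMs / 60000) + " minutes");
        System.out.println("Worst case across " + zkServers + " servers : ~" + (worstCaseMs / 60000) + " minutes");
    }
}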

3. In this failure we solved the problem by raising Zookeeper's jute.maxbuffer to 32MB. To size this value, find the maximum Len error value reported in the Zookeeper logs; if it is below 32MB, the general recommendation is 32MB. The value should not be set too large, though, because Zookeeper is not designed to store large chunks of data. If an application (or service) keeps writing large amounts of data to Zookeeper, it puts pressure on disk I/O and on Zookeeper's own synchronization, and applications (services) that depend on Zookeeper easily become unstable. This YARN failure is essentially an example of that.
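
To find the largest Len error value mentioned above, the Zookeeper server logs can simply be scanned for the number that follows "Len error". A minimal sketch, assuming the log has been copied to a local file whose path is passed as an argument (the default file name here is hypothetical):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxLenError {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of a Zookeeper server log.
        String logFile = args.length > 0 ? args[0] : "zookeeper.log";
        Pattern pattern = Pattern.compile("Len error (\\d+)");
        long max = 0;
        for (String line : Files.readAllLines(Paths.get(logFile))) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                max = Math.max(max, Long.parseLong(m.group(1)));
            }
        }
        // 14307055 in this incident, i.e. roughly 14MB.
        System.out.println("Largest Len error seen: " + max + " bytes (~" + (max / (1024 * 1024)) + " MB)");
    }
}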

4. If this problem occurs again, we can determine the current Zookeeper Leader from the CM -> Zookeeper page, and then collect that node's Zookeeper log directory (e.g. /var/log/zookeeper/) and data directory (e.g. /var/lib/zookeeper/version-2) for analysis. By analyzing Zookeeper's stored data we can further determine which writes caused the Zookeeper Len error, and LogFormatter can be used to see exactly what data Zookeeper was writing at the time. For the specific method, refer to the following link [1] and the sketch after it.

【1】:http://www.openkb.info/2017/01/how-to-read-or-dump-zookeeper.html
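
A minimal sketch of driving the transaction-log dumper from Java rather than from the java -cp command line described in the link. It assumes the Zookeeper 3.4 series shipped with CDH5, where the dumper class is org.apache.zookeeper.server.LogFormatter, and the transaction-log path below is hypothetical:

import org.apache.zookeeper.server.LogFormatter;

public class DumpRmStoreTxnLog {
    public static void main(String[] args) throws Exception {
        // Hypothetical transaction log file copied from the Zookeeper data directory.
        String txnLog = args.length > 0 ? args[0] : "/var/lib/zookeeper/version-2/log.100000001";
        // LogFormatter prints each transaction in the log (zxid, session, type and the znode
        // path involved), which helps identify which paths were receiving the large writes.
        LogFormatter.main(new String[] { txnLog });
    }
}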

1) Zookeeper data is first written to the transaction log in the data directory (for example, /var/lib/zookeeper/version-2), and a snapshot is taken automatically once roughly 100,000 transactions have accumulated. The data files are therefore not organized by time, and there is no setting for how many days of data to keep. By default, CDH purges this directory daily and keeps only 5 snapshots. To keep more data, you can increase CM -> Zookeeper -> Configuration -> Auto Purge Snapshots Retain Count, but this may not help much in locating Zookeeper-related problems, because the snapshots are not necessarily taken at the moment the problem occurs. If the problem recurs, it is therefore recommended to archive the entire data directory at that time for analysis.

