7.2 Hadoop failures: task failure, application master failure, node manager failure, resource manager failure

1.1 Failures

1.1.1 Task failure

Map and reduce task failure: when the user code in a map or reduce task is defective and throws an exception, the task JVM sends an error report to the application master before it exits. The application master marks the task attempt as failed, the error is written to the user logs, and the container is released so that its resources become available again.

Streaming task failure: a Streaming task whose process exits with a non-zero exit code is marked as failed. This behavior applies when the stream.non.zero.exit.is.failure property is set to true.
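The rule above can be written down as a one-line decision. This is only an illustrative sketch, not Hadoop code: the function name `streaming_task_failed` and its parameter names are made up for the example, and the default of `True` mirrors the setting described in the text.

```python
def streaming_task_failed(exit_code: int, non_zero_exit_is_failure: bool = True) -> bool:
    """Decide whether a Streaming task counts as failed.

    non_zero_exit_is_failure stands in for stream.non.zero.exit.is.failure.
    """
    # A non-zero exit code only matters when the property is enabled.
    return non_zero_exit_is_failure and exit_code != 0


if __name__ == "__main__":
    for code in (0, 1):
        print(code, streaming_task_failed(code))
```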

JVM failure: if the task JVM exits abruptly because of a bug in the JVM itself, the node manager notices that the process has exited and notifies the application master, which marks the task attempt as failed.

Task timeout: if the application master receives no progress update from a task for a certain period, it marks the task as failed. The period is set by mapreduce.task.timeout, in milliseconds. A value of 0 disables the timeout; a hung task is then never marked as failed and never releases its resources.
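A minimal sketch of the timeout check, assuming timestamps in milliseconds. The function `is_timed_out` is hypothetical and only models the rule; the default of 600000 ms (10 minutes) is the usual default for mapreduce.task.timeout, and the special handling of 0 follows the description above.

```python
def is_timed_out(last_progress_ms: int, now_ms: int, task_timeout_ms: int = 600_000) -> bool:
    """Return True if a task has gone too long without a progress update.

    task_timeout_ms stands in for mapreduce.task.timeout.
    """
    if task_timeout_ms == 0:
        # 0 disables the timeout: a hung task is never marked as failed.
        return False
    return now_ms - last_progress_ms > task_timeout_ms


if __name__ == "__main__":
    print(is_timed_out(last_progress_ms=0, now_ms=700_000))
    print(is_timed_out(last_progress_ms=0, now_ms=700_000, task_timeout_ms=0))
```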

Task retry: after a task attempt fails, the application master tries to reschedule the task, preferably on a different node manager. If a task fails more times than allowed by mapreduce.map.maxattempts (default 4) for map tasks, or mapreduce.reduce.maxattempts for reduce tasks, the whole job fails. If a few failed tasks should not fail the whole job, set the maximum percentage of tasks that may fail with mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent.
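The job-level tolerance can be sketched as a percentage check. This is an illustrative model, not the Hadoop implementation: `job_failed` is a hypothetical helper, "failed task" here means a task that has already used up all of its attempts, and a default of 0 percent is assumed to mean that a single failed task fails the job.

```python
def job_failed(failed_tasks: int, total_tasks: int, max_failures_percent: int = 0) -> bool:
    """Return True if too many tasks have failed for the job to succeed.

    max_failures_percent stands in for mapreduce.map.failures.maxpercent
    (or mapreduce.reduce.failures.maxpercent for reduce tasks).
    """
    # Integer arithmetic avoids rounding problems with percentages.
    return failed_tasks * 100 > max_failures_percent * total_tasks


if __name__ == "__main__":
    # 3 of 100 map tasks failed, 5 percent tolerated: the job keeps going.
    print(job_failed(3, 100, max_failures_percent=5))
```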

Killed task attempts: an attempt may be killed because it is a speculative duplicate or because the node it ran on failed. The application master marks such an attempt as killed rather than failed, and killed attempts do not count toward the maximum number of attempts (maxattempts).
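The difference between failed and killed attempts can be illustrated by counting attempt states. The state names `"FAILED"`, `"KILLED"`, and `"SUCCEEDED"` and the helper `task_has_failed` are assumptions made for this sketch; only the rule that killed attempts are not counted comes from the text.

```python
from collections import Counter
from typing import Iterable


def task_has_failed(attempt_states: Iterable[str], max_attempts: int = 4) -> bool:
    """Return True once the number of FAILED attempts reaches max_attempts.

    max_attempts stands in for mapreduce.map.maxattempts. KILLED attempts
    (speculative duplicates, attempts lost with a node) are ignored.
    """
    counts = Counter(attempt_states)
    return counts["FAILED"] >= max_attempts


if __name__ == "__main__":
    history = ["FAILED", "KILLED", "FAILED", "KILLED", "FAILED"]
    print(task_has_failed(history))  # only three attempts count as failures
```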

1.1.2 Application master failure

The maximum number of attempts for a MapReduce application master is controlled by the mapreduce.am.max-attempts property, which defaults to 2. YARN also limits the number of attempts for any application master running on the cluster through the yarn.resourcemanager.am.max-attempts property, which also defaults to 2. To increase the number of MapReduce application master attempts, increase the YARN setting first.
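How the two settings interact can be shown with a small sketch. It assumes that the YARN limit acts as a cap on the per-application value, which is what the advice to raise the YARN setting first implies; `effective_am_attempts` is a hypothetical helper, not a Hadoop API.

```python
def effective_am_attempts(mapreduce_am_max_attempts: int = 2,
                          yarn_am_max_attempts: int = 2) -> int:
    """Return the number of application master attempts actually allowed.

    The arguments stand in for mapreduce.am.max-attempts and
    yarn.resourcemanager.am.max-attempts. Assumption: YARN caps the
    application's own setting.
    """
    return min(mapreduce_am_max_attempts, yarn_am_max_attempts)


if __name__ == "__main__":
    # Raising only the MapReduce setting has no effect under this assumption.
    print(effective_am_attempts(mapreduce_am_max_attempts=4))
    print(effective_am_attempts(mapreduce_am_max_attempts=4, yarn_am_max_attempts=4))
```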

Application master restart: the application master sends periodic heartbeats to the resource manager. When the application master fails, the resource manager stops receiving heartbeats and starts a new application master instance in a new container. The new instance uses the job history to recover the state of the tasks that have already run, so they do not have to be rerun. This recovery is controlled by yarn.app.mapreduce.am.job.recovery.enable.

Redirecting the client to the new application master address: the client polls the application master for progress reports. During job initialization the client asks the resource manager for the application master's address and caches it. If the connection to the application master is lost, the client asks the resource manager again for the application master's address.

1.1.3 Node manager failure

There is also a heartbeat between each node manager and the resource manager. If the resource manager receives no heartbeat from a node manager for 10 minutes (set in milliseconds by the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms property), it removes the faulty node manager from the pool of nodes. Unfinished work that was running on the removed node manager is recovered on other nodes and rerun.
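The liveness check can be pictured as a scan over the last heartbeat time of each node. This is a simplified model under stated assumptions: `expired_nodes` and the node names are invented for the example, timestamps are in milliseconds, and the 600000 ms default corresponds to the 10 minutes mentioned above.

```python
from typing import Dict, List


def expired_nodes(last_heartbeat_ms: Dict[str, int], now_ms: int,
                  expiry_interval_ms: int = 600_000) -> List[str]:
    """Return the node managers whose last heartbeat is too old.

    expiry_interval_ms stands in for
    yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms.
    """
    return sorted(
        node for node, seen in last_heartbeat_ms.items()
        if now_ms - seen > expiry_interval_ms
    )


if __name__ == "__main__":
    heartbeats = {"nm1": 0, "nm2": 500_000}
    # At t = 700 s, nm1 has been silent for more than 10 minutes.
    print(expired_nodes(heartbeats, now_ms=700_000))
```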

Node manager blacklist: the blacklist is managed by the application master. For MapReduce, when three tasks have failed on one node manager, the application master tries to schedule tasks on different nodes. The threshold is set by the mapreduce.job.maxtaskfailures.per.tracker property.
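A minimal sketch of per-node failure counting, to make the blacklist idea concrete. The class `NodeBlacklist` and its methods are hypothetical, and whether the real check is "at least" or "more than" the threshold is an assumption here; the sketch treats a node as blacklisted once the threshold is reached.

```python
from collections import Counter
from typing import Iterable, Optional


class NodeBlacklist:
    """Tracks task failures per node for one job (illustrative only)."""

    def __init__(self, max_failures_per_node: int = 3) -> None:
        # Stands in for mapreduce.job.maxtaskfailures.per.tracker.
        self.max_failures_per_node = max_failures_per_node
        self.failures: Counter = Counter()

    def record_failure(self, node: str) -> None:
        self.failures[node] += 1

    def is_blacklisted(self, node: str) -> bool:
        return self.failures[node] >= self.max_failures_per_node

    def pick_node(self, candidates: Iterable[str]) -> Optional[str]:
        """Return the first candidate that is not blacklisted, if any."""
        for node in candidates:
            if not self.is_blacklisted(node):
                return node
        return None


if __name__ == "__main__":
    blacklist = NodeBlacklist()
    for _ in range(3):
        blacklist.record_failure("nm1")
    print(blacklist.pick_node(["nm1", "nm2"]))  # nm1 is skipped
```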

1.1.4 Resource manager failure

A resource manager failure is very serious: without a resource manager, neither jobs nor task containers can be started, and failed jobs cannot be recovered. High availability is achieved by running resource managers in a hot standby (active-standby) configuration. Information about running applications is kept in a highly available state store (ZooKeeper or HDFS). When the resource manager restarts, it reads the application information from the state store to recover the key state of the failed resource manager, and then restarts the application masters of all applications. These restarts do not count toward yarn.resourcemanager.am.max-attempts. Node manager information is not kept in the state store, because the resource manager can rebuild it.

Failover controller: when the active resource manager fails, the failover controller automatically switches from the active resource manager to the standby.

Failover can also be configured to be manual, but this is not recommended.

Clients and node managers reconnect to the resource manager automatically: they try the resource managers in round-robin fashion and keep trying until the standby resource manager has replaced the failed one and the connection succeeds.
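The round-robin reconnection can be sketched as a loop over the configured resource manager addresses. Everything named here is hypothetical: `connect_to_active`, the `try_connect` callback, and the addresses are invented for the example, and `max_tries` is a bound added only so that the sketch terminates, whereas the text describes retrying until a connection succeeds.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence


def connect_to_active(rm_addresses: Sequence[str],
                      try_connect: Callable[[str], bool],
                      max_tries: int = 10) -> Optional[str]:
    """Try each resource manager in turn until one accepts the connection.

    Returns the address that accepted, or None if max_tries was exhausted.
    """
    for _, address in zip(range(max_tries), cycle(rm_addresses)):
        if try_connect(address):
            return address
    return None


if __name__ == "__main__":
    # Pretend rm1 has failed and rm2 has become active.
    print(connect_to_active(["rm1:8032", "rm2:8032"],
                            lambda address: address.startswith("rm2")))
```

A production client would normally wait between attempts instead of retrying immediately; the sketch leaves that out to keep the control flow visible.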

 

Origin www.cnblogs.com/bclshuai/p/12204106.html