Spark exceptions: "Removing executor 5 with no recent heartbeats: 120504 ms exceeds timeout 120000 ms" and possible solutions (analysis of Spark-on-YARN executor heartbeat timeouts)

Problem Description and analysis

The problem in the title can generally be described as follows:

The Executor fails to send a heartbeat to the Driver within the timeout, so the Driver judges that the Executor has hung and hands the tasks that were running on that Executor to another Executor for re-execution.

The default wait time is spark.network.timeout = 120s.

The complete error looks roughly like this:

17/01/13 09:13:08 WARN spark.HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 161684 ms exceeds timeout 120000 ms
17/01/13 09:13:08 ERROR cluster.YarnClusterScheduler: Lost executor 5 on slave10: Executor heartbeat timed out after 161684 ms
17/01/13 09:13:08 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, slave10): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 161684 ms
17/01/13 09:13:08 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
17/01/13 09:13:08 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 5
17/01/13 09:13:08 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 5, slave06, partition 0,RACK_LOCAL, 8029 bytes)
17/01/13 09:13:08 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/01/13 09:13:08 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, slave10, 34439)
17/01/13 09:13:08 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
17/01/13 09:13:08 INFO scheduler.DAGScheduler: Host added was in lost list earlier: slave10
17/01/13 09:13:08 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 5.
17/01/13 09:13:08 INFO scheduler.TaskSetManager: Finished task 0.1 in stage 0.0 (TID 5) in 367 ms on slave06 (5/5)
17/01/13 09:13:08 INFO scheduler.DAGScheduler: ResultStage 0 (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) finished in 162.495 s

 

Possible reasons why the Executor fails to send a heartbeat to the Driver in time:

1. The Executor has really hung

2. The Executor crashed while executing a task because of insufficient resources

3. Other reasons

Here we mainly address the second case.

 

Solutions

Add resources --- increase memoryOverhead

A brief explanation:

The memory governed by spark.yarn.executor.memoryOverhead is managed by Spark itself through the Tungsten memory-management mechanism: it is requested when needed and released as soon as it is no longer used, so memory utilization is high. [This mechanism exists precisely because JVM memory management, with its garbage collection, is relatively inefficient.]

The memory governed by spark.executor.memory is managed by the JVM; allocation and reclamation go through the various garbage-collection mechanisms, which is convenient to use but less efficient.

 

Cause Analysis

If there is not enough space to store RDDs, partitions of earlier RDDs are evicted to make room for partitions of later RDDs; when an evicted partition is needed again, it has to be recomputed;

If the Java heap or the permanent generation runs out of memory, various kinds of OOM occur and the executor is killed; Spark then requests a new container to run a replacement executor, and the failed tasks and lost data are recomputed on it;

If at runtime the sum ExecutorMemory + MemoryOverhead (the total memory of the JVM process) exceeds the container's capacity, YARN kills the container directly; nothing is recorded in the executor log, and Spark requests a new container to run a replacement executor;

If the JVM process uses a lot of memory outside the Java heap, MemoryOverhead needs to be set large enough, otherwise the executor will be killed.

 

 

Specific operations

The default is spark.yarn.executor.memoryOverhead = max(executorMemory * 0.10, 384), in MB.
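
As a quick sanity check, here is a tiny sketch in plain Python of how that default works out for a hypothetical 8 GB executor (the 8192 MB figure is just an illustration, not a value from the original post):

executor_memory_mb = 8192                                 # spark.executor.memory = 8g (hypothetical)
overhead_mb = max(int(executor_memory_mb * 0.10), 384)    # default memoryOverhead -> 819 MB
container_request_mb = executor_memory_mb + overhead_mb   # what one YARN container must hold
print(overhead_mb, container_request_mb)                  # 819 9011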

We can also set it manually:

--conf spark.yarn.executor.memoryOverhead=512

--conf spark.yarn.driver.memoryOverhead=512
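
The executor-side overhead can also be set from code before the executors are requested. Below is a minimal PySpark sketch under that assumption; the 512 MB value is just the one used above, not a recommendation, newer Spark versions prefer the key spark.executor.memoryOverhead, and the driver-side overhead generally still has to be passed at submit time (as above) because the driver container is sized before user code runs.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Minimal sketch: extra off-heap headroom per executor, in MB (illustrative value).
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "512")

spark = SparkSession.builder.config(conf=conf).getOrCreate()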

 

Reduce resource consumption --- use combineByKey

Operators such as reduceByKey (which is built on combineByKey) can noticeably reduce the memory footprint compared with groupBy/groupByKey, as in the second line below:

# groupBy materializes every value of a key in memory before counting:
rdd.repartition(20).map(mymap).groupBy(mygroup).mapValues(len).collect()
# reduceByKey combines partial counts on the map side, so far less is held in memory:
rdd.repartition(20).map(mymap).map(lambda x: (mygroup(x), 1)).reduceByKey(lambda x, y: x + y).collect()
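
Since the heading mentions combineByKey directly: when the per-key merge is more involved than a plain sum, combineByKey can be used itself. A minimal sketch, assuming a hypothetical pair RDD of (key, number) and an existing SparkContext sc, computing a per-key average without ever collecting one key's values into a single in-memory group:

# Hypothetical data; sc is an existing SparkContext.
pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 5.0)])

sums_counts = pairs.combineByKey(
    lambda v: (v, 1),                          # createCombiner: start (sum, count) from the first value
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold another value into the combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners: merge partial results across partitions

averages = sums_counts.mapValues(lambda p: p[0] / p[1]).collect()
# e.g. [('a', 2.0), ('b', 5.0)]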

 

There are also a couple of relatively simple approaches:

1. Increase the wait time spark.network.timeout

2. With the same total resources, increase the memory per executor and the cores per executor while reducing the number of executors; the overall total stays the same, but each task is guaranteed more memory (see the sketch after this list)
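
A minimal sketch of both knobs set through SparkConf; the concrete numbers are purely illustrative assumptions, not values from the original post:

from pyspark import SparkConf

# Illustrative values only: a longer timeout, and fewer but larger executors
# so that each task gets more memory while total resources stay roughly the same.
conf = (SparkConf()
        .set("spark.network.timeout", "300s")    # raise the 120s default
        .set("spark.executor.instances", "10")   # fewer executors ...
        .set("spark.executor.memory", "16g")     # ... each with more memory
        .set("spark.executor.cores", "4"))       # ... and more cores per executor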

 

 

