Spark Common Error Analysis and Countermeasures

Question 1: Appears in the log: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Cause analysis:
A shuffle has two parts: shuffle write and shuffle read.
The number of shuffle write partitions is determined by the number of RDD partitions in the previous stage, while the number of shuffle read partitions is determined by parameters provided by Spark.
Shuffle write can be thought of as an operation similar to saveAsLocalDiskFile: it temporarily stores the intermediate results on the local disk of each executor according to certain rules.
If the shuffle read partition count is set too low while the amount of shuffle read data is large, a single task has to process a very large amount of data. The executor JVM may then crash, so the shuffle data can no longer be fetched and the executor is lost. The error "Failed to connect to host" indicates such an executor loss. Even when the JVM does not crash, the task can suffer long GC pauses.
Solution:
1. Reduce the amount of shuffle data.
This mainly starts at the code level: filter out unnecessary data before the shuffle. For example, if the original data has 20 fields, select only the fields that are needed, which reduces the amount of shuffle data.
2. Adjust the number of partitions.
The number of partitions is controlled through spark.sql.shuffle.partitions (default 200). Increase this value according to the shuffle volume and the computational complexity, for example to 500.
3. Increase the number of failed retries and the retry interval.
The number of retries defaults to 3 and can be increased, for example to 10.
The retry interval defaults to 5s and can be increased, for example to 10s.
4. Increase the memory of the executor.
When submitting the task with spark-submit, increase the executor memory, for example to 15G or 20G.
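
As a rough illustration of step 2, the sketch below estimates a shuffle partition count from the total amount of shuffle read data. The function name and the 128 MiB target per partition are assumptions made for this example, not Spark defaults. The only Spark fact used is the default of 200 for spark.sql.shuffle.partitions.

```python
import math

# Spark's default value for spark.sql.shuffle.partitions.
DEFAULT_SHUFFLE_PARTITIONS = 200


def suggest_shuffle_partitions(shuffle_read_bytes,
                               target_partition_bytes=128 * 1024**2):
    """Return a partition count so that each shuffle read task handles
    roughly target_partition_bytes.

    The 128 MiB target is an assumed rule of thumb (hypothetical default).
    """
    needed = math.ceil(shuffle_read_bytes / target_partition_bytes)
    # Never suggest fewer partitions than Spark's default.
    return max(DEFAULT_SHUFFLE_PARTITIONS, needed)


if __name__ == "__main__":
    # Hypothetical job that reads 100 GiB of shuffle data.
    partitions = suggest_shuffle_partitions(100 * 1024**3)
    print(f"--conf spark.sql.shuffle.partitions={partitions}")
```

The estimate is only a starting point. The right value also depends on computational complexity and on how evenly the keys are distributed.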

Question 2: Appears in the log: Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1

Cause analysis:
The log shows that during the ShuffleMapStage, that is, the shuffle read stage, the Driver timed out while broadcasting input data to the Executors.
Solution:
1. Increase the timeout: spark.sql.broadcastTimeout=800
2. Increase the number of retries: spark.sql.broadcastMaxRetries=3
3. Disable broadcast joins: set spark.sql.autoBroadcastJoinThreshold = -1

Question 3: Appears in the log: org.apache.spark.sql.catalyst.parser.ParseException

Cause analysis:
Spark reports an error while parsing the SQL statement.
Solution:
Check whether the SQL is written correctly.

Question 4: Appears in the log: SparkException: Could not find CoarseGrainedScheduler

Cause analysis:
This is a resource problem. The task should be allocated more cores, more executors, and more memory, and the RDD should be given more partitions.
Solution:
1. Increase the resources: the number of cores and executors.
2. Adding the following setting to the resource configuration may solve the problem:
--conf spark.dynamicAllocation.enabled=false

Question 5: Appears in the log: Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.tl$1()Lscala/collection/immutable/List;

Cause analysis:
The Scala versions are inconsistent.
Solution:
1. Specify an image with the same Scala version for the Spark task: --conf spark.kubernetes.container.image=<image address>

Question 6: Appears in the log: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 9478 tasks (1024.1 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)

Cause analysis:
The total size of the serialized results exceeds the maximum result size allowed for the Spark task (spark.driver.maxResultSize defaults to 1g).
Solution:
1. Increase spark.driver.maxResultSize:
--conf spark.driver.maxResultSize=2g
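
The figures in the log line can be used to size the new limit. The sketch below works out the average serialized result per task and a limit with some headroom. The function name and the 1.5 headroom factor are assumptions for illustration only. Since the results are collected on the driver, a larger limit may also call for more driver memory.

```python
import math


def suggest_max_result_size_mib(num_tasks, avg_result_kib, headroom=1.5):
    """Estimate a value for spark.driver.maxResultSize in MiB.

    headroom is an assumed safety factor, not a Spark setting.
    """
    total_mib = num_tasks * avg_result_kib / 1024
    return math.ceil(total_mib * headroom)


if __name__ == "__main__":
    # Figures from the log line: 9478 tasks produced 1024.1 MiB in total.
    avg_kib = 1024.1 * 1024 / 9478
    print(f"average result per task: {avg_kib:.1f} KiB")
    size = suggest_max_result_size_mib(9478, avg_kib)
    print(f"--conf spark.driver.maxResultSize={size}m")
```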

Question 7: Appears in the log: The executor with id 12 exited with exit code 137

Cause analysis:
The executor ran out of memory (OOM).
Solution:
1. Increase executor memory.
Example parameters: --conf spark.executor.memory=10g
Note: In a few cases it is the off-heap memory (overhead memory) that is insufficient, and the off-heap memory must be increased instead.
Example parameters: --conf spark.executor.memoryOverhead=5g
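
Exit code 137 follows the common convention that a code above 128 means the process was terminated by signal (code - 128). Here that is signal 9, SIGKILL, which is what a container runtime or the kernel typically sends when a memory limit is exceeded. The sketch below decodes the exit code and adds up the memory requested per executor container. The helper names are hypothetical. The fallback overhead of max(384 MiB, 10% of executor memory) is the documented default for JVM jobs and is an assumption for other setups.

```python
import signal


def describe_exit_code(code):
    """Explain a process exit code, treating codes above 128 as
    'terminated by signal (code - 128)'."""
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {number}"
        return f"killed by {name}"
    return f"exited with status {code}"


def container_memory_mib(executor_memory_mib, overhead_mib=None):
    """Total memory requested for one executor container, in MiB.

    Without an explicit overhead, assume max(384 MiB, 10% of executor memory).
    """
    if overhead_mib is None:
        overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    print(describe_exit_code(137))
    # spark.executor.memory=10g with the assumed default overhead
    print(container_memory_mib(10 * 1024))
    # spark.executor.memory=10g with spark.executor.memoryOverhead=5g
    print(container_memory_mib(10 * 1024, 5 * 1024))
```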

Question 8: WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost) WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, Connection from / closed java.util.concurrent.TimeoutException: Futures timed out after [120 second ERROR TransportChannelHandler: Connection to / has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust if this is wrong

Cause analysis:
TaskSetManager: Lost task & TimeoutException
Because of network or GC problems, the worker or executor did not receive the heartbeat feedback of the executor or task in time.
Solution:
1. Increase the timeout value and change it to 300 (5 min) or higher, depending on the situation.
2. Configure the timeout for all network transfers. The more specific timeout parameters, when they are not set explicitly, take their values from this default.

Question 9: Appears in the log: java.lang.OutOfMemoryError: Not enough memory to build and broadcast

Cause analysis:
Driver-side OOM.
An OOM on the Driver side falls into one of two categories:
1. The created data set exceeds the memory limit.
2. The collected result set exceeds the memory limit.
When a broadcast variable is created, the data fragments distributed across all Executors are first pulled to the Driver, the broadcast variable is then built on the Driver, and finally the Driver distributes the packaged broadcast variable to each Executor. The first step, pulling the data, is actually implemented with collect. If the total size of the data fragments on the Executors exceeds the Driver's memory limit, an OOM is reported as well.
Solution:
Increase the memory size on the Driver side.

Question 10: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf java.lang.OutOfMemoryError: Java heap space at java.lang.reflect.Array.newInstance

Cause analysis:
Executor-side OOM.
User Memory stores user-defined data structures such as arrays, lists, and dictionaries. If the total size of these data structures exceeds the upper limit of the User Memory region, this error occurs.
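
To get a feel for the upper limit mentioned above, the sketch below estimates the size of the User Memory region under Spark's unified memory model. It assumes a fixed 300 MiB reservation and the default spark.memory.fraction of 0.6. The function name is hypothetical, and the heap the JVM actually reports is usually somewhat smaller than the configured value, so the result is an approximation.

```python
# Fixed reservation assumed for Spark's unified memory manager.
RESERVED_MIB = 300


def user_memory_mib(executor_heap_mib, memory_fraction=0.6):
    """Approximate size of the User Memory region in MiB.

    memory_fraction mirrors spark.memory.fraction (assumed default 0.6).
    Whatever is left after the reservation and Spark's own share
    is available for user data structures.
    """
    usable = executor_heap_mib - RESERVED_MIB
    return usable * (1 - memory_fraction)


if __name__ == "__main__":
    for heap_gib in (4, 10, 24):
        size = user_memory_mib(heap_gib * 1024)
        print(f"{heap_gib}g heap -> about {size:.0f} MiB of User Memory")
```

Under these assumptions, raising spark.executor.memory or lowering spark.memory.fraction both enlarge the User Memory region. The second option takes the space away from execution and storage memory.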

Question 11: When Spark SQL executes insert overwrite, data duplication occurs.

Cause analysis:
Spark SQL did not delete the old data files (data files generated by Spark SQL) when executing insert overwrite. The process by which Spark SQL writes to Hive is as follows:

1. When Spark writes to Hive, it first creates a temporary _temporary directory to hold the generated data files. Once all files have been generated, they are moved to the output directory, the _temporary directory is deleted, and finally the Hive metadata is created (the partition is written).
2. Multiple Spark write tasks used the same _temporary directory. As a result, one of them could not delete the _temporary directory after generating its data and moving it to the Hive path (the task was killed), so the data arrived but the metadata was never created.
3. Because the earlier task produced data files but no metadata, the overwrite in the later task cannot find the metadata and therefore cannot delete the data files in the Hive path (the second task will generate no data in the task directory).
4. When the last Spark insert task finishes, the data files of multiple tasks have already been moved under the Hive path. Since no other Spark write task is running, the _temporary directory is deleted successfully and the metadata is created successfully. As a result, this metadata corresponds to all versions of the data files in the Hive path.

Question 12: A Spark job normally runs for about 10 minutes, but occasionally it runs far too long, for example about 5 hours.

Cause analysis:
The Spark UI shows that the tasks of the job all finish in about 10 minutes, except for one task that has been running for 5.4 hours and has not completed.
Solution:
Set the parameter spark.speculation=true
Principle: In Spark, tasks are executed in parallel according to a DAG and run in parallel on different executors. Some tasks finish very quickly while others run very slowly, for reasons such as network jitter, differing node performance, or data skew. A very slow task becomes the bottleneck of the whole job. Speculative execution can then launch a second copy of the long-running task, use the result of whichever copy finishes first, and kill the other one.
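
The selection of tasks for speculation can be sketched as follows. This is a simplified model, not Spark's scheduler code: it assumes that speculation starts once a given fraction of tasks has finished and that a running task qualifies when it has run longer than a multiple of the median finished duration. The function name is hypothetical, and the defaults of 0.75 and 1.5 are assumed from the Spark 3.x settings spark.speculation.quantile and spark.speculation.multiplier; other versions may differ.

```python
import statistics


def speculation_candidates(finished_secs, running_secs,
                           quantile=0.75, multiplier=1.5):
    """Return indexes of running tasks that look slow enough to re-launch.

    finished_secs: durations of tasks that already succeeded
    running_secs:  elapsed time of tasks that are still running
    """
    total = len(finished_secs) + len(running_secs)
    # Wait until enough tasks have finished to judge what "normal" is.
    if total == 0 or len(finished_secs) < quantile * total:
        return []
    threshold = multiplier * statistics.median(finished_secs)
    return [i for i, elapsed in enumerate(running_secs) if elapsed > threshold]


if __name__ == "__main__":
    # 99 tasks finished in about 10 minutes, one has been running for 5.4 hours.
    finished = [600] * 99
    running = [5.4 * 3600]
    print(speculation_candidates(finished, running))
```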

Question 13: org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead.

Cause analysis:
Because of insufficient resources, the executor stops sending heartbeats. The driver decides that it is lost and tries to connect to other executors, but those cannot be reached either because they have the same configuration. After n retries, the error is reported.

Solution:
Reduce the use of operations that trigger a shuffle, such as reduceByKey, thereby reducing memory usage.
Increase the timeout to allow more time to wait for heartbeat responses.
Increase spark.executor.cores to reduce the number of Executors created, so that the total memory used is reduced.
At the same time, increase spark.executor.memory to ensure that each Executor has enough available memory.
Increase spark.shuffle.memoryFraction, which defaults to 0.2 (requires spark.memory.useLegacyMode to be set to true, applies to version 1.5 or older, deprecated).
Example parameters: --conf spark.driver.memory=10g --conf spark.executor.cores=2 --conf spark.executor.memory=24g --conf spark.executor.memoryOverhead=4g --conf spark.default.parallelism=1500 --conf spark.sql.shuffle.partitions=1500 --conf

Question 14: Unexpected end of input stream

Cause analysis:
The input data of the Spark task is abnormal. When the task reads CSV files compressed in gz format, an error occurs because some of the data is abnormal: the gz-compressed files contain empty data.

Solution:
1. Locate the abnormal data and remove it.
2. Filter out the abnormal data and write the remaining data directly.
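
Step 1 can be done outside Spark. The sketch below uses Python's standard gzip module to classify each file as readable, empty, truncated, or otherwise unreadable. The function name and the result labels are hypothetical. It assumes the files can be read from a local or mounted path, so files on HDFS or object storage would first have to be copied or streamed.

```python
import gzip
import os
import sys
import zlib


def check_gzip(path, chunk_size=1024 * 1024):
    """Return None if the file decompresses cleanly, otherwise a short reason."""
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        with gzip.open(path, "rb") as f:
            # Decompress the whole stream in chunks without keeping the data.
            while f.read(chunk_size):
                pass
    except EOFError:
        # The compressed stream ends before its end-of-stream marker.
        return "truncated"
    except (OSError, zlib.error) as exc:
        return f"unreadable: {exc}"
    return None


if __name__ == "__main__":
    # Usage: python check_gz.py part-0001.csv.gz part-0002.csv.gz ...
    for path in sys.argv[1:]:
        problem = check_gzip(path)
        if problem is not None:
            print(f"{path}: {problem}")
```

Files reported by the check can then be removed or excluded from the input path before the Spark task is rerun.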

Question 15: Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;

Cause analysis:
The Scala versions are inconsistent.

Solution:
Replace the image with one that uses the same Scala version as the service.