Common Spark errors

1. SparkSQL related

1. An error is thrown when executing an INSERT statement; the stack trace shows: FileSystem closed. It often appears in ThriftServer.

Reason: The FileSystem object returned by Hadoop's FileSystem.get is loaded from a shared cache. If multiple threads share it and one thread closes the FileSystem, the others hit this bug.
Solution: Stop loading the HDFS FileSystem from the cache by setting fs.hdfs.impl.disable.cache=true in hdfs-site.xml.
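A minimal sketch of the corresponding hdfs-site.xml entry (only this property is shown; keep the rest of your existing configuration):

<property>
  <!-- Do not serve HDFS FileSystem instances from the shared cache -->
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>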

2. Thrown during Spark execution: Failed to connect to bigdata010108:33381, caused by: java.nio.channels.UnresolvedAddressException

Reason: The hosts file is not configured, so the hostname cannot be resolved.
Solution: Fix the hosts entry on the corresponding machine.
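For example, an /etc/hosts entry mapping the hostname to its IP might look like this (the IP address below is purely illustrative):

# hypothetical address for the node named in the error
10.0.0.108   bigdata010108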

3. Thrown when running Spark SQL against an ORC table: java.lang.IndexOutOfBoundsException or java.lang.NullPointerException

Reason: There is an empty ORC file under the partition or table. The bug is fixed in Spark 2.3.0 and later.
Solution: Work around it by changing ORC's default split strategy to hive.exec.orc.split.strategy=BI.
ORC has three split strategies (ETL, BI, HYBRID). The default is HYBRID, a mixed mode that automatically chooses ETL or BI based on file size and file count; BI mode splits by the number of files.
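One place to apply the workaround is at session level in spark-sql (a sketch; the setting can also go into hive-site.xml):

-- switch ORC split computation to BI mode for this session
set hive.exec.orc.split.strategy=BI;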

4. Spark 2.1.0 does not support permanent functions, because it cannot read jar packages stored on HDFS (this is supported from Spark 2.2.0).

5. spark-sql and ThriftServer report an error during use: java.net.SocketTimeoutException: read timed out

Reason: The connection times out because the Hive metastore is too busy or is in GC.
Solution: For spark-sql, increase the hive.metastore.client.socket.timeout parameter.
For ThriftServer, call DriverManager.setLoginTimeout(100) before obtaining the Connection.
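A minimal Scala sketch of the ThriftServer-side workaround; the JDBC URL, user, and password are placeholders, and the Hive JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

object ThriftServerClient {
  def main(args: Array[String]): Unit = {
    // Raise the JDBC login timeout (in seconds) before requesting a connection.
    DriverManager.setLoginTimeout(100)
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thriftserver-host:10000/default", // placeholder host/port/db
      "user", "password")                             // placeholder credentials
    // ... run queries, then close the connection
    conn.close()
  }
}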

6. Thrown when operating on a Snappy-compressed table: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

Reason: The Snappy native library is not on java.library.path.
Solution: Modify the spark-defaults.conf configuration file and add:
spark.executor.extraLibraryPath=/data/Install/hadoop/lib/native
or
spark.executor.extraJavaOptions -Djava.library.path=/data/Install/hadoop/lib/native

7. spark-sql splits a very small file into 20 tasks during execution, which makes it run too slowly.

Reason: When the HadoopRDD is generated, the number of partitions is taken as the maximum of the mapreduce.job.maps / mapred.map.tasks parameter (20) and Spark's default partition number (2), so 20 tasks are created.
Solution: Lower this parameter to reduce the number of tasks.
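One way to lower it is to pass the Hadoop setting through Spark's spark.hadoop.* prefix (a sketch; the value is illustrative and whether it takes effect depends on the input format):

spark-sql --conf spark.hadoop.mapreduce.job.maps=2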

8. ThriftServer login exception: javax.security.sasl.AuthenticationException: Error validating LDAP user

Reason: The password is wrong or the LDAP service is abnormal.
Solution: Fix the password or the LDAP authentication problem.

9. When connecting to ThriftServer via JDBC, operations such as show tables work, but select-related operations fail with: java.io.IOException: Failed to create local dir in /tmp/blockmgr-adb70127-0a28-4256-a205-c575acc74f9d/06.

Reason: ThriftServer has not been used for a long time and the system has cleaned up the parent directory, or the user has no write permission on the directory.
Solution: Restart ThriftServer and fix the permissions on the directory configured by spark.local.dir.

10. When the SQL statement run in Spark SQL is too complex, a java.lang.StackOverflowError is thrown.

Reason: The stack required at run time exceeds the stack size configured for the JVM.
Solution: Add the --driver-java-options "-Xss10m" option when starting spark-sql.
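For example (a sketch; combine with whatever other options you already pass):

# -Xss10m raises each thread's stack size to 10 MB
spark-sql --driver-java-options "-Xss10m"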

11. Repeated execution of INSERT INTO produces: Unable to move source hdfs://bigdata05/tmp/hive-hduser1101_hive_2017-09-11_14-50-56_038_2358196375683362770-82/-ext-10000/part-00000 to destination hdfs://bigdata05/user/hive

Reason: This is a bug in Spark 2.1.0 that has been fixed in Spark 2.1.1.
Solution: On 2.1.0, work around it with INSERT OVERWRITE on a non-partitioned table, which can be executed repeatedly without problems.

12. When performing operations such as join with a large amount of data:

1. Missing an output location for shuffle;

2. Failed to connect to bigdata030015/100.103.131.13:38742;

3. FileNotFoundException… (no such file or directory);

4. Container killed on request. Exit code is 143

Reason: Shuffle consists of two parts: shuffle write and shuffle read. The number of partitions for shuffle write is determined by the number of RDD partitions in the previous stage, while the number of partitions for shuffle read is controlled by parameters that Spark provides. Shuffle write can be roughly understood as an operation like saveAsLocalDiskFile: the intermediate results of the computation are temporarily written, according to certain rules, to the local disk of each executor.
If the shuffle-read partition count is set very small while the amount of shuffle data is large, a single task has to process a very large amount of data. The result is a JVM crash (OOM), which makes fetching shuffle data fail, and the executor is lost as well. A "Failed to connect to host" error means the executor was lost. Even when the JVM does not crash, it can cause long GC pauses.
Solution (a combined sketch of the settings follows this list):
1. Tune the SQL.
2. For Spark SQL and DataFrame operations such as join and group by, spark.sql.shuffle.partitions controls the number of partitions (default 200). Increase it according to the amount of shuffle data and the complexity of the computation.
3. For RDD operations such as join, groupBy and reduceByKey, spark.default.parallelism controls the number of partitions handled by shuffle read and reduce; set it to a larger value.
4. Increase spark.executor.memory to give each executor more memory.
5. Check whether the join suffers from data skew; see https://tech.meituan.com/spark-tuning-pro.html
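A minimal sketch of the settings from points 2-4 passed at submit time (all values are illustrative and must be tuned to your data; your-app.jar is a placeholder):

spark-submit \
  --conf spark.sql.shuffle.partitions=800 \
  --conf spark.default.parallelism=800 \
  --conf spark.executor.memory=8g \
  your-app.jar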

13. During Spark SQL use, the executor side throws: java.lang.OutOfMemoryError: GC overhead limit exceeded

Reason: Most of the time is being spent in GC, which leads to OOM.
Solution: Increase the executor memory and change the GC strategy: spark.executor.extraJavaOptions -XX:+UseG1GC
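For example, in spark-defaults.conf (the memory value is illustrative):

spark.executor.memory            8g
spark.executor.extraJavaOptions  -XX:+UseG1GC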

14. When HiveServer2 or Spark ThriftServer operates on an ORC table, an error is reported saying that user A cannot access user B's directory.

Reason: ORC caches the user when computing splits. This is an ORC bug in Hive 1.2.1 that was resolved in Hive 2.X and Spark 2.3.X.
Solution: The temporary workarounds are rather crude:
1. Run the first query as the super user, so that the cached user is root.
2. Set hive.fetch.task.conversion=none so that no caching takes place.

15. When using spark-sql, queries over small data sets are very slow. The Spark UI shows that each task finishes very quickly, but a task is scheduled only every 3 seconds, which makes the whole query slow.

Reason: This is caused by data-locality scheduling; the default spark.locality.wait is 3 seconds.
Solution: Set this parameter to 0 to speed things up. This is only recommended when the amount of data is small.
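For example (a sketch):

spark-sql --conf spark.locality.wait=0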

2. Spark core related

1. When Spark on YARN starts spark-sql or spark-submit: java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

Reason: Conflict with YARN's Jersey packages.
Solution: Configure --conf spark.hadoop.yarn.timeline-service.enabled=false

2. During Spark use: java.io.IOException: No space left on device

Reason: Generally, Spark's tmp directory is full.
Solution: Point spark.local.dir at a larger directory; multiple directories can be specified, separated by commas.
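For example, in spark-defaults.conf (the paths are illustrative):

spark.local.dir  /data1/spark-tmp,/data2/spark-tmp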

3. The maximum result size is exceeded: is bigger than spark.driver.maxResultSize (2.0GB)

Reason: The default value of spark.driver.maxResultSize is 1g.
Solution: Increase this parameter.
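For example, in spark-defaults.conf (the value is illustrative):

spark.driver.maxResultSize  4g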

4. Common OOM: java.lang.OutOfMemoryError: Java heap space

Reasons: 1. The amount of data is too large and the requested executor resources cannot support it. 2. A single partition holds too much data, or there are so many partitions that the task and job bookkeeping overwhelms the driver, resulting in a Driver OutOfMemoryError.
Solution: 1. Avoid collect operations where possible. 2. Check for data skew, increase the shuffle parallelism, and increase the executor memory.

5. Executor lost, task failures, and various timeouts caused by executor full GC: Futures timed out after [120s]

Reason: Generally the executor processes too much data (for example because of skew), falls into full GC, and the resulting pauses cause timeouts, lost executors, and failed tasks.
Solution: 1. If the executor log shows that full GC is the cause, tune the SQL appropriately and increase the executor memory. 2. If there is no full GC, consider increasing spark.network.timeout.

6. When jar package versions conflict: java.lang.ClassNotFoundException: XXX

Reason: Generally a conflict between the user's jars and Spark's own jars.
Solution: 1. Preferably align your jars with the versions Spark ships. 2. If that does not work, set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true.
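One way to set them, for example in spark-defaults.conf:

spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true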

7. Thrown during shuffle: Shuffle Fetch Failed: OOM

Reason: Too much data is fetched at once during the shuffle fetch phase.
Solution: 1. Increase the executor memory. 2. Decrease the spark.reducer.maxSizeInFlight parameter (default 48m).

8. Shuffle reports: org.apache.spark.shuffle.FetchFailedException: Direct buffer memory

Reason: Insufficient off-heap (direct) memory.
Solution: Increase the JVM parameter -XX:MaxDirectMemorySize (for example: spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=xxxm)

9. Abnormal cluster nodes, such as a disk becoming read-only, cause Spark job failures.

Reason: Spark is a high-performance, fault-tolerant distributed computing framework. Once it detects a problem on a compute node, it reschedules that node's tasks elsewhere based on the previously generated lineage. If the number of failures exceeds the allowed maximum, the job fails.
Solution: Spark has a blacklist mechanism: after a certain number of failures it stops scheduling tasks on that node or executor. Enable it with the corresponding parameter: spark.blacklist.enabled=true

Reprinted from: https://mp.weixin.qq.com/s/bqDu_4WBqjjJ7HIjW4KyiQ
