Spark performance tuning and troubleshooting (6) Spark Troubleshooting

1. Control the size of the reduce buffer to avoid OOM

In the shuffle process, the reduce-side task does not wait until the map-side task has written all of its data to disk before pulling it. Instead, as soon as the map side has written a small amount of data, the reduce-side task pulls that portion and immediately performs subsequent operations such as aggregation and applying operator functions.

How much data the reduce task can pull at a time is determined by the buffer used for pulling data on the reduce side: the pulled data is first placed in this buffer, and subsequent processing then works from there. The default size of the buffer is 48MB.

The reduce-side task pulls and computes at the same time; it does not necessarily fill the entire 48MB buffer on every pull. Most of the time it pulls part of the data and processes it right away.

Although increasing the buffer size on the reduce side reduces the number of pulls and improves shuffle performance, sometimes the amount of data on the map side is very large and is written very quickly. In that case, every task on the reduce side may pull data up to its buffer's maximum limit of 48MB. Combined with the aggregation code executed on the reduce side, which may create a large number of objects, this can easily lead to a memory overflow (OOM).

If a memory overflow occurs on the reduce side, we can consider reducing the size of the reduce-side data buffer, for example to 12MB.
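The buffer described here is controlled by the spark.reducer.maxSizeInFlight parameter (48m by default). A minimal sketch of lowering it, in the same SparkConf style used elsewhere in this article:

val conf = new SparkConf()
 .set("spark.reducer.maxSizeInFlight", "12m") // shrink the reduce-side pull buffer from its 48m default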

This kind of problem has occurred in real production environments and is a typical case of trading performance for successful execution. A smaller reduce-side pull buffer is less likely to cause OOM, but correspondingly the reduce side performs more pulls, which brings more network transmission overhead and reduces performance.

Note: first make sure the job can run at all, and only then consider optimizing its performance.

2. Shuffle file pull failure caused by JVM GC

In Spark jobs, a shuffle file not found error sometimes occurs; it is a very common error. In some cases, after the error appears, simply re-running the job makes it go away.

A likely cause is the following: during a shuffle, a task in a later stage tries to pull data from the Executor where a task of the previous stage ran, but that Executor happens to be performing GC at that moment. GC stops all working threads in the Executor, including the BlockManager and the Netty-based network communication, so the later task waits a long time without managing to pull the data and reports the shuffle file not found error. On the second execution the error no longer appears.

You can tune shuffle behavior by adjusting two parameters: the number of retries for pulling data on the reduce side and the wait interval between retries. Increasing both values makes the reduce side retry the pull more times and wait longer after each failure.

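// Defaults are 3 retries and a 5s wait; raising both gives an Executor that is busy with GC more time to respond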
val conf = new SparkConf()
 .set("spark.shuffle.io.maxRetries", "6")
 .set("spark.shuffle.io.retryWait", "6s")

3. Solving various serialization errors

When a Spark job reports an error during execution and the error message contains Serializable or similar words, the error may be caused by a serialization problem.

Serialization problems require attention to the following three points:

(1) A custom class used as the element type of an RDD must be serializable;

(2) External custom variables used inside operator functions must be serializable;

(3) Third-party types that do not support serialization, such as Connection, must not be used as RDD element types or inside operator functions (see the sketch after this list).
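As an illustration of these three points, here is a minimal sketch; the StudentInfo class, the JDBC URL, and the database written to are hypothetical and only for demonstration:

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

// (1) A custom class used as an RDD element type must be serializable
class StudentInfo(val id: Int, val name: String) extends Serializable

object SerializationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SerializationDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // (2) An external variable captured by an operator function must also be serializable;
    // a plain String already is
    val namePrefix = "student_"
    val students = sc.parallelize(1 to 100).map(i => new StudentInfo(i, namePrefix + i))

    // (3) Do not put a non-serializable Connection into the closure itself;
    // create one per partition on the executor instead
    students.foreachPartition { iter =>
      val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pwd") // hypothetical URL
      iter.foreach(s => ()) // write each StudentInfo to the database via conn (omitted)
      conn.close()
    }

    sc.stop()
  }
}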

4. Solving the problem caused by an operator function returning NULL

In some operator functions we need to have a return value, but for certain records we do not actually want to return anything. If we return NULL directly, an error such as a Scala.Math(NULL) exception will be reported.

If you run into such a situation and do not want a return value, you can solve it in the following ways:

(1) Return a special value instead of NULL, such as -1;
(2) After obtaining the RDD through the operator, perform a filter operation on it to filter out the records whose value is -1;
(3) After the filter operator, call the coalesce operator to merge the now partially-empty partitions (see the sketch after this list).
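A minimal sketch of this three-step approach; the condition used to decide which records to drop is made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object SentinelValueDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SentinelValueDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val raw = sc.parallelize(1 to 1000, 20)

    // (1) Return the sentinel value -1 instead of NULL for records we do not want to keep
    val mapped = raw.map(x => if (x % 2 == 0) x * 10 else -1)

    // (2) Filter out the sentinel values
    // (3) Coalesce to merge the now partially-empty partitions into fewer partitions
    val cleaned = mapped.filter(_ != -1).coalesce(10)

    cleaned.take(5).foreach(println)
    sc.stop()
  }
}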

5. Network card traffic surge caused by YARN-client mode

The operating principle of YARN-client mode is shown in the figure below.
(Figure: YARN-client mode operating principle)

In YARN-client mode, the Driver is started on the local machine. The Driver is responsible for all task scheduling and needs to communicate frequently with the many Executors on the YARN cluster.

Assume there are 100 Executors and 1000 tasks, so each Executor is assigned 10 tasks. The Driver then communicates frequently with the 1000 tasks running on the Executors; the amount of communication data is very large and the communication is extremely frequent. As a result, the network card traffic of the local machine may surge while the Spark job runs because of this frequent, heavy network communication.

Note that YARN-client mode should only be used in a test environment. The reason for using it there is that you can see detailed and comprehensive log information; by viewing the logs you can pin down problems in the program and thus avoid failures in the production environment.

In a production environment, YARN-cluster mode must be used. YARN-cluster mode does not cause a surge in the local machine's network card traffic; if a network communication problem occurs in YARN-cluster mode, it is the operations team's job to solve it.

6. JVM memory overflow preventing execution in YARN-cluster mode

The operating principle of YARN-cluster mode is shown in the figure below.
(Figure: YARN-cluster mode operating principle)

When a Spark job contains SparkSQL, it may run in YARN-client mode yet fail to be submitted and run in YARN-cluster mode, reporting an OOM error.

In YARN-client mode, the Driver runs on the local machine, and the JVM PermGen (permanent generation, relevant before JDK 1.8) configuration that Spark uses comes from the spark-class file on the local machine, which sets the permanent generation size to 128MB. That is enough. In YARN-cluster mode, however, the Driver runs on a node of the YARN cluster and uses default settings that have not been configured, where the PermGen size is only 82MB.

SparkSQL needs to perform complex SQL semantic analysis, syntax tree conversion, and so on. If the SQL statement itself is very complex, this is likely to cause performance loss and memory pressure, and the PermGen usage in particular will grow.

Therefore, if PermGen usage is just over 82MB but still under 128MB, the job will run in YARN-client mode but not in YARN-cluster mode.

To solve this problem, increase the capacity of PermGen by setting the relevant parameters in the spark-submit script, as follows.

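# Added to the spark-submit script; these PermGen flags only apply before JDK 1.8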
--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"

Through the above setting, the Driver's permanent generation is given an initial size of 128MB and a maximum of 256MB, which avoids the problem described above.

7. Solving the JVM stack memory overflow caused by SparkSQL

When a SparkSQL statement contains hundreds or even thousands of or keywords, a JVM stack memory overflow may occur on the Driver side.

A JVM stack memory overflow is basically caused by too many levels of method calls, producing a large amount of very deep recursion that exceeds the JVM's stack depth limit. (We guess that when SparkSQL has a large number of or keywords, the handling of or while parsing the SQL, such as converting it into a syntax tree or generating the execution plan, is recursive; with too many or keywords, a great deal of recursion occurs.)

In this case, it is recommended to split the single sql statement into multiple sql statements and execute them separately, trying to keep each sql statement under 100 clauses. Based on trials in a real production environment, keeping the number of or keywords in one sql statement under 100 usually does not cause a JVM stack memory overflow.
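A minimal sketch of splitting such a statement into batches; the orders table, its id column, and the list of values are hypothetical:

import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitOrQueryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitOrQueryDemo").getOrCreate()

    // Values that would otherwise end up as thousands of "or" clauses in one statement
    val ids: Seq[Int] = 1 to 5000

    // Run the query in batches of fewer than 100 "or" clauses and union the partial results
    val partials: Iterator[DataFrame] = ids.grouped(99).map { batch =>
      val predicates = batch.map(id => s"id = $id").mkString(" OR ")
      spark.sql(s"SELECT * FROM orders WHERE $predicates")
    }
    val result = partials.reduce(_ union _)

    result.count()
    spark.stop()
  }
}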

8. RDD data loss after persistence

Spark persistence works fine in most cases, but sometimes data may be lost. If data is lost, the lost data has to be recalculated and then cached and used again. To avoid data loss, you can choose to checkpoint the RDD, that is, persist a copy of the data to a fault-tolerant file system (for example, HDFS).

After an RDD is both cached and checkpointed, if the cache is found to be lost, the checkpoint data is checked first; if it exists, the checkpoint data is used instead of recalculating. In other words, the checkpoint can be regarded as a safeguard mechanism for the cache: if the cache fails, the checkpoint data is used.

The advantage of using a checkpoint is that it improves the reliability of the Spark job: once there is a problem with the cache, there is no need to recalculate the data. The disadvantage is that checkpointing requires writing the data to a file system such as HDFS, which has a considerable performance cost.
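A minimal sketch of combining cache and checkpoint; the checkpoint directory and the input path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointDemo")
    val sc = new SparkContext(conf)

    // Checkpoint data is written to a fault-tolerant file system such as HDFS
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical directory

    val rdd = sc.textFile("hdfs:///data/input") // hypothetical input path
      .map(_.toUpperCase)

    rdd.cache()      // keep a copy in memory for fast reuse
    rdd.checkpoint() // also persist a copy to HDFS as a safeguard for the cache
    rdd.count()      // an action triggers both the cache and the checkpoint

    sc.stop()
  }
}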


Origin blog.csdn.net/weixin_43520450/article/details/108652479