Spark Troubleshooting

Troubleshooting One: Control the reduce-side buffer size to avoid OOM

During shuffle, a reduce-side task does not wait until the map-side tasks have written all of their data to disk before pulling it. Instead, as soon as the map side has written a small amount of data, the reduce-side task pulls that small portion and immediately performs the subsequent aggregation and operator-function computation on it.

How much data a reduce-side task can pull at one time is determined by the reduce-side pull buffer: the pulled data is first placed in this buffer and then processed. The default buffer size is 48MB.

The reduce-side task pulls and computes at the same time, so it does not necessarily fill the buffer with 48MB of data on every fetch; most of the time it pulls only part of that amount and processes it.

Although increasing the buffer size can reduce the number of pulls and thus improve shuffle performance, sometimes the map side produces a very large amount of data very quickly. In that case, all of the reduce-side tasks may fill their buffers to the 48MB maximum while pulling, and together with the objects created by the aggregation code executed on the reduce side, this can easily cause a memory overflow, that is, an OOM.

If a memory overflow occurs on the reduce side, consider reducing the size of the reduce-side pull buffer, for example to 12MB.
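As an illustration, a minimal sketch of lowering this buffer follows, assuming it is the one controlled by the spark.reducer.maxSizeInFlight property (48m by default); the application name is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// Shrink the reduce-side pull buffer from the default 48m to 12m, trading
// more fetch rounds for a smaller memory footprint during shuffle reads.
val conf = new SparkConf()
  .setAppName("reduce-buffer-tuning")            // hypothetical application name
  .set("spark.reducer.maxSizeInFlight", "12m")

val sc = new SparkContext(conf)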

This kind of problem has occurred in real production environments, and the fix is a typical case of trading performance for successful execution. With a smaller reduce-side pull buffer, an OOM is less likely, but the reduce side has to pull more times, which adds network transfer overhead and degrades performance.

Note that the first priority is to make sure the job can run at all; only then should you consider optimizing its performance.

Troubleshooting Two: Shuffle file fetch failures caused by JVM GC

In later stages of a Spark job, the error "shuffle file not found" sometimes appears. It is a very common error, and sometimes simply re-running the job makes it go away.

A possible cause is the following: during the shuffle operation, a task of a later stage goes to the Executor where the corresponding task of the previous stage ran in order to pull its data, but that Executor happens to be performing GC. GC stops all work inside the Executor, including the BlockManager and the Netty-based network communication, so the task fails to pull the data within the expected time and reports the "shuffle file not found" error. When the job is executed a second time, the GC is no longer in progress, so the error does not occur again.

Shuffle behavior can be tuned through two parameters: the number of retries for reduce-side data pulls and the wait interval between retries. Increasing these values makes the reduce side retry more times and wait longer after each failed attempt.

Listing 4-1 Adjusting shuffle retry parameters for fetch failures caused by JVM GC

val conf = new SparkConf()
  // Retry a failed shuffle fetch up to 60 times instead of the default 3,
  // so a long GC pause on the remote Executor does not immediately fail the task.
  .set("spark.shuffle.io.maxRetries", "60")
  // Wait 60s between retries instead of the default 5s.
  .set("spark.shuffle.io.retryWait", "60s")

Troubleshooting Three: Resolving errors caused by serialization

When a Spark job reports an error during execution and the error message contains words such as "Serializable", the problem may be caused by serialization.

Pay attention to the following three points regarding serialization (a short sketch follows the list):

  1. If the element type of an RDD is a custom class, that class must be serializable;
  2. If an external custom variable is used inside an operator function, that variable must be serializable;
  3. Types that do not support serialization, such as third-party Connection objects, must not be used as the element type of an RDD or inside operator functions.
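The following sketch illustrates points 1 and 2; the class, application, and variable names are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// Point 1: a custom class used as an RDD element type must be serializable.
class Student(val id: Int, val name: String) extends Serializable

object SerializationExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialization-example"))

    // Point 2: an external variable captured by an operator function is shipped
    // to the executors, so it must be serializable as well (String is).
    val prefix = "student-"

    val students = sc.parallelize(Seq(new Student(1, "Tom"), new Student(2, "Amy")))
    val labelled = students.map(s => prefix + s.name)
    labelled.collect().foreach(println)

    sc.stop()
  }
}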

Troubleshooting Four: Solving problems caused by operator functions returning NULL

Some operator functions require us to return a value for every record, but in certain cases we do not want to return anything for a particular record. If we directly return NULL in that situation, an error is thrown, for example a Scala.Math(NULL) exception.

If you encounter a case where you do not want to return a value, it can be handled in the following way (see the sketch after the list):

  1. Return a special value such as "-1" instead of NULL;

  2. After obtaining the resulting RDD from the operator, apply a filter operation to it and filter out the records whose value is -1;

  3. After using the filter operator, call the coalesce operator to optimize the number of partitions.
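A minimal sketch of these three steps follows; the input and output paths and the "ERROR" condition are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object NullReturnExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("null-return-example"))
    val rawLines = sc.textFile("hdfs:///logs/app.log")   // hypothetical input path

    // Step 1: return the special marker "-1" instead of NULL.
    val parsed = rawLines.map { line =>
      if (line.contains("ERROR")) line else "-1"
    }

    // Step 2: filter out the "-1" markers.
    // Step 3: coalesce to compact the now-sparser partitions.
    val errors = parsed.filter(_ != "-1").coalesce(10)

    errors.saveAsTextFile("hdfs:///logs/errors")          // hypothetical output path
    sc.stop()
  }
}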

Troubleshooting Five: Solving the surge in NIC traffic caused by YARN-client mode

The operating principle of YARN-client mode is shown in the figure below:

[Figure: YARN-client mode architecture]

In YARN-client mode, the Driver runs on the local machine that submits the job. Because the Driver is responsible for all task scheduling, it needs to communicate frequently with the many Executors in the YARN cluster.

Suppose there are 100 Executors and 1,000 tasks, so each Executor is assigned 10 tasks. The Driver then has to communicate frequently with the 1,000 tasks running on the Executors; the volume of communication data is very large and the frequency is particularly high. As a result, while the Spark job is running, the large amount of frequent network communication can cause a surge of traffic on the local machine's network card.

Note that YARN-client mode should only be used in test environments. The reason for using it there is that it provides detailed and complete log output locally; by inspecting the logs, problems in the program can be located and fixed, avoiding failures in the production environment.

In production, YARN-cluster mode must be used. In YARN-cluster mode there is no traffic surge on the local machine's network card; if network communication problems do exist under YARN-cluster mode, they are for the operations team to resolve.

 

Troubleshooting Six: Solving the JVM memory overflow that prevents execution in YARN-cluster mode

The operating principle of YARN-cluster mode is shown in the figure below:

[Figure: YARN-cluster mode architecture]

When a Spark job contains SparkSQL, you may encounter a situation where it runs in YARN-client mode but cannot run when submitted in YARN-cluster mode (an OOM error is reported).

In YARN-client mode, the Driver runs on the local machine, and the PermGen configuration of the JVM that Spark uses comes from the spark-class file on the local machine, where the JVM permanent generation size is 128MB. This is not a problem. In YARN-cluster mode, however, the Driver runs on a node of the YARN cluster and uses the unconfigured default settings, in which the PermGen (permanent generation) size is 82MB.

SparkSQL internally performs very complex SQL semantic analysis, syntax tree transformation, and so on. If the SQL statement itself is very complex, this is likely to cause performance loss and heavy memory usage, in particular heavy usage of PermGen.

So if the PermGen usage exceeds 82MB but is still below 128MB, the job can run in YARN-client mode but not in YARN-cluster mode.

The way to solve this problem is to increase the PermGen capacity by setting the relevant parameter in the spark-submit script, as shown in Listing 4-2.

Listing 4-2 Increasing the Driver PermGen size in the spark-submit script

--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"

With this setting, the Driver's permanent generation size is 128MB by default and 256MB at maximum, which avoids the problem described above.

Troubleshooting Seven: Solving the JVM stack overflow caused by SparkSQL

When a SparkSQL statement contains hundreds or thousands of "or" keywords, a JVM stack overflow may occur on the Driver side.

A JVM stack overflow is basically caused by too many levels of method calls, producing a large amount of very deep recursion that exceeds the JVM's stack depth limit. (We suspect that when SparkSQL parses a statement with a large number of "or" clauses, for example while converting it into a syntax tree or generating the execution plan, the handling of "or" is recursive, so a very large number of "or" clauses leads to a great deal of recursion.)

In this case, it is recommended to split the single SQL statement into multiple SQL statements, each with at most 100 clauses. Based on experience in real production environments, keeping the number of "or" keywords in a single SQL statement under 100 usually avoids the JVM stack overflow.
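One possible way to perform this split is sketched below; the table name access_log, the column user_id, and the batch size are assumptions for illustration.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("split-or-example").getOrCreate()

// Instead of one SQL statement with hundreds of OR clauses, group the
// predicate values into batches of at most 100 and run one query per batch.
val userIds: Seq[Int] = 1 to 500
val batches: Seq[DataFrame] = userIds.grouped(100).toSeq.map { batch =>
  val predicate = batch.map(id => s"user_id = $id").mkString(" OR ")
  spark.sql(s"SELECT * FROM access_log WHERE $predicate")
}

// Union the per-batch results back into a single DataFrame.
val result = batches.reduce(_ union _)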

Troubleshooting Eight: Using persistence and checkpointing

Spark persistence works without problems in most cases, but sometimes the cached data may be lost. Once data is lost, the lost data has to be recomputed and then cached and used again. To guard against data loss, you can checkpoint the RDD, that is, persist a copy of the data to a fault-tolerant file system such as HDFS.

After an RDD has been both cached and checkpointed, if the cache is found to be lost, Spark first checks whether the checkpoint data exists; if it does, the checkpoint data is used instead of recomputing. In other words, checkpointing can be viewed as a safeguard for the cache: if the cache fails, the checkpoint data is used.

The advantage of checkpointing is that it improves the reliability of the Spark job: if the cache is lost, the data does not have to be recomputed. The disadvantage is that checkpointing writes the data to a file system such as HDFS, which consumes considerable performance.
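A minimal sketch of combining caching with checkpointing follows; the HDFS paths are assumed examples.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-example"))
sc.setCheckpointDir("hdfs:///spark/checkpoint")   // fault-tolerant checkpoint directory

val rdd = sc.textFile("hdfs:///data/input").map(_.trim)

// Cache first so that the separate job launched by checkpointing reads the
// cached data instead of recomputing the RDD from scratch.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.checkpoint()

rdd.count()   // an action materializes both the cache and the checkpoint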
