Spark performance tuning and fault handling (1): Spark regular performance tuning

1. Optimal resource allocation

The first step in Spark performance tuning is to allocate more resources to the task. Within a certain range, increasing the allocated resources improves performance proportionally. Once the optimal resource configuration has been achieved, consider the performance tuning strategies discussed later on that basis.

Resource allocation is specified in the script used to submit the Spark task. A standard Spark submission script looks like this:

/usr/opt/modules/spark/bin/spark-submit \
--class com.atguigu.spark.Analysis \
--num-executors 80 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 3 \
/usr/opt/modules/spark/jar/spark.jar

The resources that can be allocated are shown in the following table:

Parameter         | Resource it allocates
--num-executors   | the number of Executors
--driver-memory   | the amount of memory for the Driver
--executor-memory | the amount of memory for each Executor
--executor-cores  | the number of CPU cores for each Executor

Tuning principle: adjust the resources allocated to the task as close as possible to the maximum of the resources available.

For specific resource allocation, we discuss Spark's two cluster operating modes separately:

The first is Spark Standalone mode. Before submitting a task, you must know (or obtain from the operations team) the resources available to you, and allocate resources in the submit script accordingly. For example, if the cluster has 15 machines, each with 8G of memory and 2 CPU cores, then assign 15 Executors, each allocated 8G of memory and 2 CPU cores.

The second is Spark Yarn mode. Because Yarn uses resource queues for resource allocation and scheduling, the submit script should allocate resources according to the resource queue to which the Spark job is submitted. For example, if the resource queue has 400G of memory and 100 CPU cores, then specify 50 Executors, each allocated 8G of memory and 2 CPU cores.

After adjusting the resources above, the performance improvements obtained are roughly the following:

Resource increased        | Why performance improves
number of Executors       | more tasks can run in parallel
memory per Executor       | more data can be cached, shuffle operations have more memory and spill less to disk, and tasks creating objects trigger less frequent GC
CPU cores per Executor    | each Executor can execute more tasks in parallel
Supplement: a spark-submit script configuration for the production environment

/usr/local/spark/bin/spark-submit \
--class com.atguigu.spark.dataetl \
--num-executors 80 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 3 \
--master yarn-cluster \
--queue root.default \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.core.connection.ack.wait.timeout=300 \
/usr/local/spark/spark.jar



Reference values for the parameters:
--num-executors: 50~100
--driver-memory: 1G~5G
--executor-memory: 6G~10G
--executor-cores: 3
--master: always use yarn-cluster in a real production environment

2. RDD optimization

2.1 RDD reuse

When performing operator operations on RDDs, avoid repeated computation of the same RDD with the same operator and calculation logic. Instead, restructure the RDD computation architecture so that an RDD produced by identical logic is computed once and reused by all downstream computations. (Figures: the RDD computation architecture before and after this optimization.)
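As a minimal sketch in Scala (the input path, log format, and field index are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object RddReuseExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RddReuse")
    val sc = new SparkContext(conf)

    val logs = sc.textFile("hdfs://nameservice/logs/access.log") // hypothetical path

    // Anti-pattern: the same filter + parse logic appears twice, so the
    // lineage is duplicated and the file is parsed twice:
    //   val a = logs.filter(_.contains("ERROR")).map(_.split("\t")(1))
    //   val b = logs.filter(_.contains("ERROR")).map(_.split("\t")(1))

    // Better: build the common RDD once and reuse it for both computations.
    val errorFields = logs.filter(_.contains("ERROR")).map(_.split("\t")(1))
    val counts = errorFields.map((_, 1)).reduceByKey(_ + _)
    val distinctFields = errorFields.distinct()

    counts.collect().foreach(println)
    distinctFields.collect().foreach(println)
    sc.stop()
  }
}

Note that reuse alone only removes the duplicated lineage; without persistence (next subsection) the shared RDD is still recomputed once per action.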

2.2 RDD persistence

In Spark, when operator operations are performed on the same RDD multiple times, the RDD is recomputed from its parent RDDs each time. This situation must be avoided: repeatedly computing the same RDD is a huge waste of resources. Therefore, an RDD that is used multiple times must be persisted. Persistence caches the data of the shared RDD in memory or on disk, and subsequent computations on that RDD then fetch its data directly from memory or disk.

Two points about RDD persistence need explaining:

First, an RDD can be persisted in serialized form. When memory cannot hold the RDD data in full, serialization can reduce the data volume so that it fits entirely in memory.

Second, if data reliability matters a great deal and memory is sufficient, the replica mechanism can be used when persisting the RDD. With replication enabled, each persisted data unit stores an extra copy on another node, providing data fault tolerance: if one copy is lost, there is no need to recompute, the other copy can be used directly.
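A minimal sketch of both points, assuming the errorFields RDD from the reuse sketch above:

import org.apache.spark.storage.StorageLevel

// Point one: serialized storage shrinks the in-memory footprint.
errorFields.persist(StorageLevel.MEMORY_ONLY_SER)

// Point two: levels with the "_2" suffix keep a second replica on another
// node, so a lost copy is recovered without recomputation.
// errorFields.persist(StorageLevel.MEMORY_AND_DISK_2)

errorFields.count() // the first action actually materializes the cache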

2.3 Filter RDDs as early as possible

After obtaining the initial RDD, you should consider filtering out unnecessary data as soon as possible, thereby reducing the memory usage and improving the efficiency of the Spark job.
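As a sketch, assuming the logs RDD from the reuse example and a hypothetical comment-line format:

// Drop blank lines and comments first, so every later stage
// parses, shuffles, and caches less data.
val usefulLogs = logs.filter(line => line.nonEmpty && !line.startsWith("#"))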

3. Broadcast large variables

By default, if an operator inside a task uses an external variable, every task gets its own copy of that variable, which wastes a great deal of memory. On the one hand, if RDDs are persisted later, their data may no longer fit in memory and can only be written to disk, and the disk IO severely hurts performance; on the other hand, when a task creates objects, the heap may not be able to hold them, causing frequent GC. GC stops the worker threads, so Spark pauses for a while, which seriously degrades performance.

Suppose the current job is configured with 20 Executors and 500 tasks, and a 20M variable is used by all tasks. Without broadcast variables, the 500 tasks hold 500 copies: 500 × 20M ≈ 10G of cluster memory. With broadcast variables, each Executor holds one copy: 20 × 20M = 400M in total, roughly a 25-fold reduction in memory consumption.

A broadcast variable keeps a single copy per Executor, shared by all tasks of that Executor, which greatly reduces the number of copies the variable produces.

In the initial stage, the broadcast variable has only one copy, in the Driver. When a task runs and wants to use the broadcast variable's data, it first tries to get the variable from the BlockManager of its local Executor; if it is not present locally, the BlockManager pulls a copy of the variable remotely from the Driver or from the BlockManager of another node, and the local BlockManager then manages it. Afterwards, all tasks of this Executor get the variable directly from the local BlockManager.
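A minimal sketch of the mechanism in use, reusing the sc from the earlier sketch; the lookup map is a hypothetical stand-in for the 20M variable above:

val ids = sc.parallelize(Seq("a", "b", "c"))
val lookup = Map("a" -> 1, "b" -> 2)  // imagine this is the 20M variable
val bcLookup = sc.broadcast(lookup)   // one copy per Executor, not per task

val resolved = ids.map(id => bcLookup.value.getOrElse(id, -1))
resolved.collect().foreach(println)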

4. Kryo serialization

By default, Spark uses Java's serialization mechanism. Java serialization is easy to use and requires no extra configuration; the variables used in operators only need to implement the Serializable interface. However, Java serialization is inefficient: serialization is slow, and the serialized data still occupies a lot of space.

The Kryo serialization mechanism is about 10 times faster than Java serialization. The reason Spark does not use Kryo as the default serialization library is that it does not support serializing all objects, and Kryo requires users to register the types to be serialized before use, which is less convenient. However, starting from Spark 2.0.0, shuffling RDDs of simple types, arrays of simple types, and strings are serialized with Kryo by default.

Example code for registering classes with Kryo is as follows:

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

public class MyKryoRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        // Register each custom class that Kryo should serialize
        kryo.register(StartupReportLogs.class);
    }
}

The example code for configuring Kryo serialization is as follows:

// Create the SparkConf object (the master URL and app name here are placeholders)
val conf = new SparkConf().setMaster("local[*]").setAppName("KryoExample")
// Use the Kryo serialization library; comment out this line to use Java serialization instead
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the custom class set with the Kryo library; comment out this line to use Java serialization instead
conf.set("spark.kryo.registrator", "atguigu.com.MyKryoRegistrator")

5. Adjust the localization wait time

During the running of a Spark job, the Driver allocates the tasks of each stage. According to Spark's task allocation algorithm, Spark wants each task to run on the node holding the data it needs to compute (the idea of data localization), which avoids transferring data over the network. Generally, though, a task may not be assigned to the node holding its data, because the available resources of that node may already be exhausted. In this case, Spark waits for a period of time, 3s by default; if after the specified wait the task still cannot run on the designated node, the locality level is automatically downgraded and Spark tries to assign the task to a node at a worse localization level, for example a node relatively close to the data it needs, and computes there; if the current level still fails, it continues to downgrade.

When the data a task needs is not on the node where the task runs, data transfer occurs. The task requests the data through the BlockManager of its own node; when that BlockManager finds the data is not local, it fetches it over the network, via the transfer component, from the BlockManager of the node holding the data.

Network transfer of data is something we do not want to see: heavy network transfer severely hurts performance. Therefore we may want to adjust the localization wait time. If the target node completes some of its current tasks during the wait, the current task gets a chance to run there, which can improve the overall performance of the Spark job.

Spark's localization levels are shown in the following table:

Level          | Meaning
PROCESS_LOCAL  | process-local: the data and the task computing it are in the same Executor process; best performance
NODE_LOCAL     | node-local: the data and the task are on the same node but not in the same Executor, so data moves between processes
RACK_LOCAL     | rack-local: the data and the task are on different nodes in the same rack, so data moves across the network
NO_PREF        | the data is equally fast to access from anywhere, for example data read from a database
ANY            | the data and the task may be anywhere in the cluster, not even in the same rack; worst performance

During the development phase of a Spark project, you can test the program in client mode, where relatively complete log information is visible locally, and the log clearly shows the data locality level of each task. If most tasks run at PROCESS_LOCAL, no adjustment is needed; but if many tasks run at NODE_LOCAL or ANY, the localization wait time should be tuned: extend it, check whether the locality level of most tasks improves, and observe whether the running time of the Spark job shortens.

Note: more is not always better. Do not extend the localization wait time too far, otherwise the accumulated waiting will itself increase the running time of the Spark job.

The Spark localization wait time is set as follows:

val conf = new SparkConf().set("spark.locality.wait", "6")
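Beyond the global key, Spark also exposes per-level keys that default to the global value; a sketch (values are in seconds):

import org.apache.spark.SparkConf

val tuned = new SparkConf()
  .set("spark.locality.wait", "6")         // global wait before downgrading locality
  .set("spark.locality.wait.process", "6") // wait specifically at PROCESS_LOCAL
  .set("spark.locality.wait.node", "6")    // wait specifically at NODE_LOCAL
  .set("spark.locality.wait.rack", "6")    // wait specifically at RACK_LOCAL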


Origin: blog.csdn.net/weixin_43520450/article/details/108649183