Spark performance tuning and fault handling (2): Spark operator tuning

1. mapPartitions

The ordinary map operator operates on each element of an RDD, while the mapPartitions operator operates on each partition. With the ordinary map operator, if a partition holds 10,000 records, the function passed to map is executed 10,000 times, once per element.
With the mapPartitions operator, because one task processes one partition of the RDD, each task executes the function only once, and the function receives the entire partition's data at once, which is more efficient.
For example, suppose you want to write all the data in an RDD to a database through JDBC. With the map operator, a database connection must be created for every element in the RDD, which wastes a great deal of resources; with the mapPartitions operator, only one database connection needs to be established per partition of data.
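The contrast above can be sketched in plain Python. This is not Spark's actual API: partitions are modeled as ordinary lists, and a counter stands in for the JDBC connections that each style would open.

```python
# Hypothetical sketch (not Spark's real API): partitions are plain lists,
# and the counters stand in for database connections being opened.

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # an "RDD" with 3 partitions

def write_with_map(parts):
    """map-style: one (simulated) connection per element."""
    connections = 0
    for part in parts:
        for record in part:
            connections += 1   # open a connection just for this element
            _ = record         # ... write the record, then close
    return connections

def write_with_map_partitions(parts):
    """mapPartitions-style: one (simulated) connection per partition."""
    connections = 0
    for part in parts:
        connections += 1       # open one connection for the whole partition
        for record in part:
            _ = record         # write every record over that connection
    return connections

print(write_with_map(partitions))             # 9 connections, one per element
print(write_with_map_partitions(partitions))  # 3 connections, one per partition
```

The same 9 records are written either way; only the amount of connection setup and teardown differs.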

The mapPartitions operator also has a drawback. With an ordinary map, records are processed one at a time; if memory runs short after, say, 2,000 records have been processed, those 2,000 records can already be garbage collected. With mapPartitions, however, the function processes an entire partition's data at once; when the data volume is very large and memory runs out, nothing can be reclaimed, and the job may fail with an OOM (out-of-memory) error.

Therefore, the mapPartitions operator is suitable when the data volume is not especially large; in that case it gives a nice performance improvement. (When the data volume is large, using mapPartitions can lead straight to an OOM.)

In a project, you should first estimate the amount of data in the RDD, the amount per partition, and the memory allocated to each Executor. If resources permit, consider using mapPartitions instead of map.

2. foreachPartition optimizes database operations

In production environments, the foreachPartition operator is usually used to write to databases; its characteristics can be exploited to optimize write performance.

If the foreach operator is used for database writes, a database connection is established for every record as foreach traverses the RDD, which is a great waste of resources. For database write operations, we should therefore use the foreachPartition operator.

It is very similar to the mapPartitions operator: foreachPartition treats each partition of the RDD as the unit of traversal and processes one partition's data at a time. For database-related operations, only one database connection needs to be created per partition, as shown in the following figure.

Using the foreachPartition operator brings the following performance improvements:

(1) the function we write processes an entire partition's data at once;
(2) for each partition of data, a single database connection is created;
(3) the SQL statement is sent to the database only once, with multiple sets of parameters.
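The three points above can be illustrated with a small sketch. Here sqlite3 (from the standard library) stands in for a real JDBC target, and a partition is just a Python list; the point is one connection and one batched statement per partition rather than per record.

```python
# Illustrative sketch only: sqlite3 substitutes for a real JDBC database,
# and a "partition" is a plain list of records.
import sqlite3

partitions = [[("a", 1), ("b", 2)], [("c", 3)], [("d", 4), ("e", 5), ("f", 6)]]

def write_partition(records, db_path=":memory:"):
    """foreachPartition-style write: one connection, one batched INSERT."""
    conn = sqlite3.connect(db_path)            # single connection per partition
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT, v INTEGER)")
    # one SQL statement sent with many parameter sets
    conn.executemany("INSERT INTO kv VALUES (?, ?)", records)
    conn.commit()
    n = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
    conn.close()
    return n

print(write_partition(partitions[2]))  # 3 rows written over one connection
```

In real Spark code the body of `write_partition` would be the function passed to `foreachPartition`, executed once per partition on the executors.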

In production environments, essentially all database operations are done with the foreachPartition operator. It does share a problem with mapPartitions: if a partition holds a particularly large amount of data, it may cause an OOM, i.e. memory overflow.

3. The use of filter and coalesce

In Spark jobs, we often use the filter operator to filter the data in an RDD. At the start of the job, each partition holds a similar amount of data, but after filtering, the amounts may differ greatly across partitions, as shown in the following figure.
From this we can identify two problems:

(1) Each partition now holds less data. If the data is still processed with as many tasks as there were partitions before the filter, task compute resources are wasted;

(2) The partitions now hold different amounts of data, so each subsequent task processes a different amount of data, which easily causes data skew. For example, suppose that after filtering, the second partition holds only 100 records while the third holds 800. Under the same processing logic, the task for the third partition handles 8 times as much data as the task for the second, so their running times may also differ severalfold; this is the data-skew problem.

Let us analyze these two problems separately:

(1) For the first problem, since each partition now holds less data, we want to redistribute the partition data; for example, consolidate the original 4 partitions into 2, so that only 2 tasks are needed afterwards, avoiding wasted resources.

(2) The solution to the second problem is very similar to the first: redistribute the partition data so that each partition holds roughly the same amount, which avoids the data-skew problem.

How can these ideas be realized? We need the coalesce operator. Both repartition and coalesce can repartition an RDD; repartition is simply coalesce with shuffle forced to true. By default, coalesce does not perform a shuffle, but a shuffle can be enabled through its parameter.

Suppose we want to change the number of partitions from A to B by repartitioning. There are the following cases:

  1. A > B (many partitions merged into fewer partitions)
    ① A and B differ only slightly.
    Use coalesce without the shuffle process.
    ② A and B differ greatly.
    coalesce without a shuffle would still work, but the merge would perform poorly, so it is recommended to set the second parameter of coalesce to true, i.e. enable the shuffle process.

  2. A < B (few partitions split into many partitions).
    Use repartition. If coalesce is used instead, shuffle must be set to true, otherwise the coalesce has no effect.
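The decision rules above can be encoded in a tiny helper. This is a hypothetical utility, not a Spark API, and the threshold for "A and B differ greatly" is an assumption chosen here for illustration.

```python
# Hypothetical helper (not part of Spark): given current partition count a
# and target count b, suggest which repartitioning call to use, following
# the rules in the list above. big_gap_ratio is an assumed threshold.

def suggest_repartition_call(a: int, b: int, big_gap_ratio: float = 10.0) -> str:
    if a > b:
        if a / b < big_gap_ratio:
            return "coalesce(b)"               # small gap: no shuffle needed
        return "coalesce(b, shuffle=True)"     # big gap: enable the shuffle
    if a < b:
        return "repartition(b)"                # equivalent to coalesce(b, shuffle=True)
    return "no-op"                             # already at the target count

print(suggest_repartition_call(100, 50))    # coalesce(b)
print(suggest_repartition_call(10000, 10))  # coalesce(b, shuffle=True)
print(suggest_repartition_call(10, 200))    # repartition(b)
```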

After the filter operation, we can use the coalesce operator to shrink the number of partitions when their data volumes differ, making each partition's data as uniform and compact as possible for the tasks that follow. This improves performance to a certain extent.
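What coalesce achieves after a filter can be simulated in plain Python: four uneven filtered partitions are compacted into two with roughly equal sizes. Spark performs this data movement internally; this sketch only illustrates the outcome, and the greedy packing strategy is an illustrative choice, not Spark's actual algorithm.

```python
# Pure-Python simulation of coalescing uneven filtered partitions.
# The greedy smallest-bucket-first packing is an assumption for illustration;
# Spark's internal merge strategy differs.

def coalesce_partitions(parts, num_target):
    """Pack partitions into num_target buckets, always filling the lightest."""
    buckets = [[] for _ in range(num_target)]
    for part in sorted(parts, key=len, reverse=True):
        min(buckets, key=len).extend(part)  # add to the currently lightest bucket
    return buckets

# Partition sizes after an uneven filter, echoing the example above.
filtered = [list(range(400)), list(range(100)), list(range(800)), list(range(300))]
merged = coalesce_partitions(filtered, 2)
print([len(b) for b in merged])  # [800, 800] -- two balanced partitions
```

The total record count is unchanged; only the grouping into partitions becomes even, so the two downstream tasks get comparable workloads.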

Note: local mode simulates a cluster within a single process and already applies some internal optimization of parallelism and partition counts, so there is no need to set them in local mode.

4. repartition solves the problem of low Spark SQL parallelism

The parallelism of Spark SQL cannot be specified by the user. Spark SQL automatically sets the parallelism of its stage according to the number of HDFS blocks in the files backing the corresponding Hive table, and sometimes this default parallelism is too low, causing the job to run slowly.

Since the parallelism of the stage containing Spark SQL cannot be set manually, a large data volume combined with complex transformation logic in that stage and a small automatically chosen task count means each task must process a large amount of data through very complex logic. This typically shows up as the first stage, the one containing Spark SQL, being very slow, while the subsequent stages without Spark SQL run very fast.

In order to solve the problem that Spark SQL cannot set the degree of parallelism and the number of tasks, we can use the repartition operator.

There is indeed no way to change the parallelism or task count of the Spark SQL step itself. However, we can immediately apply the repartition operator to the RDD produced by the Spark SQL query, repartitioning it into more partitions. From the repartitioned RDD onward, Spark SQL is no longer involved, so the stage's parallelism equals the value you set manually. This avoids forcing the Spark SQL stage to process a large amount of data and execute complex logic with only a few tasks.
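The effect of the repartition step can be sketched in plain Python (not Spark's API): records that arrived in a few partitions from the Spark SQL stage are redistributed round-robin across more partitions, so more tasks can work in parallel afterwards.

```python
# Pure-Python sketch of repartitioning. Round-robin distribution is used
# here for illustration; Spark's repartition shuffles data by random keys.

def repartition_round_robin(parts, num_target):
    """Spread all records across num_target new partitions."""
    out = [[] for _ in range(num_target)]
    i = 0
    for part in parts:
        for record in part:
            out[i % num_target].append(record)
            i += 1
    return out

sql_output = [list(range(50)), list(range(50, 100))]  # 2 partitions from "Spark SQL"
reparted = repartition_round_robin(sql_output, 8)
print([len(p) for p in reparted])  # eight partitions of 12-13 records each
```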

The before and after comparison of using the repartition operator is shown in the figure below:

5. reduceByKey local aggregation

Compared with an ordinary shuffle operation, reduceByKey has a notable feature: it performs local aggregation on the map side. The map side first combines its local data and only then writes it to the files created for each task of the next stage; in other words, the reduceByKey function is applied on the map side to the values of each key.

The execution process of the reduceByKey operator is shown in the following figure.

The performance improvements from using reduceByKey are as follows:

(1) After local aggregation, the amount of data on the map side is reduced, which reduces disk IO and disk-space usage;
(2) After local aggregation, the amount of data pulled by the next stage is reduced, which reduces the volume of data transferred over the network;
(3) After local aggregation, the memory used to cache data on the reduce side is reduced;
(4) After local aggregation, the amount of data to aggregate on the reduce side is reduced.

Based on the local aggregation feature of reduceByKey, we should consider using reduceByKey instead of other shuffle operators, such as groupByKey.
The operation of reduceByKey versus groupByKey is shown in the following figure.

As the figure shows, groupByKey performs no map-side aggregation: all map-side data is shuffled to the reduce side, where the aggregation then takes place. Because reduceByKey aggregates on the map side, the amount of data transferred over the network is smaller, so it is clearly more efficient than groupByKey.
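The difference in shuffle volume can be counted with a small word-count-style simulation in plain Python (not Spark): without map-side combining, every record crosses the network; with it, at most one record per distinct key leaves each map partition.

```python
# Conceptual comparison of shuffle volume for groupByKey vs reduceByKey,
# simulated with plain lists; Counter plays the role of the map-side combiner.
from collections import Counter

map_partitions = [
    ["a", "b", "a", "a"],
    ["b", "b", "a", "c"],
]

# groupByKey-style: every record is shuffled as-is.
shuffled_without_combine = sum(len(p) for p in map_partitions)

# reduceByKey-style: each map task combines locally per key first,
# so at most one record per distinct key leaves each partition.
shuffled_with_combine = sum(len(Counter(p)) for p in map_partitions)

print(shuffled_without_combine)  # 8 records shuffled
print(shuffled_with_combine)     # 5 records shuffled (2 + 3 distinct keys)
```

The final per-key totals are identical either way; only the amount of data crossing the network differs.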


Origin blog.csdn.net/weixin_43520450/article/details/108650220