Spark Kernel Analysis (6): Analysis of the Spark Shuffle Operating Principle

1. The core points of Shuffle

1.1 ShuffleMapStage and FinalStage

When stages are divided, the last stage is called the FinalStage, which is essentially a ResultStage object; all preceding stages are called ShuffleMapStages.

The end of a ShuffleMapStage is accompanied by writing shuffle files to disk.

A ResultStage basically corresponds to an action operator in the code, i.e., it applies a function to the datasets of the individual partitions of an RDD and marks the end of a job's execution.

1.2 Number of tasks in Shuffle

1. Determining the number of tasks on the map side

The number of tasks in the shuffle process is determined by the number of RDD partitions, and the number of RDD partitions is closely related to the parameter spark.default.parallelism.

In YARN cluster mode, if spark.default.parallelism is not set manually, then:

The total number of cores used by all executors, or 2, whichever is larger:
spark.default.parallelism = max(total number of cores used by all executors, 2)

If manual configuration is performed, then:

spark.default.parallelism = the configured value

There is another important configuration:

The maximum number of bytes to pack into a single partition when reading files:
spark.files.maxPartitionBytes = 128 MB (default)

It represents the maximum number of bytes of data that a single RDD partition can hold. For example, if a 400 MB file were split into only two partitions, an error would occur when an action is executed.

When a Spark application runs, a SparkContext is created, and two parameters are derived at the same time from the spark.default.parallelism value obtained above:

sc.defaultParallelism = spark.default.parallelism
sc.defaultMinPartitions = min(spark.default.parallelism, 2)
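
As a quick check (a minimal sketch; the app name, master URL, and parallelism value are placeholders, not taken from the article), the two values can be read straight off a SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-check")          // hypothetical app name
  .setMaster("local[4]")                    // hypothetical master; in YARN cluster mode the default is max(total executor cores, 2)
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)
println(sc.defaultParallelism)              // 8, taken from spark.default.parallelism
println(sc.defaultMinPartitions)            // min(8, 2) = 2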

When the above parameters are determined, the number of RDD partitions can be calculated:
(1) RDD generated by parallelizing a Scala collection

val rdd = sc.parallelize(1 to 10)

If the number of partitions is not specified in the parallelize call, then: number of RDD partitions = sc.defaultParallelism
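
A small sketch of this case (reusing the sc from the earlier sketch):

val rdd1 = sc.parallelize(1 to 10)       // partition count = sc.defaultParallelism
val rdd2 = sc.parallelize(1 to 10, 4)    // partition count explicitly set to 4
println(rdd1.getNumPartitions)
println(rdd2.getNumPartitions)           // 4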

(2) RDD generated by textFile on the local file system

val rdd = sc.textFile("path/file")

Number of RDD partitions = max(number of splits of the local file, sc.defaultMinPartitions)

(3) RDD generated from a file in HDFS

Number of RDD partitions = max(number of blocks of the HDFS file, sc.defaultMinPartitions)
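
A sketch covering the two file-based cases above; the paths are hypothetical, and minPartitions can also be passed explicitly as the second argument of textFile:

val localRdd = sc.textFile("path/file")                            // max(number of local file splits, sc.defaultMinPartitions)
val hdfsRdd = sc.textFile("hdfs://namenode:8020/path/file", 10)    // at least 10 partitions, and at least one per HDFS block
println(localRdd.getNumPartitions)
println(hdfsRdd.getNumPartitions)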

(4) RDD obtained by reading data from an HBase table

Number of RDD partitions = number of regions of the HBase table

(5) RDD obtained from a DataFrame created by reading a JSON (or Parquet, etc.) file

Number of RDD partitions = number of blocks that the file occupies in the file system
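
A sketch of this case, assuming a SparkSession named spark and a hypothetical JSON file path:

val df = spark.read.json("hdfs://namenode:8020/data/events.json")
println(df.rdd.getNumPartitions)    // roughly the number of blocks the file occupies in the file system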

(6) Number of partitions of the RDDs that Spark Streaming creates from Kafka messages

Based on Receiver:

In the Receiver-based approach, Spark partitions and Kafka partitions are not related, so increasing the number of partitions of a topic only increases the number of threads processing the topics consumed by a single Receiver; it does not increase Spark's parallelism when processing the data.

Based on DirectDStream:

Spark creates as many RDD partitions as there are Kafka partitions and reads data from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.
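
A hedged sketch of the direct approach using the spark-streaming-kafka-0-10 API; the broker address, consumer group, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",              // hypothetical broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "shuffle-demo"                        // hypothetical consumer group
)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))
// Each micro-batch RDD has exactly one partition per Kafka partition of the topic.
stream.foreachRDD(rdd => println(rdd.getNumPartitions))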

2. Determining the number of tasks on the reduce side

For data aggregation on the reduce side, some aggregation operators allow the parallelism of the reduce tasks to be specified manually. If it is not specified, the partition count of the last map-side RDD is used as the partition count, and this partition count determines the number of reduce-side tasks.
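
For example (a minimal word-count style sketch with a hypothetical input path), the reduce-side parallelism can be passed directly to the operator:

val pairs = sc.textFile("path/file").flatMap(_.split(" ")).map((_, 1))
val counts = pairs.reduceByKey(_ + _)         // no argument: parallelism follows the upstream RDD (or spark.default.parallelism if set)
val counts20 = pairs.reduceByKey(_ + _, 20)   // explicit argument: the reduce side runs 20 tasks
println(counts20.getNumPartitions)            // 20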

1.3 Reading data on the reduce side

From the way stages are divided, we know that the map-side tasks and the reduce-side tasks are not in the same stage: the map tasks are in the ShuffleMapStage, the reduce tasks are in the ResultStage, and the map tasks run first. How, then, do the reduce tasks that run later know where to pull the data that the map tasks have written?

The data pull process on the reduce side is as follows:

(1) After a map task finishes, its execution status and the locations of the small disk files are encapsulated in a MapStatus object; the MapOutputTrackerWorker object in that process then sends the MapStatus object to the MapOutputTrackerMaster object in the Driver process;

(2) Before a reduce task starts executing, the MapOutputTrackerWorker in its process first sends a request to the MapOutputTrackerMaster in the Driver process, asking for the locations of the small disk files;

(3) When all map tasks have finished, the MapOutputTrackerMaster in the Driver process holds the locations of all the small disk files. At that point the MapOutputTrackerMaster tells the MapOutputTrackerWorker where the small disk files are;

(4) After the previous steps are completed, the BlockTransferService pulls the data from the node where the corresponding Executor is located. By default it starts five sub-threads, and the amount of data pulled at a time cannot exceed 48 MB (each reduce task pulls at most 48 MB of data at a time and stores the pulled data in 20% of the Executor's memory).
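
These two limits correspond to configuration values; a hedged sketch (spark.reducer.maxSizeInFlight defaults to 48m, while the 20% figure corresponds to the legacy spark.shuffle.memoryFraction setting of older Spark versions):

import org.apache.spark.SparkConf

val fetchConf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")   // maximum amount of data a reduce task pulls at a time
  .set("spark.shuffle.memoryFraction", "0.2")    // legacy setting: fraction of executor memory used for shuffle aggregation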

2. HashShuffle Analysis

The following discussion assumes that each Executor has 1 CPU core.

2.1 Unoptimized HashShuffleManager

The shuffle write stage mainly happens after one stage finishes its computation and a shuffle-style operator (such as reduceByKey) has to be executed for the next stage: the data processed by each task is "partitioned" by key for the next stage. "Partitioning" means applying a hash function to the keys so that records with the same key are written into the same disk file, and each disk file belongs to exactly one task of the downstream stage. Before being written to disk, the data is first written into a memory buffer; when the buffer is full, it is spilled to the disk file.

Each task of the current stage has to create as many disk files as there are tasks in the next stage. For example, if the next stage has 100 tasks in total, each task of the current stage has to create 100 disk files. If the current stage has 50 tasks, with 10 Executors in total and each Executor running 5 tasks, then 500 disk files are created on each Executor and 5,000 disk files across all Executors. This shows that the number of disk files generated by unoptimized shuffle write is staggering.

The shuffle read stage is usually what happens at the beginning of a stage. At this point, each task of that stage needs to pull all records with its keys from the computation results of the previous stage, over the network from every node, to the node it runs on, and then perform key-based aggregation or join operations. Since during shuffle write each map task creates one disk file for every reduce task of the downstream stage, during shuffle read each reduce task only needs to pull its own disk file from the nodes where all the map tasks of the upstream stage ran.

The pulling in shuffle read is done while aggregating. Each shuffle read task has its own buffer and can only pull data of the same size as this buffer at a time; the pulled data is then aggregated through a Map in memory. After one batch of data has been aggregated, the next batch is pulled into the buffer and aggregated, and so on, until all the data has been pulled and the final result is obtained.

The working principle of the unoptimized HashShuffleManager is shown in the following figure:

2.2 Optimized HashShuffleManager

To optimize HashShuffleManager we can set the parameter spark.shuffle.consolidateFiles. Its default value is false; setting it to true enables the optimization mechanism. Generally speaking, if HashShuffleManager is used, it is recommended to enable this option.
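
A sketch of enabling it (meaningful only on older Spark 1.x releases that still ship HashShuffleManager):

import org.apache.spark.SparkConf

val hashShuffleConf = new SparkConf()
  .set("spark.shuffle.manager", "hash")             // select HashShuffleManager (pre-2.0 Spark only)
  .set("spark.shuffle.consolidateFiles", "true")    // enable the consolidate optimization (default: false)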

With the consolidate mechanism enabled, during shuffle write a task no longer creates one disk file for every task of the downstream stage. Instead there is the concept of a shuffleFileGroup: each shuffleFileGroup corresponds to a batch of disk files, and the number of disk files equals the number of tasks in the downstream stage. An Executor can run as many tasks in parallel as it has CPU cores, and each task in the first batch of parallel tasks creates a shuffleFileGroup and writes its data into the corresponding disk files.

When the Executor's CPU cores finish one batch of tasks and move on to the next batch, the new tasks reuse the shuffleFileGroups, including their disk files, created by the previous tasks. In other words, these tasks write their data into the existing disk files instead of creating new ones. The consolidate mechanism therefore lets different tasks reuse the same batch of disk files, effectively merging the disk files of multiple tasks to some degree, which greatly reduces the number of disk files and improves shuffle write performance.

Assume the second stage has 100 tasks, the first stage has 50 tasks, there are still 10 Executors (each with 1 CPU core), and each Executor runs 5 tasks. With the unoptimized HashShuffleManager, each Executor generates 500 disk files and all Executors generate 5,000 disk files. After the optimization, the number of disk files created by each Executor is: number of CPU cores * number of tasks in the next stage. That is, each Executor now creates only 100 disk files, and all Executors create only 1,000 disk files.
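
A tiny sketch of the two file-count formulas with the numbers used above:

val executors = 10
val coresPerExecutor = 1
val mapTasksPerExecutor = 5
val reduceTasks = 100                                             // tasks of the next stage

val unoptimizedPerExecutor = mapTasksPerExecutor * reduceTasks    // 500
val consolidatedPerExecutor = coresPerExecutor * reduceTasks      // 100
println(unoptimizedPerExecutor * executors)                       // 5000 files in total without consolidation
println(consolidatedPerExecutor * executors)                      // 1000 files in total with consolidation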

The working principle of the optimized HashShuffleManager is shown in the figure below:

3. SortShuffle Analysis

The operating mechanism of SortShuffleManager is mainly divided into two kinds: one is the normal operating mechanism, the other is the bypass operating mechanism. When the number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default), the bypass mechanism is enabled.
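
The threshold itself is configurable; a minimal sketch:

import org.apache.spark.SparkConf

val sortShuffleConf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")   // default 200; bypass is used when the downstream partition count does not exceed this value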

3.1 General operating mechanism

In this mode, data is first written into an in-memory data structure; depending on the shuffle operator, different data structures may be chosen. For aggregation-style shuffle operators such as reduceByKey, a Map structure is used and the data is aggregated in the map while it is written to memory; for ordinary shuffle operators such as join, an Array structure is chosen and the data is written into memory directly. Then, every time a record is written into the in-memory data structure, a check is made to see whether a critical threshold has been reached. If it has, an attempt is made to spill the data in the in-memory structure to disk, after which the in-memory structure is cleared.

Before spilling to a disk file, the data already in the in-memory structure is sorted by key. After sorting, the data is written to the disk file in batches; the default batch size is 10,000 records, i.e., the sorted data is written to the disk file in batches of 10,000 records each. Writing to the disk file is done through Java's BufferedOutputStream, a buffered output stream that first buffers the data in memory and writes it to the disk file only when the memory buffer is full, which reduces the number of disk I/Os and improves performance.
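
To illustrate the idea only (this is not Spark's actual spill code; the file name and toy data are placeholders): sort in memory, then write in batches through a BufferedOutputStream:

import java.io.{BufferedOutputStream, FileOutputStream}

val records = Array((3, "c"), (1, "a"), (2, "b"))                       // toy in-memory data
val sorted = records.sortBy(_._1)                                       // sort by key before spilling
val out = new BufferedOutputStream(new FileOutputStream("spill.tmp"))   // hypothetical spill file
try {
  sorted.grouped(10000).foreach { batch =>                              // write in batches of up to 10,000 records
    batch.foreach { case (k, v) => out.write(s"$k,$v\n".getBytes("UTF-8")) }
  }
} finally {
  out.close()                                                           // flushes the buffer to disk
}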

While a task writes all of its data into the in-memory data structure, several disk spills occur, producing multiple temporary files. Finally all the previous temporary disk files are merged; this is the merge process, during which the data of all previous temporary disk files is read and written, in order, into one final disk file. In addition, since one task corresponds to only one disk file, meaning that all the data this task prepared for the tasks of the downstream stage is in that single file, a separate index file is also written, recording the start offset and end offset of each downstream task's data within the file.

SortShuffleManager has a disk file merge process, which greatly reduces the number of files. For example, if the first stage has 50 tasks, with 10 Executors in total and each Executor running 5 tasks, and the second stage has 100 tasks: since each task ends up with only one disk file, there are only 5 disk files on each Executor and only 50 disk files across all Executors.

The working principle of SortShuffleManager under the normal operating mechanism is shown in the figure below:

3.2 Bypass operating mechanism

The triggering conditions of bypass operation mechanism are as follows:

(1) The number of shuffle read tasks (downstream partitions) is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter
(2) The operator is not an aggregation-style shuffle operator

In this case, each task creates a temporary disk file for every downstream task, hashes the records by key, and writes each record into the disk file corresponding to its key's hash value. Of course, when writing to a disk file, the data is first written into a memory buffer and spilled to the disk file only after the buffer is full. Finally, all temporary disk files are merged into one disk file and a separate index file is created.

The disk write mechanism of this process is actually exactly the same as that of the unoptimized HashShuffleManager, since a staggering number of disk files are still created; the only difference is that a merge of the disk files is performed at the end. Therefore, the small number of final disk files makes shuffle read perform better than with the unoptimized HashShuffleManager.

The difference between this mechanism and the normal SortShuffleManager operating mechanism is:

(1) Different disk write mechanism
(2) No sorting

In other words, the biggest benefit of enabling this mechanism is that no sorting of the data is needed during shuffle write, which saves that part of the performance overhead.

The working principle of SortShuffleManager under the bypass operating mechanism is shown in the figure below:
