SortShuffleManager: the bypass operating mechanism

The bypass operating mechanism

The following figure illustrates the principle of the bypass SortShuffleManager. The bypass operating mechanism is triggered when both of the following conditions are met (a minimal sketch of this check follows the list):

  • The number of shuffle read tasks (downstream tasks) does not exceed the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).
  • The shuffle operator is not an aggregation-style operator (such as reduceByKey), i.e. no map-side aggregation is required.

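The decision can be pictured as the small predicate below. This is a hypothetical sketch that mirrors the two conditions above, not Spark's actual source code; the object, function, and parameter names are illustrative.

```scala
// Hypothetical sketch of the bypass decision; mirrors the two conditions above.
// Names are illustrative and do not come from Spark's source.
object BypassDecision {
  def shouldUseBypass(numReducePartitions: Int,
                      mapSideCombine: Boolean,
                      bypassMergeThreshold: Int = 200): Boolean = {
    // Condition 1: the shuffle must not need map-side aggregation.
    // Condition 2: the number of downstream partitions must stay within the threshold.
    !mapSideCombine && numReducePartitions <= bypassMergeThreshold
  }

  def main(args: Array[String]): Unit = {
    println(shouldUseBypass(numReducePartitions = 100, mapSideCombine = false)) // true:  bypass applies
    println(shouldUseBypass(numReducePartitions = 500, mapSideCombine = false)) // false: too many partitions
    println(shouldUseBypass(numReducePartitions = 100, mapSideCombine = true))  // false: e.g. reduceByKey
  }
}
```
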
In that case, each map task creates a temporary disk file for every downstream task, computes the hash of each record's key, and writes the record into the disk file that corresponds to that hash value. Of course, a record is first written to an in-memory buffer, and only when the buffer fills up is it spilled to the disk file. Finally, all of the temporary disk files are merged into a single disk file, and a separate index file is created (a simplified sketch of this write-and-merge flow follows).

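The toy program below illustrates this write path under the stated assumptions: one buffered temp file per downstream partition, records routed by the hash of their key, and a final concatenation into one data file plus an index of partition offsets. It is a simplified, self-contained sketch, not Spark's implementation.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.file.Files

// Toy illustration of the bypass-style write path: one temp file per downstream
// partition, records routed by the hash of their key, then all temp files merged
// into a single data file plus an index of partition offsets.
// Simplified sketch only; this is not Spark's implementation.
object BypassWriteSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    val records = Seq("apple" -> 1, "banana" -> 2, "cherry" -> 3, "date" -> 4, "fig" -> 5)

    // One buffered temp file per downstream partition.
    val tempFiles = Array.fill(numPartitions)(File.createTempFile("shuffle-temp-", ".dat"))
    val writers = tempFiles.map(f => new BufferedOutputStream(new FileOutputStream(f)))

    // Route each record to a partition by the hash of its key.
    records.foreach { case (key, value) =>
      val partition = Math.floorMod(key.hashCode, numPartitions)
      writers(partition).write(s"$key,$value\n".getBytes("UTF-8"))
    }
    writers.foreach { w => w.flush(); w.close() }

    // Merge: concatenate the per-partition temp files into one data file and
    // record the byte offset at which each partition starts (the "index file").
    val dataFile = File.createTempFile("shuffle-data-", ".dat")
    val out = new FileOutputStream(dataFile)
    val offsets = new Array[Long](numPartitions + 1)
    tempFiles.zipWithIndex.foreach { case (temp, i) =>
      offsets(i + 1) = offsets(i) + temp.length()
      out.write(Files.readAllBytes(temp.toPath))
      temp.delete()
    }
    out.close()

    val indexFile = File.createTempFile("shuffle-index-", ".idx")
    Files.write(indexFile.toPath, offsets.mkString("\n").getBytes("UTF-8"))

    println(s"data file:  ${dataFile.getAbsolutePath} (${dataFile.length()} bytes)")
    println(s"index file: ${indexFile.getAbsolutePath}, offsets = ${offsets.mkString(", ")}")
  }
}
```

The index of offsets is what lets each downstream task read only its own byte range of the single merged file.
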
The disk-write process of this mechanism is actually identical to that of the unoptimized HashShuffleManager, since both create a startling number of disk files; the only difference is that a single merged disk file is produced at the end. The small number of final disk files therefore also gives this mechanism better shuffle read performance than the unoptimized HashShuffleManager (a rough file-count comparison follows).

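As a back-of-the-envelope comparison (the task counts below are assumptions chosen for illustration, not figures from the article): with 100 map tasks and 100 downstream tasks, the unoptimized HashShuffleManager keeps 100 × 100 files, while the bypass mechanism ends up with one data file plus one index file per map task.

```scala
// Back-of-the-envelope file-count comparison. The task counts are assumed
// purely for illustration and are not taken from the article.
object FileCountComparison {
  def main(args: Array[String]): Unit = {
    val mapTasks = 100    // assumed number of map tasks
    val reduceTasks = 100 // assumed number of downstream (reduce) tasks

    // Unoptimized HashShuffleManager: each map task keeps one file per reduce task.
    val hashShuffleFiles = mapTasks * reduceTasks

    // Bypass mechanism: the same temp files exist during the write, but each map
    // task finally produces a single data file plus a single index file.
    val bypassFinalFiles = mapTasks * 2

    println(s"unoptimized HashShuffleManager: $hashShuffleFiles final files")
    println(s"bypass SortShuffleManager:      $bypassFinalFiles final files (data + index)")
  }
}
```
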
This mechanism differs from the ordinary SortShuffleManager operating mechanism in two ways: first, the disk-write mechanism is different; second, no sorting is performed. In other words, the biggest advantage of enabling this mechanism is that the shuffle write process does not need to sort the data, which saves that part of the performance overhead (a small usage sketch is shown below).

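For completeness, here is a hypothetical end-to-end usage sketch of a shuffle that satisfies both bypass conditions: a non-aggregating operator (groupByKey) with fewer downstream partitions than the threshold. The configuration key and its default value are real; the rest of the example is illustrative, and whether the bypass path is actually taken is decided internally by Spark at runtime.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch: a shuffle that meets both bypass conditions
// (no map-side aggregation, downstream partition count below the threshold).
// Spark decides internally whether the bypass write path is actually used.
object BypassUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bypass-sketch")
      .master("local[*]")
      .config("spark.shuffle.sort.bypassMergeThreshold", "200") // default value, set explicitly for clarity
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(1 to 10000).map(i => (i % 50, i))

    // groupByKey performs no map-side aggregation, and 50 partitions is well
    // below the threshold, so the bypass write path can apply to this shuffle.
    val grouped = pairs.groupByKey(50)
    println(grouped.mapValues(_.size).take(5).mkString(", "))

    spark.stop()
  }
}
```
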
Origin www.cnblogs.com/sunpengblog/p/11915439.html