Spark shuffle process source code analysis

Foreword

To better understand Spark's shuffle, this article walks through the source code to show exactly how the shuffle write path executes and how sorting is involved.

The Spark version used in this article is 2.4.4.

1、Shuffle: BypassMergeSortShuffleWriter

Basic principle:

1. The upstream map task creates as many file writers as there are downstream reduce partitions (partitionWriters, one per reduce partition), and the data destined for each downstream partition is written to its own independent file. After all partition files have been written, they are merged into a single file, as the following code shows:

while (records.hasNext()) {
  final Product2<K, V> record = records.next();
  final K key = record._1();
  // Author's note: write each record to the file of its target partition.
  partitionWriters[partitioner.getPartition(key)].write(key, record._2());
}

for (int i = 0; i < numPartitions; i++) {
  final DiskBlockObjectWriter writer = partitionWriters[i];
  partitionWriterSegments[i] = writer.commitAndGet();
  writer.close();
}

File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
File tmp = Utils.tempFileWith(output);
try {
  // Author's note: merge the per-partition files into one large file, so that
  // records belonging to the same partition are stored contiguously.
  partitionLengths = writePartitionedFile(tmp);
  // Author's note: build the index file for the merged data file.
  shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally {
  if (tmp.exists() && !tmp.delete()) {
    logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
  }
}

2. Since each per-partition file already contains data for exactly one reduce partition, no sorting is needed when merging: the files are simply concatenated in partition order into one file, and the corresponding partition index file is created.
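
To make the concatenation-plus-index idea concrete, here is a minimal standalone sketch (the object and function names are hypothetical, and this is not Spark's writePartitionedFile): each per-partition temporary file is appended to a single data file, and the byte length of every partition is recorded, which is exactly the information the index file needs.

import java.io.{File, FileInputStream, FileOutputStream}

// Hypothetical sketch: concatenate per-partition files into one data file
// and collect per-partition byte lengths for an index.
object ConcatSketch {
  def concatenate(partitionFiles: Seq[File], output: File): Array[Long] = {
    val lengths = new Array[Long](partitionFiles.length)
    val out = new FileOutputStream(output)
    try {
      partitionFiles.zipWithIndex.foreach { case (f, i) =>
        if (f.exists()) {
          val in = new FileInputStream(f)
          try {
            val buf = new Array[Byte](8192)
            var read = in.read(buf)
            while (read != -1) {
              out.write(buf, 0, read)
              lengths(i) += read
              read = in.read(buf)
            }
          } finally in.close()
        }
      }
    } finally out.close()
    // An index file would store the cumulative offsets derived from `lengths`,
    // so a reducer can seek directly to its partition's byte range.
    lengths
  }
}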

3. The conditions for using BypassMergeSortShuffleWriter are:

        (1) the number of downstream partitions does not exceed spark.shuffle.sort.bypassMergeThreshold (default 200);

        (2) no map-side pre-aggregation is required (operators with map-side combine, such as reduceByKey, are excluded).

        The specific check in the code is as follows:

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    false
  } else {
    val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
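
The threshold is an ordinary configuration entry, so the two inputs of this check are easy to inspect for a given job. A small illustrative snippet, runnable in spark-shell; the partition count is a made-up example value:

import org.apache.spark.SparkConf

// Reads the same configuration key, with the same default, that shouldBypassMergeSort uses above.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "200")
val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)

val numPartitions = 150                                    // example downstream partition count
val wouldBypass = numPartitions <= bypassMergeThreshold    // true: the bypass writer is eligible
// (provided, as the code above shows, that no map-side combine is required)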

2、Shuffle: SortShuffleWriter

Conditions for using this writer:

(1) BypassMergeSortShuffleWriter cannot be used, for example because the number of downstream partitions exceeds spark.shuffle.sort.bypassMergeThreshold (default 200) or because map-side aggregation is required;

(2) UnsafeShuffleWriter cannot be used either (see section 3 for its conditions).

Description of the execution process:

1. If the operator performs map-side pre-aggregation (e.g. reduceByKey)

(1) A map of type PartitionedAppendOnlyMap is used to store the data and apply the pre-aggregation, as shown in the code below.

        Note: in map.changeValue, the key being updated is not the raw record key but the composite key (getPartition(kv._1), kv._1), i.e. the record key prefixed with its partition id. This is done so that, when the data is later spilled to disk, it can be sorted by partition id, which guarantees that records of the same partition are stored contiguously.

// Author's note: check whether map-side pre-aggregation is required.
if (shouldCombine) {
  // Combine values in-memory first using our AppendOnlyMap
  // Author's note: get the combine functions of the aggregator.
  val mergeValue = aggregator.get.mergeValue
  val createCombiner = aggregator.get.createCombiner
  var kv: Product2[K, V] = null
  val update = (hadValue: Boolean, oldValue: C) => {
    if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
  }
  while (records.hasNext) {
    addElementsRead()
    kv = records.next()
    // Author's note: a map of type PartitionedAppendOnlyMap stores the data
    // and applies the pre-aggregation update.
    map.changeValue((getPartition(kv._1), kv._1), update)
    // Author's note: spill to disk if necessary.
    maybeSpillCollection(usingMap = true)
  }
}
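
The effect of keying the map by (partition id, key) can be pictured with a plain Scala collection. This is only a toy model of the idea, not PartitionedAppendOnlyMap itself; the partitioner and the sum-style combine function are assumptions chosen for illustration.

import scala.collection.mutable

// Toy model of map-side pre-aggregation keyed by (partitionId, key).
val numPartitions = 4
def getPartition(key: String): Int = math.abs(key.hashCode) % numPartitions

val records = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)

// The composite key mirrors map.changeValue((getPartition(kv._1), kv._1), update).
val combined = mutable.Map.empty[(Int, String), Int]
records.foreach { case (k, v) =>
  val compositeKey = (getPartition(k), k)
  // getOrElse plays the role of createCombiner; + plays the role of mergeValue.
  combined(compositeKey) = combined.getOrElse(compositeKey, 0) + v
}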

(2) The spill to disk is performed by maybeSpillCollection, the code of which is as follows:

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  // Author's note: usingMap indicates whether the pre-aggregating map is in use.
  if (usingMap) {
    // Author's note: pre-aggregation case: spill the map's data to disk if needed.
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    // Author's note: no pre-aggregation: spill the buffer's data to disk if needed.
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }
  // ... (peak-memory bookkeeping omitted)
}

        maybeSpill is then executed to decide, according to the spill conditions, whether the data should be spilled to disk. The code is as follows:

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Claim up to double our current memory from the shuffle memory pool
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    shouldSpill = currentMemory >= myMemoryThreshold
  }

  // Author's note: a spill is triggered by either of two conditions:
  // 1. shouldSpill: not enough memory could be acquired for the collection to keep growing;
  // 2. _elementsRead > numElementsForceSpillThreshold: the number of records read so far
  //    exceeds the threshold numElementsForceSpillThreshold (default Integer.MAX_VALUE).
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}
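
The memory-growth arithmetic above is easier to follow with concrete numbers. Below is a simplified standalone sketch of just that decision; the initial threshold value and the acquire function (which pretends every request is fully granted) are assumptions for the example, not Spark's memory manager.

// Simplified sketch of the threshold-growth logic in maybeSpill above.
var myMemoryThreshold = 5L * 1024 * 1024          // hypothetical initial threshold (5 MB)
def acquire(bytes: Long): Long = bytes            // hypothetical: request always fully granted

def shouldSpill(currentMemory: Long): Boolean = {
  if (currentMemory >= myMemoryThreshold) {
    // Ask for enough memory to let the collection double beyond the current threshold.
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    myMemoryThreshold += acquire(amountToRequest)
  }
  // Spill only if, even after the request, the collection still exceeds the threshold.
  currentMemory >= myMemoryThreshold
}

// Example: an 8 MB collection requests 2*8 - 5 = 11 MB, the threshold grows to 16 MB,
// and no spill happens. Had acquire returned 0, 8 MB >= 5 MB would force a spill.
val spillNow = shouldSpill(8L * 1024 * 1024)      // false when the grant succeeds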

        If the conditions are met, spill(collection) is executed to spill the data, the code is as follows:

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  spills += spillFile
}

        Take a look at this line of code:

val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)

        This line sorts the in-memory data. How is it sorted? Stepping through it in a debugger shows that the sort is not performed on the record keys but on the partition ids, which guarantees that records of the same partition end up stored contiguously and lays the groundwork for the subsequent merge of the spill files.
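
To picture what "sort by partition id only" means, here is a minimal sketch (not Spark's WritablePartitionedPairCollection or its comparator; the record layout is simplified to ((partitionId, key), value) tuples):

// Sketch of a partition-only comparator: only the partition id is compared,
// so keys within the same partition are left unsorted.
val partitionComparator: Ordering[((Int, String), Int)] =
  Ordering.by { case ((partitionId, _), _) => partitionId }

val inMemory = Array(((2, "x"), 1), ((0, "z"), 7), ((1, "y"), 3), ((0, "w"), 5))
val spillOrder = inMemory.sorted(partitionComparator)
// spillOrder: ((0,"z"),7), ((0,"w"),5), ((1,"y"),3), ((2,"x"),1)
// Records are grouped by partition id, ready to be written partition by partition.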

        At this point the spill of data to disk is complete. The next step is how the spilled data is merged.

(3) The spilled disk files are merged into one large file and a partition index file is built. The relevant code (in SortShuffleWriter) is shown below; its internal execution is not covered in further detail here.

try {
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  // Author's note: merge the spilled files and the data still in memory,
  // so that records of the same partition are stored contiguously.
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
  // Author's note: build the index file.
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
  if (tmp.exists() && !tmp.delete()) {
    logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
  }
}
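
For context on why the index file matters, the sketch below shows how a reader could use offsets derived from partitionLengths to pull out exactly one reduce partition's bytes from the merged data file. The function readPartitionBytes is hypothetical; the real read path goes through IndexShuffleBlockResolver and the block transfer service rather than direct file access.

import java.io.RandomAccessFile

// Hypothetical sketch: given per-partition byte lengths, compute the offset and
// read one reduce partition's byte range out of the merged data file.
def readPartitionBytes(dataFile: String, partitionLengths: Array[Long], reduceId: Int): Array[Byte] = {
  val offset = partitionLengths.take(reduceId).sum   // cumulative offset, as stored in an index file
  val length = partitionLengths(reduceId).toInt
  val raf = new RandomAccessFile(dataFile, "r")
  try {
    raf.seek(offset)
    val bytes = new Array[Byte](length)
    raf.readFully(bytes)
    bytes
  } finally raf.close()
}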

2. If the operator does not perform map-side pre-aggregation (e.g. groupByKey)

The shuffle write path of a map-side pre-aggregation operator was described above. For operators without map-side pre-aggregation the process is basically the same; the only difference is that the structure storing the data is not map: PartitionedAppendOnlyMap but buffer: PartitionedPairBuffer, as the following code shows:

if (shouldCombine) {
  // Combine values in-memory first using our AppendOnlyMap
  val mergeValue = aggregator.get.mergeValue
  val createCombiner = aggregator.get.createCombiner
  var kv: Product2[K, V] = null
  val update = (hadValue: Boolean, oldValue: C) => {
    if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
  }
  while (records.hasNext) {
    addElementsRead()
    kv = records.next()
    map.changeValue((getPartition(kv._1), kv._1), update)
    maybeSpillCollection(usingMap = true)
  }
} else {
  // Stick values into our buffer
  // Author's note: without pre-aggregation, records are simply appended to the buffer.
  while (records.hasNext) {
    addElementsRead()
    val kv = records.next()
    buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
    maybeSpillCollection(usingMap = false)
  }
}

The subsequent steps of spilling, sorting, and merging the spill files are exactly the same as in the pre-aggregation case and go through the same code path, so they are not repeated here.
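
For completeness, the merge itself can be pictured as follows. This is a self-contained sketch assuming each spill is exposed as an iterator of (partitionId, record) pairs sorted by partition id; Spark's real merge additionally combines values per key when an aggregator is defined.

// Sketch of the merge step: every spill stream is already sorted by partition id,
// so a partition-contiguous output only requires walking the partitions in order
// and draining each stream's run for that partition.
def mergeByPartition[T](spills: Seq[Iterator[(Int, T)]], numPartitions: Int): Seq[(Int, T)] = {
  val streams = spills.map(_.buffered)
  val merged = scala.collection.mutable.ArrayBuffer.empty[(Int, T)]
  for (pid <- 0 until numPartitions; stream <- streams) {
    // Records of `pid` form a contiguous run at the front of each stream.
    while (stream.hasNext && stream.head._1 == pid) merged += stream.next()
  }
  merged.toSeq
}

// Example with two "spill files", each sorted by partition id:
val spillA = Iterator((0, "a1"), (1, "b1"), (2, "c1"))
val spillB = Iterator((0, "a2"), (2, "c2"))
val merged = mergeByPartition(Seq(spillA, spillB), numPartitions = 3)
// merged: (0,a1), (0,a2), (1,b1), (2,c1), (2,c2)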

3、Shuffle: UnsafeShuffleWriter

The author did not trace the detailed execution of UnsafeShuffleWriter. As the name suggests, it relies on Unsafe/off-heap memory for data storage and related operations: the basic principle is to serialize the records, store them in off-heap memory, and then sort the binary data directly, which improves computing performance.
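
The key idea behind sorting serialized data can be shown with a small sketch, loosely modeled on Spark's PackedRecordPointer: a partition id and a record location are packed into a single Long, and only those Longs are sorted, never the serialized records themselves. The bit layout below is illustrative, not Spark's exact encoding.

// Sketch: encode (partitionId, recordOffset) into one Long and sort the Longs.
def pack(partitionId: Int, offset: Long): Long =
  (partitionId.toLong << 40) | (offset & ((1L << 40) - 1))

def partitionOf(packed: Long): Int = (packed >>> 40).toInt

val pointers = Array(pack(3, 1024L), pack(0, 4096L), pack(1, 0L), pack(0, 128L))

// Sorting the packed longs orders records by partition id without ever
// deserializing them; the record bytes stay untouched in memory.
java.util.Arrays.sort(pointers)
val order = pointers.map(partitionOf)   // Array(0, 0, 1, 3)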

In the actual execution flow, this writer is the first one considered for the shuffle write. The conditions for using it are as follows:

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  // Author's note: the serializer must support relocation of serialized objects.
  // Spark currently ships two serializers, JavaSerializer and KryoSerializer;
  // KryoSerializer supports relocation, JavaSerializer does not.
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.mapSideCombine) { // Author's note: map-side pre-aggregation must not be required.
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
      s"map-side aggregation")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    // Author's note: the number of downstream partitions must not exceed
    // MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (MAXIMUM_PARTITION_ID = (1 << 24) - 1).
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}
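
A practical consequence of the first check: for an RDD job to be eligible for this writer, the serializer must support relocation, which means using KryoSerializer instead of the default JavaSerializer. A minimal configuration example:

import org.apache.spark.SparkConf

// Configure Kryo so that supportsRelocationOfSerializedObjects is true and the
// serialized (Unsafe) shuffle path is not ruled out by the first check above.
val conf = new SparkConf()
  .setAppName("unsafe-shuffle-example")   // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")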

The author did not trace the detailed execution logic any further, since what shows up while stepping through it is mostly binary data that cannot be inspected intuitively. Interested readers can debug and trace it themselves.

Origin blog.csdn.net/chenzhiang1/article/details/126834574