Spark source code reading: the shuffle read process

Shuffle read source code analysis

In the last article we analyzed the shuffle write process on the map side. To briefly review: a ShuffleMapTask sorts the computed records in memory by partition and key; because memory is limited, the data may spill to multiple disk files; at the end, all spill files and the remaining in-memory data are merge-sorted and written out as one data file, while the offset of each (reduce-side) partition within that file is recorded, and the mapping between partitions and offsets is written into an index file.
Having reviewed the write process, we cannot help asking: how exactly does the reduce stage read its data, and when does the read happen?

Let us answer the second question first: when does the read happen? We know that the RDD computation chain is cut into stages at shuffle boundaries. A stage generally begins by reading the output of the previous stage, and that read is the reduce side of the shuffle. The data then flows through the stage's computation chain, and the results are written to disk for the next stage to read; that write is the map output process, which we have already analyzed. In this article we look at how the reduce side of a stage reads its data.

After this long preamble, the entry point for reading data brings us back to ShuffleMapTask. Only part of the code is shown here:

  // The shuffle manager
  val manager = SparkEnv.get.shuffleManager
  // Get a shuffle writer
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
  // The core of the RDD computation is the iterator method.
  // SortShuffleWriter's write method can be broken into several steps:
  // write the data computed by the upstream RDD (obtained via rdd.iterator) into an in-memory buffer;
  // while writing, spill to disk files whenever the memory threshold is exceeded, possibly producing several files;
  // finally, merge-sort the spill files together with the remaining in-memory data and write one large data file.
  // The sort is by partition first, then by key.
  // During the final merge-sorted write, the output is flushed after each partition, and the offset of that
  // partition's data within the file is recorded.
  // So after a task finishes writing, there are two files on disk: the data file and an index file
  // recording the offset of each reduce-side partition.
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  // Mainly deletes the intermediate spill files and releases the requested memory back to the memory manager
  writer.stop(success = true).get

The data is actually read by rdd.iterator(partition, context).
The iterator method mainly handles RDD caching: if the RDD is cached, the data is read from the cache (through the BlockManager); if there is no cache, the actual computation is performed. In the end RDD.compute is called to do the real work. This is an abstract method whose concrete logic is implemented by subclasses, so the transform operations a user applies to an RDD are ultimately reflected in the compute method.
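For reference, the dispatch inside RDD.iterator looks roughly like the following (a paraphrased excerpt of the Spark source, shown only to illustrate the cache-or-compute decision, not a standalone runnable snippet):

    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        // The RDD is marked for caching: try the BlockManager first, compute on a cache miss
        getOrCompute(split, context)
      } else {
        // Not cached: read checkpointed data if available, otherwise call the subclass's compute()
        computeOrReadCheckpoint(split, context)
      }
    }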
On the other hand, we know that operators such as map and filter are not shuffle operations and do not split stages, so to see the shuffle read path we need a shuffle-type operation. Let us look at RDD.groupBy, which ultimately calls groupByKey.
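For orientation, here is a tiny self-contained example (a hypothetical demo, not part of the original article) whose groupByKey call produces exactly the kind of ShuffledRDD whose read path we trace below:

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyShuffleDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("groupByKey-demo").setMaster("local[2]"))
        // groupByKey introduces a shuffle dependency: the resulting ShuffledRDD's compute()
        // goes through SortShuffleManager.getReader and BlockStoreShuffleReader.read,
        // which is the path analyzed in the rest of this article.
        val grouped = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2).groupByKey()
        grouped.collect().foreach(println)
        sc.stop()
      }
    }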

RDD.groupByKey

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
  createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

This ultimately calls combineByKeyWithClassTag.

RDD.combineByKeyWithClassTag

It performs some validation to reject illegal cases, deals with the partitioner, and finally returns a ShuffledRDD, so next we analyze ShuffledRDD's compute method.

def combineByKeyWithClassTag[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C,
  partitioner: Partitioner,
  mapSideCombine: Boolean = true,
  serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
// If the key is an Array type, map-side combining is not supported,
// and HashPartitioner is not supported either
if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
  if (partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
}
// The aggregator used to combine the data
val aggregator = new Aggregator[K, V, C](
  self.context.clean(createCombiner),
  self.context.clean(mergeValue),
  self.context.clean(mergeCombiners))
// If the partitioner is the same as the existing one, no shuffle is needed
if (self.partitioner == Some(partitioner)) {
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
} else {
  // Return a ShuffledRDD
  new ShuffledRDD[K, V, C](self, partitioner)
    .setSerializer(serializer)
    .setAggregator(aggregator)
    .setMapSideCombine(mapSideCombine)
}
}

ShuffledRDD.compute

It obtains a reader through the shuffleManager; the data-reading logic lives in that reader.

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
// Get a reader through the shuffleManager
SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
  .read()
  .asInstanceOf[Iterator[(K, C)]]
}

SortShuffleManager.getReader

Not much to say here; it simply constructs a BlockStoreShuffleReader, so we look at that class directly.

override def getReader[K, C](
  handle: ShuffleHandle,
  startPartition: Int,
  endPartition: Int,
  context: TaskContext): ShuffleReader[K, C] = {
new BlockStoreShuffleReader(
  handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
}

BlockStoreShuffleReader.read

Obviously, this method is the core. Its main steps are:

  • Get a wrapping iterator, ShuffleBlockFetcherIterator, whose elements are pairs of a blockId and the read stream for that block; this class is clearly the key to reading data in the reduce phase
  • Turn the raw read streams into deserialized iterators
  • Convert that into an iterator that keeps read metrics; this chain of conversions is very similar to the decorator pattern used for Java streams
  • Wrap the iterator in an interruptible iterator. Each record read checks whether the task has been killed; this is done to respond as promptly as possible to kill requests, for example a kill-task message sent from the driver
  • Aggregate the results with the aggregator. Note that the data structure used here, AppendOnlyMap, is the same one used when writing shuffle data in the previous stage: a hash table backed by a flat array that resolves collisions by linear probing (a minimal sketch of the idea follows this list)
  • Finally, sort the results
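To make the AppendOnlyMap idea concrete, here is a minimal sketch of an array-backed, linear-probing map (a toy illustration only; Spark's real AppendOnlyMap also handles null keys, grows the table, and can spill):

    // Toy illustration of the AppendOnlyMap idea: a flat object array used as an open-addressing
    // hash table with linear probing; keys sit at even indices, values at the following odd index.
    class TinyAppendOnlyMap[K, V](capacity: Int = 64) {
      private val data = new Array[AnyRef](2 * capacity)

      private def startPos(key: K): Int = (key.hashCode() & Int.MaxValue) % capacity

      // Insert the key or update its value using updateFunc(hadValue, oldValue).
      def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
        var i = startPos(key)
        var result: Option[V] = None
        while (result.isEmpty) {
          val curKey = data(2 * i)
          if (curKey == null) {
            // Empty slot: create and store a fresh value
            val newValue = updateFunc(false, null.asInstanceOf[V])
            data(2 * i) = key.asInstanceOf[AnyRef]
            data(2 * i + 1) = newValue.asInstanceOf[AnyRef]
            result = Some(newValue)
          } else if (curKey == key) {
            // Same key: combine with the existing value (this is where aggregation happens)
            val newValue = updateFunc(true, data(2 * i + 1).asInstanceOf[V])
            data(2 * i + 1) = newValue.asInstanceOf[AnyRef]
            result = Some(newValue)
          } else {
            i = (i + 1) % capacity // linear probing: try the next slot
          }
        }
        result.get
      }
    }

A call such as map.changeValue(key, (had, old) => if (had) mergeValue(old, v) else createCombiner(v)) mirrors how combineValuesByKey folds each record into the map.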

So clearly, the concrete logic of reading the shuffle data is hidden inside ShuffleBlockFetcherIterator. For reference, here is the full BlockStoreShuffleReader first:

    private[spark] class BlockStoreShuffleReader[K, C](
        handle: BaseShuffleHandle[K, _, C],
        startPartition: Int,
        endPartition: Int,
        context: TaskContext,
        serializerManager: SerializerManager = SparkEnv.get.serializerManager,
        blockManager: BlockManager = SparkEnv.get.blockManager,
        mapOutputTracker: MapOutputTracker = SparkEnv.get.mapOutputTracker)
      extends ShuffleReader[K, C] with Logging {
    
      private val dep = handle.dependency
    
      /** Read the combined key-values for this reduce task */
      override def read(): Iterator[Product2[K, C]] = {
        // Get a wrapping iterator whose elements are (blockId, input stream for that block) pairs
        val wrappedStreams = new ShuffleBlockFetcherIterator(
          context,
          blockManager.shuffleClient,
          blockManager,
          mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
          serializerManager.wrapStream,
          // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
          SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
          SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
          SparkEnv.get.conf.get(config.REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS),
          SparkEnv.get.conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM),
          SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))
    
        val serializerInstance = dep.serializer.newInstance()
    
        // Create a key/value iterator for each stream
        // Turn the raw read streams into deserialized key/value iterators
        val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
          // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
          // NextIterator. The NextIterator makes sure that close() is called on the
          // underlying InputStream when all records have been read.
          serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
        }
    
        // Update the context task metrics for each record read.
        val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
        // Convert into an iterator that keeps read metrics; this chain of wrappers is much like Java's stream decorators
        val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
          recordIter.map { record =>
            readMetrics.incRecordsRead(1)
            record
          },
          context.taskMetrics().mergeShuffleReadMetrics())
    
        // An interruptible iterator must be used here in order to support task cancellation
        // Every record read checks whether the task has been killed,
        // so that kill requests (for example a kill-task message from the driver) are honored as promptly as possible
        val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
    
        val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
          // Aggregate the results with the aggregator
          if (dep.mapSideCombine) {
            // We are reading values that are already combined
            val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
            dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
          } else {
            // We don't know the value type, but also don't care -- the dependency *should*
            // have made sure its compatible w/ this aggregator, which will convert the value
            // type to the combined type C
            val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
            dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
          }
        } else {
          require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
          interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
        }
    
        // Sort the output if there is a sort ordering defined.
        // Finally, sort the results
        dep.keyOrdering match {
          case Some(keyOrd: Ordering[K]) =>
            // Create an ExternalSorter to sort the data.
            val sorter =
              new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
            sorter.insertAll(aggregatedIter)
            context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
            context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
            context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
            CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
          case None =>
            aggregatedIter
        }
      }
    }

ShuffleBlockFetcherIterator

This class is fairly complex. Looking closely at its initialization, the class calls the initialize method when it is constructed.
We should also pay attention to the arguments passed to its constructor:

    val wrappedStreams = new ShuffleBlockFetcherIterator(
    context,
    // If the external shuffle service is not enabled, this is the BlockTransferService
    blockManager.shuffleClient,
    blockManager,
    // Use the mapOutputTracker component to get the physical locations of the data blocks for each partition
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
    serializerManager.wrapStream,
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    // Read a few configuration parameters
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
    SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
    SparkEnv.get.conf.get(config.REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS),
    SparkEnv.get.conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM),
    SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))

ShuffleBlockFetcherIterator.initialize

  • First, separate the local blocks from the remote blocks
  • Then start sending requests to fetch the remote data. This process is subject to several constraints: the total amount of data being fetched and the number of concurrent requests are limited, and the number of blocks fetched from each remote address at the same time is also limited, although that threshold defaults to Integer.MAX_VALUE
  • Finally, fetch the local block data

Fetching the local data is simple: the blocks are read from the local BlockManager, and the index file is used to locate the data of the requested partition inside the data file.
We will focus on the remote fetching part.
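As a reminder of how the index file written by the map side is used, here is a hypothetical helper for illustration (not Spark's IndexShuffleBlockResolver), assuming the layout used by the sort shuffle: the index file stores numPartitions + 1 long offsets, so reduce partition i occupies the byte range [offset(i), offset(i + 1)) of the data file.

    import java.io.{DataInputStream, FileInputStream}

    // Hypothetical sketch of locating one reduce partition inside a map output data file
    // by reading two adjacent offsets from the accompanying index file.
    object IndexFileSketch {
      def partitionOffsets(indexFile: String, reduceId: Int): (Long, Long) = {
        val in = new DataInputStream(new FileInputStream(indexFile))
        try {
          in.skipBytes(reduceId * 8) // skip the offsets of the preceding partitions (8 bytes each)
          val start = in.readLong()  // byte offset where this partition's data starts
          val end = in.readLong()    // offset where the next partition starts, i.e. where this one ends
          (start, end)
        } finally {
          in.close()
        }
      }
    }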

private[this] def initialize(): Unit = {
// Add a task completion callback (called in both success case and failure case) to cleanup.
// Add a callback to the TaskContext to do some cleanup when the task completes
context.addTaskCompletionListener(_ => cleanup())

// Split local and remote blocks.
// Separate the local blocks from the remote blocks
val remoteRequests = splitLocalRemoteBlocks()
// Add the remote requests into our queue in a random order
fetchRequests ++= Utils.randomize(remoteRequests)
assert ((0 == reqsInFlight) == (0 == bytesInFlight),
  "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
  ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

// Send out initial requests for blocks, up to our maxBytesInFlight
// Send the requests that fetch remote data
// Send as many requests as possible, but subject to some constraints:
// global constraints: the number of concurrent fetch RPCs and the total amount of data being fetched are limited
// per-address constraint: the number of blocks fetched from one remote address at the same time must not exceed a threshold
fetchUpToMaxBytes()

// Record how many requests have been sent; some requests may still be left unsent
val numFetches = remoteRequests.size - fetchRequests.size
logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

// Get Local Blocks
// Fetch the local block data
fetchLocalBlocks()
logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}

ShuffleBlockFetcherIterator.splitLocalRemoteBlocks

Let us first look at how local and remote data blocks are split. To summarize this method:

  • First, the maximum amount of data allowed in flight is divided by 5 and used as the size limit of a single fetch request. The point is to allow data to be pulled from up to five nodes in parallel: the network environment may be unstable, and fetching from several nodes at the same time reduces the impact of fluctuations on any single node, while the overall in-flight limit mainly keeps this machine's network traffic under control
  • Then loop over every node address (a BlockManagerId here)
  • If the address is the same as the local address, the corresponding blocks are local blocks
  • For remote blocks, split all blocks of each node into multiple requests (FetchRequest) according to the per-request data limit, so that no single request fetches too much data

      private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
      // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
      // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
      // nodes, rather than blocking on reading output from one node.
      // The request size is reduced to maxBytesInFlight / 5
      // so that data can be fetched in parallel, from at most 5 nodes at the same time
      val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
      logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize
        + ", maxBlocksInFlightPerAddress: " + maxBlocksInFlightPerAddress)
    
      // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
      // at most maxBytesInFlight in order to limit the amount of data in flight.
      val remoteRequests = new ArrayBuffer[FetchRequest]
    
      // Tracks total number of blocks (including zero sized blocks)
      // Track the total number of blocks
      var totalBlocks = 0
      for ((address, blockInfos) <- blocksByAddress) {
        totalBlocks += blockInfos.size
        // If the address is the same as the local BlockManager's, these are local blocks
        if (address.executorId == blockManager.blockManagerId.executorId) {
          // Filter out zero-sized blocks
          localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
          numBlocksToFetch += localBlocks.size
        } else {
          val iterator = blockInfos.iterator
          var curRequestSize = 0L
          var curBlocks = new ArrayBuffer[(BlockId, Long)]
          while (iterator.hasNext) {
            val (blockId, size) = iterator.next()
            // Skip empty blocks
            if (size > 0) {
              curBlocks += ((blockId, size))
              remoteBlocks += blockId
              numBlocksToFetch += 1
              curRequestSize += size
            } else if (size < 0) {
              throw new BlockException(blockId, "Negative block size " + size)
            }
            // If the per-request data limit or the per-address block limit is reached, create one request
            if (curRequestSize >= targetRequestSize ||
                curBlocks.size >= maxBlocksInFlightPerAddress) {
              // Add this FetchRequest
              remoteRequests += new FetchRequest(address, curBlocks)
              logDebug(s"Creating fetch request of $curRequestSize at $address "
                + s"with ${curBlocks.size} blocks")
              curBlocks = new ArrayBuffer[(BlockId, Long)]
              curRequestSize = 0
            }
          }
          // Add in the final request
          // Wrap-up: create one final request for the remaining blocks
          if (curBlocks.nonEmpty) {
            remoteRequests += new FetchRequest(address, curBlocks)
          }
        }
      }
      logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
      remoteRequests
      }

ShuffleBlockFetcherIterator.fetchUpToMaxBytes

Back in the initialize method: after splitting the local and remote blocks, we have a number of packaged fetch requests. These requests are put into a queue, and the next step is to send them through the RPC client.

The logic of this method is relatively simple and consists mainly of two loops: first the deferred requests in the queue are sent, then the normal requests. Deferred requests exist because a request could not be sent earlier, either because the amount of data in flight or the number of requests in flight had exceeded a threshold, so the request was put into a deferred queue; here the deferred queue is processed with priority. Before a request can be sent, the following conditions must all be satisfied:

  • The amount of data currently in flight must not exceed maxBytesInFlight (48m by default). One may ask: what if a single block is larger than maxBytesInFlight? The request is still sent in that case, provided no data is currently in flight: the check first tests bytesInFlight == 0, and when that holds it does not check whether this single request exceeds the threshold
  • The number of requests currently in flight must not exceed maxReqsInFlight (Int.MaxValue by default)
  • The number of blocks being fetched from each remote address at the same time is also limited (Int.MaxValue by default)
  • Finally, when a qualifying request is sent, if the amount of data it asks for exceeds maxReqSizeShuffleToMem, the fetched data is written to a temporary file on disk; the default value of this threshold is Long.MaxValue, so by default there is no such limit

      // Send the fetch requests
      // Send as many requests as possible, but subject to some constraints:
      // global constraints: the number of concurrent fetch RPCs and the total amount of data being fetched are limited
      // per-address constraint: the number of blocks fetched from one remote address at the same time must not exceed a threshold
      private def fetchUpToMaxBytes(): Unit = {
      // Send fetch requests up to maxBytesInFlight. If you cannot fetch from a remote host
      // immediately, defer the request until the next time it can be processed.
    
      // Process any outstanding deferred fetch requests if possible.
      if (deferredFetchRequests.nonEmpty) {
        for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
          while (isRemoteBlockFetchable(defReqQueue) &&
              !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
            val request = defReqQueue.dequeue()
            logDebug(s"Processing deferred fetch request for $remoteAddress with "
              + s"${request.blocks.length} blocks")
            send(remoteAddress, request)
            if (defReqQueue.isEmpty) {
              deferredFetchRequests -= remoteAddress
            }
          }
        }
      }
    
      // Process any regular fetch requests if possible.
      while (isRemoteBlockFetchable(fetchRequests)) {
        val request = fetchRequests.dequeue()
        val remoteAddress = request.address
        // If the per-address limit on simultaneously fetched blocks is exceeded, put this request into the deferred queue for a later round
        if (isRemoteAddressMaxedOut(remoteAddress, request)) {
          logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
          val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
          defReqQueue.enqueue(request)
          deferredFetchRequests(remoteAddress) = defReqQueue
        } else {
          send(remoteAddress, request)
        }
      }
    
      // Send one request and accumulate the number of requested blocks per address,
      // so that later rounds can check whether the per-address block count exceeds the threshold
      def send(remoteAddress: BlockManagerId, request: FetchRequest): Unit = {
        sendRequest(request)
        numBlocksInFlightPerAddress(remoteAddress) =
          numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0) + request.blocks.size
      }
    
      // This limit applies to all requests, regardless of which remote node they target:
      // check whether there is still headroom in the number of requests in flight
      // and in the amount of data in flight.
      // This mainly limits concurrency and network bandwidth usage
      def isRemoteBlockFetchable(fetchReqQueue: Queue[FetchRequest]): Boolean = {
        fetchReqQueue.nonEmpty &&
          (bytesInFlight == 0 ||
            (reqsInFlight + 1 <= maxReqsInFlight &&
              bytesInFlight + fetchReqQueue.front.size <= maxBytesInFlight))
      }
    
      // Checks if sending a new fetch request will exceed the max no. of blocks being fetched from a
      // given remote address.
      // Check whether the number of blocks being fetched exceeds the threshold;
      // each address has its own limit on the number of blocks fetched at the same time
      def isRemoteAddressMaxedOut(remoteAddress: BlockManagerId, request: FetchRequest): Boolean = {
        numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0) + request.blocks.size >
          maxBlocksInFlightPerAddress
      }
      }

ShuffleBlockFetcherIterator.next

From the analysis of the previous method we can see that not all fetch requests are sent out during initialization: some are held back by the thresholds and some are put into the deferred queue. When are these unsent requests sent? The answer lies in the next method. ShuffleBlockFetcherIterator is an iterator, so external callers retrieve elements through next, and it is easy to guess that next must also be where further fetch requests are sent.
To summarize next():

  • First, take one fetch result from the results queue (the results queue is a blocking queue; if there is no result yet, the caller blocks)
  • After getting a result, check whether it is a successful fetch or a failed one; if it failed, throw an exception directly (the retry logic is actually implemented in the RPC client, not here)
  • For a successful result, first update some task metrics and some internal bookkeeping, such as the amount of data currently in flight
  • Wrap the fetched byte buffer into a byte input stream
  • Wrap the stream once more with the function passed in from outside, which typically adds decompression and decryption
  • If the stream is compressed or encrypted and the block is relatively small, copy the stream, which forces decompression and decryption to happen right away so that corruption is exposed as early as possible
  • Finally, and crucially, send fetch requests again: after next has processed a result, the amount of data and the number of requests in flight may have decreased, which makes room for new requests

      override def next(): (BlockId, InputStream) = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
    
      numBlocksProcessed += 1
    
      var result: FetchResult = null
      var input: InputStream = null
      // Take the next fetched result and try to decompress it to detect data corruption,
      // then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch
      // is also corrupt, so the previous stage could be retried.
      // For local shuffle block, throw FailureFetchResult for the first IOException.
      while (result == null) {
        val startFetchWait = System.currentTimeMillis()
        result = results.take()
        val stopFetchWait = System.currentTimeMillis()
        shuffleMetrics.incFetchWaitTime(stopFetchWait - startFetchWait)
    
        result match {
          case r @ SuccessFetchResult(blockId, address, size, buf, isNetworkReqDone) =>
            if (address != blockManager.blockManagerId) {
              numBlocksInFlightPerAddress(address) = numBlocksInFlightPerAddress(address) - 1
              // Mainly update some metrics
              shuffleMetrics.incRemoteBytesRead(buf.size)
              if (buf.isInstanceOf[FileSegmentManagedBuffer]) {
                shuffleMetrics.incRemoteBytesReadToDisk(buf.size)
              }
              shuffleMetrics.incRemoteBlocksFetched(1)
            }
            bytesInFlight -= size
            if (isNetworkReqDone) {
              reqsInFlight -= 1
              logDebug("Number of requests in flight " + reqsInFlight)
            }
    
            // Wrap the byte buffer into a byte input stream
            val in = try {
              buf.createInputStream()
            } catch {
              // The exception could only be throwed by local shuffle block
              case e: IOException =>
                assert(buf.isInstanceOf[FileSegmentManagedBuffer])
                logError("Failed to create input stream from local block", e)
                buf.release()
                throwFetchFailedException(blockId, address, e)
            }
    
            // Wrap the stream once more with the externally supplied function, usually adding decompression and decryption
            input = streamWrapper(blockId, in)
            // Only copy the stream if it's wrapped by compression or encryption, also the size of
            // block is small (the decompressed block is smaller than maxBytesInFlight)
            // If the block is small and the stream is compressed or encrypted, copy the stream
            if (detectCorrupt && !input.eq(in) && size < maxBytesInFlight / 3) {
              val originalInput = input
              val out = new ChunkedByteBufferOutputStream(64 * 1024, ByteBuffer.allocate)
              try {
                // Decompress the whole block at once to detect any corruption, which could increase
                // the memory usage tne potential increase the chance of OOM.
                // TODO: manage the memory used here, and spill it into disk in case of OOM.
                Utils.copyStream(input, out)
                out.close()
                input = out.toChunkedByteBuffer.toInputStream(dispose = true)
              } catch {
                case e: IOException =>
                  buf.release()
                  if (buf.isInstanceOf[FileSegmentManagedBuffer]
                    || corruptedBlocks.contains(blockId)) {
                    throwFetchFailedException(blockId, address, e)
                  } else {
                    logWarning(s"got an corrupted block $blockId from $address, fetch again", e)
                    corruptedBlocks += blockId
                    fetchRequests += FetchRequest(address, Array((blockId, size)))
                    result = null
                  }
              } finally {
                // TODO: release the buf here to free memory earlier
                originalInput.close()
                in.close()
              }
            }
    
            // Fetch failed: throw an exception.
            // One might wonder: fetching block data surely has a retry mechanism, so why throw immediately on failure here?
            // The answer is that the retries are not implemented here but in the RPC client that sends the fetch requests,
            // so a failure that reaches this point has already survived the retries, and throwing directly is correct
          case FailureFetchResult(blockId, address, e) =>
            throwFetchFailedException(blockId, address, e)
        }
    
        // Send fetch requests up to maxBytesInFlight
        // Send fetch requests again: since some data has now been fetched successfully,
        // the amount of data in flight has decreased, which makes room for new requests
        fetchUpToMaxBytes()
      }
    
      currentResult = result.asInstanceOf[SuccessFetchResult]
      (currentResult.blockId, new BufferReleasingInputStream(input, this))
      }

To sum up

With this, our analysis of shuffle read is essentially complete. Overall, the main path is not very complicated, but there are many fine details, so the analysis above is somewhat fragmented. Let us distill the main logic of the whole process here to get a complete picture:

  • For a shuffle-type RDD, its compute method obtains a BlockStoreShuffleReader from the ShuffleManager to read the data of one reduce partition
  • BlockStoreShuffleReader reads the data in its read method. A reduce-side partition generally depends on the output of all map-side partitions, so its data is normally spread across multiple executors (note that an executor is uniquely identified by a BlockManagerId, and one physical node may run several executors), and each executor may hold several blocks. As mentioned in the analysis of the shuffle write process, each map task ultimately writes one data file and one index file, and the segment of that data file belonging to one reduce partition is one block
  • The complex logic of fetching the remote data is encapsulated in ShuffleBlockFetcherIterator, which finally wraps the fetched data into an iterator of streams
  • The stream of every block is then decorated layer by layer: deserialization, task metrics (number of records read), and an interruptible wrapper that allows cancellation per record
  • The data is aggregated
  • The aggregated data is sorted

So we can see that with the new shuffle mechanism, SortShuffleManager, the data that user code obtains from an RDD after a shuffle is sorted, provided a key ordering was specified.
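For example, a small illustration (assuming the SparkContext sc from the earlier demo): sortByKey sets a key ordering on the shuffle dependency, so BlockStoreShuffleReader.read takes the ExternalSorter branch and the iterator handed back to user code is sorted by key.

    // Assumes an existing SparkContext `sc` (e.g. from the groupByKey demo above).
    val sorted = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b"), 2).sortByKey()
    sorted.collect().foreach(println) // prints (1,a), (2,b), (3,c)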


Origin www.cnblogs.com/zhuge134/p/11032647.html