How Spark's BlockManager Implements Shuffle

Once an upstream operator has finished the write side of a shuffle, how does the downstream RDD get the shuffle data it needs?

First, when the upstream side writes its output, it writes to the BlockManager local to its executor.

val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths,
  writeMetrics.recordsWritten)

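Take SortShuffleWriter as an example. During the write it builds a BlockId from the current task's shuffle id, the map-side partition id (mapId), and a reduce id (here the placeholder NOOP_REDUCE_ID, because a single file holds the output for every reduce partition), and writes the shuffle data through the BlockManager under that BlockId. The MapStatus it returns records the id of the current BlockManager and the length of each partition within the file. This result is sent back to the driver, which indexes it by shuffle id and related keys, so that a downstream RDD, after asking the driver for the shuffle output locations, can go straight to the right BlockManager.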

In the write path above, both the persisted shuffle data file and the index file are in fact produced through the BlockManager.

Writing a shuffle result produces a data file and an index file; within the BlockManager the two are told apart by their BlockId.
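To make the relationship between the two files concrete, here is a minimal sketch of how per-partition lengths (the partitionLengths array above) can be turned into cumulative offsets, which is the layout the index file uses as far as I can tell from the Spark source. The object and method names (ShuffleIndexSketch, offsets, byteRange) are hypothetical and not part of Spark.

```scala
// A sketch only: ShuffleIndexSketch, offsets and byteRange are made-up names.
object ShuffleIndexSketch {

  /** Cumulative offsets: entry i is where reduce partition i starts;
    * the last entry equals the total length of the data file. */
  def offsets(partitionLengths: Array[Long]): Array[Long] =
    partitionLengths.scanLeft(0L)(_ + _)

  /** (start offset, length) of one reduce partition inside the data file. */
  def byteRange(offsets: Array[Long], reduceId: Int): (Long, Long) = {
    require(reduceId >= 0 && reduceId < offsets.length - 1,
      s"reduceId $reduceId out of range")
    val start = offsets(reduceId)
    (start, offsets(reduceId + 1) - start)
  }

  def main(args: Array[String]): Unit = {
    val idx = offsets(Array(10L, 0L, 25L))
    println(idx.mkString(", ")) // 0, 10, 10, 35
    println(byteRange(idx, 2))  // (10,25)
  }
}
```

With offsets in hand, a reader only needs two adjacent entries to know which byte range of the data file belongs to a given reduce partition.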

@DeveloperApi
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}

@DeveloperApi
case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".data"
}

@DeveloperApi
case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".index"
}

A BlockId is more than a file name: it is also the key the BlockManager later uses to locate the file.
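As a quick illustration of the naming scheme, the sketch below reproduces only the name logic of the data and index block ids shown above. BlockNameSketch and fileNames are hypothetical names, and NoopReduceId is assumed to mirror IndexShuffleBlockResolver.NOOP_REDUCE_ID, which is 0 in the Spark versions I am aware of.

```scala
// A sketch only: simplified stand-in for the naming logic, not Spark's classes.
object BlockNameSketch {

  // Assumption: mirrors IndexShuffleBlockResolver.NOOP_REDUCE_ID.
  val NoopReduceId: Int = 0

  /** (data file name, index file name) for one map task's shuffle output. */
  def fileNames(shuffleId: Int, mapId: Int): (String, String) = {
    val base = s"shuffle_${shuffleId}_${mapId}_${NoopReduceId}"
    (base + ".data", base + ".index")
  }

  def main(args: Array[String]): Unit = {
    val (data, index) = fileNames(shuffleId = 3, mapId = 7)
    println(data)  // shuffle_3_7_0.data
    println(index) // shuffle_3_7_0.index
  }
}
```

Because the name is fully determined by the ids, any component that knows the shuffle id and map id can recompute the file name without extra bookkeeping.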

Because these files go straight to disk, they are created under the configured local directories through getFile() of DiskBlockManager, which belongs to the BlockManager.

def getFile(filename: String): File = {
  // Figure out which local directory it hashes to, and which subdirectory in that
  val hash = Utils.nonNegativeHash(filename)
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

  // Create the subdirectory if it doesn't already exist
  val subDir = subDirs(dirId).synchronized {
    val old = subDirs(dirId)(subDirId)
    if (old != null) {
      old
    } else {
      val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
      if (!newDir.exists() && !newDir.mkdir()) {
        throw new IOException(s"Failed to create local dir in $newDir.")
      }
      subDirs(dirId)(subDirId) = newDir
      newDir
    }
  }

  new File(subDir, filename)
}

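DiskBlockManager maintains a two-dimensional array: each local directory holds a number of subdirectories. The file name derived from the BlockId is hashed to pick one subdirectory in that array, and a File at that path is returned, ready to be written. The file remains managed by the BlockManager and ends up serving as the data file for the shuffle output.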
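The directory selection can be tried out in isolation. The sketch below repeats the arithmetic from getFile() without touching the file system; LocalDirSketch, locate and relativePath are hypothetical names, and nonNegativeHash here is a simplified stand-in for Utils.nonNegativeHash, written under the assumption that it maps Int.MinValue to 0 and otherwise takes the absolute value.

```scala
// A sketch only: no directories are created, and the names are made up.
object LocalDirSketch {

  // Simplified stand-in for Utils.nonNegativeHash (assumption).
  // math.abs(Int.MinValue) is still negative, so that case is mapped to 0.
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  /** (index of the local dir, index of the subdirectory inside it). */
  def locate(filename: String, numLocalDirs: Int, subDirsPerLocalDir: Int): (Int, Int) = {
    val hash = nonNegativeHash(filename)
    (hash % numLocalDirs, (hash / numLocalDirs) % subDirsPerLocalDir)
  }

  /** Path relative to the chosen local dir, using the same "%02x" naming as getFile(). */
  def relativePath(filename: String, numLocalDirs: Int, subDirsPerLocalDir: Int): String = {
    val (_, subDirId) = locate(filename, numLocalDirs, subDirsPerLocalDir)
    "%02x".format(subDirId) + "/" + filename
  }

  def main(args: Array[String]): Unit = {
    println(locate("shuffle_3_7_0.data", numLocalDirs = 2, subDirsPerLocalDir = 64))
    println(relativePath("shuffle_3_7_0.data", 2, 64))
  }
}
```

The mapping is deterministic, which is the point: the writer and a later reader arrive at the same subdirectory from the file name alone.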

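When the downstream RDD fetches the data, it assembles BlockIds from its shuffle id and partition number and pulls them from the corresponding BlockManagers, ready for the next stage of processing.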

The actual fetching is done by ShuffleBlockFetcherIterator, whose initialize() method sends out the requests for remote shuffle data.

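BlockManager has a BlockTransferService member that handles data transfer between remote BlockManagers. Fetching data between BlockManagers really means passing data between their BlockTransferService instances; in the shuffle fetch path it goes by the name shuffleClient.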

shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
  blockFetchingListener, this)

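So ShuffleBlockFetcherIterator, too, fetches remote data through the fetchBlocks() method of BlockTransferService. The implementation used here is NettyBlockTransferService, which sends a remote request to the target BlockManager over Netty. The NettyBlockTransferService on the remote side looks up the local file for each requested BlockId and returns it as a stream.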

After the remote fetch requests have been sent, it then tries to read data from the local BlockManager by BlockId via fetchLocalBlocks().

Because the data is local, the file can be obtained directly through the getFile() method shown above; no network communication between BlockManagers is needed.
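The local versus remote distinction can be sketched as a simple partition step over the block locations returned by the driver. Everything below is illustrative: FetchSplitSketch, BlockLocation and splitLocalRemote are hypothetical names, block ids are plain strings, and comparing executor ids to decide locality is an assumption about how the real iterator makes the decision.

```scala
// A sketch only: hypothetical types, no networking, block ids as plain strings.
object FetchSplitSketch {

  final case class BlockLocation(executorId: String, host: String, port: Int)

  /** Splits blocks into those readable from local disk and those that need
    * a remote fetch, grouped by the BlockManager address that holds them. */
  def splitLocalRemote(
      localExecutorId: String,
      blocksByAddress: Seq[(BlockLocation, Seq[String])]
  ): (Seq[String], Map[BlockLocation, Seq[String]]) = {
    // Assumption: a block is local when it lives on this executor.
    val (local, remote) =
      blocksByAddress.partition { case (loc, _) => loc.executorId == localExecutorId }
    val remoteByAddress =
      remote.groupBy(_._1).map { case (loc, entries) => loc -> entries.flatMap(_._2) }
    (local.flatMap(_._2), remoteByAddress)
  }

  def main(args: Array[String]): Unit = {
    val here = BlockLocation("exec-1", "host-a", 7337)
    val there = BlockLocation("exec-2", "host-b", 7337)
    val (local, remote) = splitLocalRemote(
      "exec-1",
      Seq(here -> Seq("shuffle_0_0_1"), there -> Seq("shuffle_0_1_1", "shuffle_0_2_1")))
    println(local)         // blocks served from local disk
    println(remote(there)) // blocks that need a fetchBlocks() call
  }
}
```

In this sketch the remote groups would each become one fetchBlocks() request, while the local list is read straight from disk.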

tydhot
