Spark source reading -- shuffle process analysis

ShuffleManager (Part 1)

Next, let's take a careful look at another important Spark kernel module, the shuffle manager, ShuffleManager. Shuffle is arguably the most important concept in distributed computing: joins, deduplication, aggregation and similar operations all require this step. One of the main reasons Spark outperforms MapReduce is its optimization of the shuffle process: on one side, Spark makes better use of memory during shuffle (the execution memory discussed in the earlier memory management analysis), and on the other side, it introduces merge sort and index files for the spill files written to disk during shuffle. Of course, another main reason for Spark's high performance is the optimization of the computation chain: multiple map-type computation steps are chained together, greatly reducing intermediate writes to disk, which is a significant difference between Spark and MapReduce.
In recent Spark versions the default shuffle manager is SortShuffleManager.

Part of the SparkEnv initialization code:

  val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)

ShuffleMapTask.runTask

Where should we start reading the shuffle manager source code? Think about the shuffle process: it boils down to two steps, write and read. The write happens in the map phase, where data is classified into different partitions according to the partitioning rules; the read happens in the reduce phase, where each partition pulls its own data from the map-stage output. We can basically follow ShuffleManager along this path through the source code. We analyze the write process first, because for a complete shuffle the write obviously has to happen before the read.
Recalling the earlier analysis of how a job runs, we should remember that a job is cut into tasks that execute on the executor side, and the stage on the shuffle map side is cut into ShuffleMapTasks. The shuffle write process is completed in exactly this class, so let's look at the code:

As we can see below, a shuffle writer is obtained through ShuffleManager.getWriter, and the data computed by the RDD is then written to disk through it.

override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val threadMXBean = ManagementFactory.getThreadMXBean
val deserializeStartTime = System.currentTimeMillis()
val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime
} else 0L
val ser = SparkEnv.get.closureSerializer.newInstance()
// deserialize the RDD and the shuffle dependency; this is the key step
// worth thinking about: when the RDD and shuffle dependency are deserialized here,
// how is the SparkContext object referenced inside them deserialized?
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
} else 0L

var writer: ShuffleWriter[Any, Any] = null
try {
  // the shuffle manager
  val manager = SparkEnv.get.shuffleManager
  // get a shuffle writer
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
  // Here we can see that the core method of RDD computation is the iterator method.
  // SortShuffleWriter's write method can be broken down into a few steps:
  // write the data computed by the upstream RDD (obtained via rdd.iterator) into an in-memory buffer;
  // during the write, if the memory threshold is exceeded, spill to disk, possibly producing multiple files;
  // finally, merge-sort the spill files together with the data remaining in memory and write one big data file.
  // The sort is by partition first, then by key.
  // During this final merged write, a flush is performed after each partition and the offset of that
  // partition's data within the file is recorded.
  // So after a task finishes writing, there are actually two files on disk: the data file and an index
  // file recording the offset of each reduce-side partition's data.
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  // mainly deletes the intermediate spill files and releases the acquired memory back to the memory manager
  writer.stop(success = true).get
} catch {
  case e: Exception =>
    try {
      if (writer != null) {
        writer.stop(success = false)
      }
    } catch {
      case e: Exception =>
        log.debug("Could not stop writer", e)
    }
    throw e
}
}

SortShuffleManager.getWriter

Here, a different type of ShuffleWriter is returned depending on the shuffle handle; in most cases it is a SortShuffleWriter, so we will look directly at the SortShuffleWriter.write method.

/** Get a writer for a given partition. Called on executors by map tasks. */
// Get a shuffle writer; called on the executor side when a map task runs
override def getWriter[K, V](
  handle: ShuffleHandle,
  mapId: Int,
  context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
  handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
handle match {
  case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
    new UnsafeShuffleWriter(
      env.blockManager,
      shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
      context.taskMemoryManager(),
      unsafeShuffleHandle,
      mapId,
      context,
      env.conf)
  case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
    new BypassMergeSortShuffleWriter(
      env.blockManager,
      shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
      bypassMergeSortHandle,
      mapId,
      context,
      env.conf)
  case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
    new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}

SortShuffleWriter.write

Let's first summarize the main logic of this method:

  • Get a sorter (ExternalSorter), passing different parameters depending on whether map-side combine (aggregation) is needed
  • Insert the data into the sorter; during this process spill files may be written to disk
  • Get the shuffle output data file name based on the shuffle id and the map partition id
  • Merge-sort the multiple spill files together with the data remaining in memory and write them to a single data file, returning the length of each reduce partition's data in that file
  • Write these lengths (offsets) into an index file, and rename the temporary data and index files to their final names
  • Finally, wrap the result in a MapStatus object, which is the return value of ShuffleMapTask.runTask
  • The stop method does some cleanup work: recording disk IO time metrics and deleting the intermediate spill files

      override def write(records: Iterator[Product2[K, V]]): Unit = {
      sorter = if (dep.mapSideCombine) {
        // the map-side combine case; the user should have provided an aggregator and an ordering
        require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
        new ExternalSorter[K, V, C](
          context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      } else {
        // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
        // care whether the keys get sorted in each partition; that will be done on the reduce side
        // if the operation being run is sortByKey.
        new ExternalSorter[K, V, V](
          context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
      }
      // write all of the map output into the sorter;
      // multiple spill files may be generated during this process
      sorter.insertAll(records)
    
      // Don't bother including the time to open the merged output file in the shuffle write time,
      // because it just opens a single file, so is typically too fast to measure accurately
      // (see SPARK-3570).
      // mapId is the partitionId of the RDD on the shuffle-map side
      // get the shuffle output data file name for this map partition
      val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
      // add a uuid suffix to get a temporary file
      val tmp = Utils.tempFileWith(output)
      try {
        val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
        // merge-sort the files spilled to disk and the data still in memory,
        // and write the result into a single file; at this point the file has a temporary name
        val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
        // write the index file, and atomically rename (move) the temporary index and data files to their final names
        shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
        // return a status object containing the shuffle server id and the length of each partition's data in the file
        mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
      } finally {
        if (tmp.exists() && !tmp.delete()) {
          logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
        }
      }
      }

IndexShuffleBlockResolver

Let's first look at how the shuffle output file name is obtained. It comes from the IndexShuffleBlockResolver component, which internally uses the DiskBlockManager mentioned in the earlier block manager analysis to allocate file names. The DiskBlockManager lives inside the BlockManager; its role is to manage file name allocation and the creation and deletion of the subdirectories Spark uses. We can see that the naming rules for data files and index files differ; they are defined in ShuffleDataBlockId and ShuffleIndexBlockId respectively.

def getDataFile(shuffleId: Int, mapId: Int): File = {
  blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID))
}

private def getIndexFile(shuffleId: Int, mapId: Int): File = {
  blockManager.diskBlockManager.getFile(ShuffleIndexBlockId(shuffleId, mapId, NOOP_REDUCE_ID))
}
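
For reference, here is a sketch of the resulting file names based on the name patterns defined in ShuffleDataBlockId and ShuffleIndexBlockId; the helper functions are illustrative, not Spark code. The reduce id in the name is NOOP_REDUCE_ID, i.e. 0, because one map task writes a single file covering all reduce partitions.

// Illustrative helpers only; the real names come from ShuffleDataBlockId / ShuffleIndexBlockId.
def dataFileName(shuffleId: Int, mapId: Int): String = s"shuffle_${shuffleId}_${mapId}_0.data"
def indexFileName(shuffleId: Int, mapId: Int): String = s"shuffle_${shuffleId}_${mapId}_0.index"

// e.g. dataFileName(0, 3) == "shuffle_0_3_0.data"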

ExternalSorter.insertAll

Following the call order in SortShuffleWriter, we first look at the ExternalSorter.insertAll method:

  • First, depending on whether map-side combine is needed, there are two cases, and the in-memory storage structures used differ: with map-side combine a PartitionedAppendOnlyMap is used, without it a PartitionedPairBuffer. PartitionedAppendOnlyMap is a map structure implemented on top of an array with linear probing.
  • Then the data is inserted into the in-memory structure one record at a time in a loop, taking the map-side combine case into account

      def insertAll(records: Iterator[Product2[K, V]]): Unit = {
      // TODO: stop combining if we find that the reduction factor isn't high
      val shouldCombine = aggregator.isDefined
    
        // the map-side combine case
      if (shouldCombine) {
        // Combine values in-memory first using our AppendOnlyMap
        val mergeValue = aggregator.get.mergeValue
        val createCombiner = aggregator.get.createCombiner
        var kv: Product2[K, V] = null
        val update = (hadValue: Boolean, oldValue: C) => {
          if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
        }
        while (records.hasNext) {
          addElementsRead()
          kv = records.next()
          // insert one record into the in-memory buffer
          map.changeValue((getPartition(kv._1), kv._1), update)
          // if the buffer exceeds the memory threshold, spill to disk, producing a file
          // memory usage is checked after every record written
          maybeSpillCollection(usingMap = true)
        }
      } else { // the case without map-side combine
        // Stick values into our buffer
        while (records.hasNext) {
          addElementsRead()
          val kv = records.next()
          buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
          maybeSpillCollection(usingMap = false)
        }
      }
      }

AppendOnlyMap.changeValue

Now let's look at a slightly more complex structure, AppendOnlyMap:

  • First, handle the case where the key is null
  • Compute the key's hash and take it modulo the capacity. Note that since the capacity is a power of two, the modulo is equivalent to a bitwise AND with capacity - 1, the same trick used in java.util.HashMap (a short sketch follows this list)
  • If no old value exists, insert the new value directly
  • If an old value exists, update it
  • If a hash collision occurs, probe forward, with a probing step that grows on each collision
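
Below is a tiny illustrative sketch (not Spark code) of the power-of-two masking trick mentioned in the list above; the capacity value is just an example.

// When capacity is a power of two, (hash & (capacity - 1)) equals the non-negative hash
// modulo capacity, so the expensive % operation can be replaced by a bitwise AND.
val capacity = 64                                   // example value; must be a power of two
val mask = capacity - 1
val hash = "someKey".hashCode & Integer.MAX_VALUE   // force non-negative for the comparison
assert((hash & mask) == (hash % capacity))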

As we can see, this design is quite clever. There is also the incrementSize method, which checks the current number of entries; if the threshold is exceeded the map grows, which is a relatively complex re-hash and redistribution process. One important point: whether inserting new data or redistributing entries during the re-hash, the hash collision handling strategy must be the same, otherwise the data would become inconsistent.

// insert a key-value pair into the array
def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
assert(!destroyed, destructionMessage)
val k = key.asInstanceOf[AnyRef]
  // handle the case where the key is null
if (k.eq(null)) {
    // if this is the first time a null key is inserted, increase the size by 1
  if (!haveNullValue) {
    incrementSize()
  }
  nullValue = updateFunc(haveNullValue, nullValue)
  haveNullValue = true
  return nullValue
}
var pos = rehash(k.hashCode) & mask
// handle hash collisions with linear probing
// this is an accelerated probe: step 1 position on the first collision,
// 2 positions on the second, 3 on the third, and so on
var i = 1
while (true) {
  val curKey = data(2 * pos)
  if (curKey.eq(null)) { // no old value exists yet, insert directly
    val newValue = updateFunc(false, null.asInstanceOf[V])
    data(2 * pos) = k
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    incrementSize()
    return newValue
  } else if (k.eq(curKey) || k.equals(curKey)) { // the old value exists, update it
    val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    return newValue
  } else { // hash collision: probe forward with a growing step
    val delta = i
    pos = (pos + delta) & mask
    i += 1
  }
}
null.asInstanceOf[V] // Never reached but needed to keep compiler happy
}

ExternalSorter.maybeSpillCollection

Back in ExternalSorter's insert method, after each record is inserted the memory usage is checked to decide whether the data needs to be spilled to disk, and if so it is spilled.
This method calls map.estimateSize to estimate the memory footprint of the data inserted so far. The tracking and estimation of the memory footprint is implemented in the SizeTracker trait. In the earlier MemoryStore analysis I mentioned that when object-typed data is put into memory, an intermediate DeserializedValuesHolder is used, which internally holds a SizeTrackingVector; that class likewise mixes in SizeTracker to track and estimate its own size.
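
To make that concrete, here is a simplified, self-contained sketch of the sampling-and-extrapolation idea; it is not Spark's SizeTracker, just an illustration of the approach (measure the real size only at exponentially spaced update counts, because measuring the object graph on every insert would be far too expensive, and extrapolate in between):

// Simplified sketch of size tracking by sampling (illustrative only, not Spark's SizeTracker).
class SizeSampler(measure: () => Long, growthRate: Double = 1.1) {
  private var numUpdates = 0L        // updates since the tracker was created
  private var nextSampleAt = 1L      // take the next real measurement at this update count
  private var bytesPerUpdate = 0.0   // estimated growth per update between samples
  private var lastSample = (0L, 0L)  // (numUpdates at last sample, measured size)

  // call after every insert/update of the tracked collection
  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleAt) takeSample()
  }

  private def takeSample(): Unit = {
    val size = measure()             // the expensive real measurement
    val (prevUpdates, prevSize) = lastSample
    if (numUpdates > prevUpdates) {
      bytesPerUpdate = math.max(0.0, (size - prevSize).toDouble / (numUpdates - prevUpdates))
    }
    lastSample = (numUpdates, size)
    nextSampleAt = math.ceil(numUpdates * growthRate).toLong
  }

  // cheap estimate, safe to call on every maybeSpill-style check
  def estimateSize(): Long = {
    val (sampledUpdates, sampledSize) = lastSample
    (sampledSize + bytesPerUpdate * (numUpdates - sampledUpdates)).toLong
  }
}

With this kind of scheme, maybeSpillCollection below only pays the cost of a cheap estimate on every insert.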

private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) {
  estimatedSize = map.estimateSize()
  if (maybeSpill(map, estimatedSize)) {
    map = new PartitionedAppendOnlyMap[K, C]
  }
} else {
  estimatedSize = buffer.estimateSize()
  if (maybeSpill(buffer, estimatedSize)) {
    buffer = new PartitionedPairBuffer[K, C]
  }
}

if (estimatedSize > _peakMemoryUsedBytes) {
  _peakMemoryUsedBytes = estimatedSize
}
}

ExternalSorter.maybeSpill

First it checks whether the current memory usage exceeds the threshold; if so it requests more execution memory from the memory manager, and if not enough execution memory is granted, the data still has to be spilled to disk.

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
// check once every 32 records written
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
  // Claim up to double our current memory from the shuffle memory pool
  val amountToRequest = 2 * currentMemory - myMemoryThreshold
  // request execution memory from the memory manager
  val granted = acquireMemory(amountToRequest)
  myMemoryThreshold += granted
  // If we were granted too little memory to grow further (either tryToAcquire returned 0,
  // or we already had more memory than myMemoryThreshold), spill the current collection
  // if memory usage still exceeds the threshold, we need to spill
  shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
// Actually spill
if (shouldSpill) {
  _spillCount += 1
  logSpillage(currentMemory)
  // spill to disk
  spill(collection)
  _elementsRead = 0
  _memoryBytesSpilled += currentMemory
  // release the memory
  releaseMemory()
}
shouldSpill
}

ExternalSorter.spill

Continuing from the method above:

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// get a sorted iterator over the in-memory data
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// write the data to a disk file
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
spills += spillFile
}

WritablePartitionedPairCollection.destructiveSortedWritablePartitionedIterator

This method returns an iterator sorted by partition and then by key; the actual sorting logic is in AppendOnlyMap.destructiveSortedIterator.
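
As a short illustrative sketch (an assumption about the shape of the comparator, not the exact Spark code), a partition-then-key comparator can look like this:

import java.util.Comparator

// Illustrative: compare by partition id first; only fall back to the key comparator
// for entries in the same partition. This is the ordering the spill files rely on.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val partitionDiff = a._1 - b._1
      if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
    }
  }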

AppendOnlyMap.destructiveSortedIterator

The code has two parts: first the array is compacted, moving the sparsely scattered data to the head of the array; then the array is sorted with the comparator, which compares by partition first and, for entries in the same partition, by key; finally an iterator is returned, which is just a wrapper around the array. From this method we get a rough idea of AppendOnlyMap's sorting logic.

def destructiveSortedIterator(keyComparator: Comparator[K]): Iterator[(K, V)] = {
destroyed = true
// Pack KV pairs into the front of the underlying array
// this moves all the sparsely scattered entries to the head of the array, compacting it
var keyIndex, newIndex = 0
while (keyIndex < capacity) {
  if (data(2 * keyIndex) != null) {
    data(2 * newIndex) = data(2 * keyIndex)
    data(2 * newIndex + 1) = data(2 * keyIndex + 1)
    newIndex += 1
  }
  keyIndex += 1
}
assert(curSize == newIndex + (if (haveNullValue) 1 else 0))

// sort the data with the comparator
new Sorter(new KVArraySortDataFormat[K, AnyRef]).sort(data, 0, newIndex, keyComparator)

new Iterator[(K, V)] {
  var i = 0
  var nullValueReady = haveNullValue
  def hasNext: Boolean = (i < newIndex || nullValueReady)
  def next(): (K, V) = {
    if (nullValueReady) {
      nullValueReady = false
      (null.asInstanceOf[K], nullValue)
    } else {
      val item = (data(2 * i).asInstanceOf[K], data(2 * i + 1).asInstanceOf[V])
      i += 1
      item
    }
  }
}
}

ExternalSorter.spillMemoryIteratorToDisk

Back in the ExternalSorter.spill method: after obtaining the sorted iterator, the data can be spilled to disk.
I won't paste the code of this method; its main steps are as follows (a simplified sketch follows the list):

  • First obtain a temporary block id and temporary file name through DiskBlockManager
  • Obtain a disk writer through BlockManager, i.e. a DiskBlockObjectWriter, which internally wraps the logic of writing a file with the Java stream API
  • Write the data to disk record by record in a loop, flushing periodically (every configured number of records the in-memory buffer is flushed to disk)
  • If an exception occurs, the data already written to the file is rolled back
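
Since the code isn't shown here, the following is a self-contained, simplified sketch of that pattern; it is not Spark's spillMemoryIteratorToDisk (the record type, file format and batch size are made up): write records in batches, flush periodically, and delete the partial file if anything fails.

import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

// Simplified sketch of the spill pattern: not Spark code, just the steps described above.
def spillToDisk(records: Iterator[(Int, String, String)],  // (partition, key, value)
                tmpFile: File,
                batchSize: Int = 10000): File = {
  val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmpFile)))
  var written = 0L
  var success = false
  try {
    for ((partition, key, value) <- records) {
      out.writeInt(partition); out.writeUTF(key); out.writeUTF(value)
      written += 1
      if (written % batchSize == 0) out.flush()  // periodic flush, like the batched writes above
    }
    out.flush()
    success = true
    tmpFile
  } finally {
    out.close()
    if (!success) tmpFile.delete()  // roll back the partial spill file on failure
  }
}

The actual Spark code writes serialized key-value pairs through a DiskBlockObjectWriter rather than raw strings; this sketch only shows the batching and rollback pattern.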

Summary

To sum up, the whole process by which ExternalSorter spills data to disk is:

  • First, data is inserted one record at a time into the internal map (or buffer) structure
  • After each insert the memory usage is checked; if it exceeds the threshold and not enough execution memory can be acquired, the data currently in memory is spilled to disk
  • Each spill works like this: the data is first sorted by partition and key, so that records of the same partition are grouped together and, if a key ordering is provided, sorted by key within each partition; then a disk writer is obtained through BlockManager and DiskBlockManager and the data is written to disk as one file. The spill file information is then recorded.
  • Over the whole write process, multiple spill files may be produced

ExternalSorter.writePartitionedFile

The main steps are:

  • Still obtain a disk writer through blockManager
  • Merge-sort the multiple spill files on disk together with the data remaining in memory, exposing the result as an iterator grouped by partition
  • Write the data to disk in a loop; every time a partition is finished, flush once so the data is synchronized from the OS file buffer to disk, then take the resulting file segment length and record it as that partition's extent in the file

      def writePartitionedFile(
        blockId: BlockId,
        outputFile: File): Array[Long] = {
    
      // Track location of each range in the output file
      val lengths = new Array[Long](numPartitions)
      val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
        context.taskMetrics().shuffleWriteMetrics)
    
      // if no data was spilled to disk earlier,
      // we only need to write the in-memory data out to disk
      if (spills.isEmpty) {
        // Case where we only have in-memory data
        val collection = if (aggregator.isDefined) map else buffer
        // returns a sorted iterator over the in-memory data
        val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
        while (it.hasNext) {
          val partitionId = it.nextPartition()
          while (it.hasNext && it.nextPartition() == partitionId) {
            it.writeNext(writer)
          }
          // flush once after finishing a partition
          val segment = writer.commitAndGet()
          // record the length of this partition's data in the file
          lengths(partitionId) = segment.length
        }
      } else { // there are spill files on disk
        // We must perform merge-sort; get an iterator by partition and write everything directly.
        // wrap an iterator that merge-sorts the spill files and the in-memory buffer
        // TODO this wrapped iterator is the key to the merge sort
        for ((id, elements) <- this.partitionedIterator) {
          if (elements.hasNext) {
            for (elem <- elements) {
              writer.write(elem._1, elem._2)
            }
            // after each partition is written, flush once and take the file segment;
            // its length gives this partition's extent in the file, which is what the
            // reduce side uses to locate the data it should fetch when pulling
            val segment = writer.commitAndGet()
            lengths(id) = segment.length
          }
        }
      }
    
      writer.close()
      // update some write metrics afterwards
      context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
      context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
    
      // return the length of each reduce-side partition's data in the file
      lengths
      }

IndexShuffleBlockResolver.writeIndexFileAndCommit

Back again in the SortShuffleWriter.write method, the last step is a call to IndexShuffleBlockResolver.writeIndexFileAndCommit.
The role of this method is mainly to write the offset of each partition into an index file, and to rename the temporary index file and temporary data file to their final names (the rename is an atomic operation).
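
To make the index file idea concrete, here is a simplified illustrative sketch (not the actual IndexShuffleBlockResolver code): cumulative offsets are derived from the per-partition lengths, starting at 0, so the reduce side can seek straight to its partition.

import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

// Illustrative sketch: turn per-partition lengths into cumulative offsets and write them out.
def writeIndex(indexFile: File, partitionLengths: Array[Long]): Unit = {
  val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexFile)))
  try {
    var offset = 0L
    out.writeLong(offset)                 // offset of partition 0
    for (length <- partitionLengths) {
      offset += length
      out.writeLong(offset)               // end offset of this partition / start of the next
    }
  } finally {
    out.close()
  }
}

With numPartitions lengths the sketched index holds numPartitions + 1 longs; reading entries i and i + 1 tells a reducer where its partition starts and ends in the data file.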

To sum up

To summarize, the shuffle write process can be divided into the following main steps (a small end-to-end example follows the list):

  • First, because memory is insufficient while the data is being written, it is spilled to multiple files, each sorted by partition and key; this lays the foundation for the later merge sort
  • The second part merge-sorts these small spill files together with the data remaining in memory and writes them into one large data file, recording the offset of each partition's data in the file as it is written
  • Finally, an index file is written, recording the offset of each reduce-side partition in the data file, so that the reduce side can quickly locate the partition data it needs when pulling data
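
Finally, for context, a minimal runnable example (an assumed local-mode setup, names chosen for illustration) that triggers exactly this write path: any wide transformation such as reduceByKey makes each map-side ShuffleMapTask produce one data file plus one index file as described above.

import org.apache.spark.sql.SparkSession

// Minimal local example; the app name and parallelism here are just for illustration.
object ShuffleWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("shuffle-write-demo").getOrCreate()
    val counts = spark.sparkContext
      .parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle boundary: the map side runs the write path analyzed above
    counts.collect().foreach(println)
    spark.stop()
  }
}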


Origin www.cnblogs.com/zhuge134/p/11026040.html