Spark source code analysis: ExternalSorter

SortShuffleWriter calls two ExternalSorter methods: insertAll and writePartitionedFile.

1】、blockManager
2】、diskBlockManager
3】、serializerManager
4】、fileBufferSize
spark.shuffle.file.buffer=32k
5】、serializerBatchSize
spark.shuffle.spill.batchSize=10000
6】、map(PartitionedAppendOnlyMap)
private var data = new Array[AnyRef](2 * capacity)
Keys and values sit interleaved in this plain on-heap array, i.e. the memory consumed is not Storage memory (a minimal sketch follows after item 9】).
7】、buffer(PartitionedPairBuffer)
8】、forceSpillFiles(ArrayBuffer[SpilledFile])
When the PartitionedAppendOnlyMap no longer fits and has to be spilled, records are not written to disk one by one; they first go through a write buffer, which is then flushed to the spill file in one go. The size of that buffer is determined by fileBufferSize.
9】、spills(ArrayBuffer[SpilledFile])
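A minimal sketch of the idea behind that flat array (class name and details are hypothetical and simplified; the real PartitionedAppendOnlyMap uses open addressing with quadratic probing, automatic growth/rehashing and a (partitionId, key) key type). The point is that keys and values are ordinary JVM heap objects, not blocks in the Storage pool:

class TinyFlatMap[K <: AnyRef, V <: AnyRef](capacity: Int = 64) {
  // key at data(2 * pos), value at data(2 * pos + 1)
  private val data = new Array[AnyRef](2 * capacity)

  def update(key: K, value: V): Unit = {
    var pos = (key.hashCode & Int.MaxValue) % capacity
    var probes = 0
    while (probes < capacity) {
      val cur = data(2 * pos)
      if (cur == null || cur == key) {          // empty slot or same key: write here
        data(2 * pos) = key
        data(2 * pos + 1) = value
        return
      }
      pos = (pos + 1) % capacity                // linear probing in this sketch
      probes += 1
    }
    throw new IllegalStateException("sketch is full (the real map grows and rehashes)")
  }

  def get(key: K): Option[V] = {
    var pos = (key.hashCode & Int.MaxValue) % capacity
    var probes = 0
    while (probes < capacity) {
      val cur = data(2 * pos)
      if (cur == null) return None
      if (cur == key) return Some(data(2 * pos + 1).asInstanceOf[V])
      pos = (pos + 1) % capacity
      probes += 1
    }
    None
  }
}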
10】、insertAll
if (shouldCombine) {
  // Combine values in-memory first using our AppendOnlyMap
  val mergeValue = aggregator.get.mergeValue
  val createCombiner = aggregator.get.createCombiner
  var kv: Product2[K, V] = null
  val update = (hadValue: Boolean, oldValue: C) => {
    if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
  }
  while (records.hasNext) {
    addElementsRead()
    kv = records.next()
    map.changeValue((getPartition(kv._1), kv._1), update)
    maybeSpillCollection(usingMap = true)
  }
} else {
  // Stick values into our buffer
  while (records.hasNext) {
    addElementsRead()
    val kv = records.next()
    buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
    maybeSpillCollection(usingMap = false)
  }
}
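For intuition about the update closure above, here is what the aggregator's two functions might look like for a simple sum (purely illustrative values and types, e.g. what reduceByKey(_ + _) would produce):

// Illustrative only: a sum-style aggregator.
val createCombiner: Int => Int = v => v              // first value seen for a key becomes the combiner
val mergeValue: (Int, Int) => Int = (c, v) => c + v  // later values are merged into the combiner

var kv: (String, Int) = ("a", 3)
val update = (hadValue: Boolean, oldValue: Int) =>
  if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)

update(false, 0)   // key "a" not seen before -> createCombiner(3) = 3
update(true, 3)    // key "a" already maps to 3 -> mergeValue(3, 3) = 6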

override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  val newValue = super.changeValue(key, updateFunc)
  super.afterUpdate()
  newValue
}
After every update of the collection this trait is mixed into, SizeTracker's afterUpdate method is called. afterUpdate checks how many updates have happened and, when needed, uses SizeEstimator's estimate method to estimate the collection's size. Since a call to SizeEstimator is expensive (the comment says several milliseconds), it cannot be invoked frequently. So SizeTracker counts updates and samples at exponentially growing intervals with base 1.1: estimate is called when the update count reaches 1.1, 1.1 * 1.1, 1.1 * 1.1 * 1.1, ...
The exponential curve starts out slowly: 1.1^96 is roughly 10,000 and 1.1^144 roughly 1,000,000. So for 10,000 updates it performs about 96 estimates, for 100,000 updates about 120, for 1,000,000 about 144, and for 10,000,000 about 169.
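A rough sketch of that sampling schedule (class and field names here are illustrative, not the real SizeTracker): every update bumps a counter, and a new expensive estimate is only taken when the counter crosses the next threshold, which grows by a factor of 1.1.

// Illustrative sketch of exponential size sampling.
// estimateOnce() stands in for the expensive SizeEstimator.estimate call.
class SamplingSizeTracker(estimateOnce: () => Long) {
  private val growthRate = 1.1
  private var numUpdates = 0L
  private var nextSampleNum = 1L
  private var lastSample = 0L
  private var bytesPerUpdate = 0.0
  private var updatesAtLastSample = 0L

  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleNum) {
      val newSample = estimateOnce()                               // expensive: a few ms
      val deltaUpdates = math.max(1L, numUpdates - updatesAtLastSample)
      bytesPerUpdate = math.max(0.0, (newSample - lastSample).toDouble / deltaUpdates)
      lastSample = newSample
      updatesAtLastSample = numUpdates
      nextSampleNum = math.ceil(numUpdates * growthRate).toLong    // next sample ~1.1x later
    }
  }

  // Between samples, extrapolate from the last measured size.
  def estimateSize(): Long =
    lastSample + (bytesPerUpdate * (numUpdates - updatesAtLastSample)).toLong
}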

11】、maybeSpillCollection

After every record is inserted, it checks whether the in-memory collection (PartitionedAppendOnlyMap or PartitionedPairBuffer) needs to spill.

var estimatedSize = 0L
if (usingMap) {
  estimatedSize = map.estimateSize()
  if (maybeSpill(map, estimatedSize)) {
    map = new PartitionedAppendOnlyMap[K, C]
  }
} else {
  estimatedSize = buffer.estimateSize()
  if (maybeSpill(buffer, estimatedSize)) {
    buffer = new PartitionedPairBuffer[K, C]
  }
}
if (estimatedSize > _peakMemoryUsedBytes) {
  _peakMemoryUsedBytes = estimatedSize
}
12】、maybeSpill
If the PartitionedAppendOnlyMap's memory footprint were measured precisely after every record, at say 1 ms per check, 10 million records would already cost an unacceptable amount of time. That is why estimateSize is based on sampling (see above).
(1) Only on every 32nd inserted element, and only if currentMemory (the estimatedSize) has reached myMemoryThreshold;
(2) if (1) holds, request 2 * currentMemory - myMemoryThreshold more memory from the shuffleMemoryManager;
(3) spill if the memory granted in (2) is still not enough (currentMemory is still >= myMemoryThreshold), or if the number of records held in memory exceeds numElementsForceSpillThreshold.
Notes:
currentMemory comes from the map's estimateSize
myMemoryThreshold is configured via spark.shuffle.spill.initialMemoryThreshold, default 5 * 1024 * 1024
the shuffleMemoryManager can hand out at most ExecutorHeapMemory * 0.2 * 0.8
numElementsForceSpillThreshold is configured via spark.shuffle.spill.numElementsForceSpillThreshold, default Long.MaxValue
var shouldSpill = false
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
  val amountToRequest = 2 * currentMemory - myMemoryThreshold
  val granted = acquireMemory(amountToRequest)
  myMemoryThreshold += granted
  shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
if (shouldSpill) {
  _spillCount += 1
  logSpillage(currentMemory)
  spill(collection)
  _elementsRead = 0
  _memoryBytesSpilled += currentMemory
  releaseMemory()
}
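A worked example with made-up numbers: with the default myMemoryThreshold of 5 MB, suppose the 32nd insert pushes the estimated size to 6 MB. Then amountToRequest = 2 * 6 - 5 = 7 MB. If the memory manager grants all 7 MB, the threshold rises to 12 MB and no spill happens; if it grants only, say, 0.5 MB, the threshold becomes 5.5 MB, currentMemory (6 MB) is still above it, and the collection is spilled to disk.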
13】、writePartitionedFile

Writes the in-memory data (the map or buffer) together with the spill files to the real output file.

val lengths = new Array[Long](numPartitions)
val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
  context.taskMetrics().shuffleWriteMetrics)
if (spills.isEmpty) {
  // Case where we only have in-memory data
  val collection = if (aggregator.isDefined) map else buffer
  val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
  while (it.hasNext) {
    val partitionId = it.nextPartition()
    while (it.hasNext && it.nextPartition() == partitionId) {
      it.writeNext(writer)
    }
    val segment = writer.commitAndGet()
    lengths(partitionId) = segment.length
  }
} else {
  // We must perform merge-sort; get an iterator by partition and write everything directly.
  for ((id, elements) <- this.partitionedIterator) {
    if (elements.hasNext) {
      for (elem <- elements) {
        writer.write(elem._1, elem._2)
      }
      val segment = writer.commitAndGet()
      lengths(id) = segment.length
    }
  }
}
destructiveSortedWritablePartitionedIterator calls partitionedDestructiveSortedIterator to sort the map. First the comparator is built: if a key comparator is supplied, entries are sorted by partition ID and then by key; otherwise they are sorted by partition ID only. It then calls destructiveSortedIterator, which is the actual sorter: using the given comparator it sorts the map in place, trading away the map's normal structure so that no extra space is needed.
val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
destructiveSortedIterator(comparator)

It returns a WritablePartitionedIterator, whose elements can be written out with a BlockObjectWriter.
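The two comparators mentioned above order (partitionId, key) pairs. Roughly (a simplified sketch, not copied from the Spark source; signatures may differ between versions), partitionComparator sorts by partition ID only, while partitionKeyComparator falls back to the key comparator when the partition IDs are equal:

import java.util.Comparator

// Simplified sketch of the two comparators over (partitionId, key) pairs.
def partitionComparator[K]: Comparator[(Int, K)] = new Comparator[(Int, K)] {
  override def compare(a: (Int, K), b: (Int, K)): Int =
    Integer.compare(a._1, b._1)                 // sort by partition ID only
}

def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition
      else keyComparator.compare(a._2, b._2)    // same partition: fall back to key order
    }
  }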

14】、partitionedIterator
partitionedIterator returns one iterator per partition ID.
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    if (!ordering.isDefined) {
      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
    } else {
      // We do need to sort by both partition ID and key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // Merge spilled and in-memory data
    merge(spills, destructiveIterator(collection.partitionedDestructiveSortedIterator(comparator)))
  }
}
groupByPartition:
  (0 until numPartitions).iterator.map(p => (p, new IteratorForPartition(p, buffered)))
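IteratorForPartition is conceptually simple. A sketch of the idea (hypothetical class name with a Sketch suffix, simplified from the behaviour described here): all partition iterators share the same buffered, partition-sorted source, and each one stops as soon as the head element belongs to a different partition.

class IteratorForPartitionSketch[K, C](
    partitionId: Int,
    data: BufferedIterator[((Int, K), C)])
  extends Iterator[Product2[K, C]] {

  // Only yield elements while the head of the shared iterator still belongs to this partition.
  override def hasNext: Boolean = data.hasNext && data.head._1._1 == partitionId

  override def next(): Product2[K, C] = {
    if (!hasNext) throw new NoSuchElementException
    val ((_, k), c) = data.next()
    (k, c)
  }
}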
15】、merge
Reads the in-memory data (the map) and the data in the spill files partition by partition, merges them, and re-sorts them with the comparator (a heap-based merge sort), returning one iterator per partition ID.

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
16】、mergeWithAggregation
Depending on whether the input iterators carry a total ordering, returns a combined iterator for the partition. If the ordering is only partial (e.g. a hash-based comparator), several distinct keys may compare as equal, so combining at next() requires extra work: all keys the comparator considers equal must be read and compared. If the ordering is total, records with the same key are adjacent, so the next value with the same key can simply be combined directly.
private def mergeWithAggregation(
    iterators: Seq[Iterator[Product2[K, C]]],
    mergeCombiners: (C, C) => C,
    comparator: Comparator[K],
    totalOrder: Boolean)
    : Iterator[Product2[K, C]] =
{
  if (!totalOrder) {
    // We only have a partial ordering, e.g. comparing the keys by hash code, which means that
    // multiple distinct keys might be treated as equal by the ordering. To deal with this, we
    // need to read all keys considered equal by the ordering at once and compare them.
    new Iterator[Iterator[Product2[K, C]]] {
      val sorted = mergeSort(iterators, comparator).buffered
      ........
    }
  } else {
    // We have a total ordering, so the objects with the same key are sequential.
    new Iterator[Product2[K, C]] {
      val sorted = mergeSort(iterators, comparator).buffered
      ............................
    }
  }
}
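For intuition only, here is a minimal, self-contained sketch of the idea behind the totalOrder branch (this is not the elided Spark code): because the input is totally ordered, records with the same key are adjacent and their values can be folded together with mergeCombiners on the fly.

// Sketch: combine adjacent records that share a key in an already-sorted stream.
def combineAdjacent[K, C](sorted: Iterator[(K, C)],
                          mergeCombiners: (C, C) => C): Iterator[(K, C)] =
  new Iterator[(K, C)] {
    private val buffered = sorted.buffered
    override def hasNext: Boolean = buffered.hasNext
    override def next(): (K, C) = {
      val (key, first) = buffered.next()
      var combined = first
      // keep merging while the next record has the same key
      while (buffered.hasNext && buffered.head._1 == key) {
        combined = mergeCombiners(combined, buffered.next()._2)
      }
      (key, combined)
    }
  }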
17】、mergeSort
The concrete implementation that merge-sorts the given iterators into a single iterator according to comparator, using a priority queue.
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    // Use the reverse of comparator.compare because PriorityQueue dequeues the max
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}
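The same priority-queue pattern as a standalone usage sketch on plain (String, Int) pairs (mergeSort itself is private to ExternalSorter, so this is only an illustration):

import scala.collection.mutable

// Standalone illustration of the heap-based k-way merge on plain pairs.
def kWayMerge(runs: Seq[Iterator[(String, Int)]]): Iterator[(String, Int)] = {
  type Run = BufferedIterator[(String, Int)]
  // PriorityQueue dequeues the max, so reverse the key order to get the smallest head first.
  val heap = new mutable.PriorityQueue[Run]()(Ordering.by[Run, String](_.head._1).reverse)
  heap.enqueue(runs.filter(_.hasNext).map(_.buffered): _*)
  new Iterator[(String, Int)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (String, Int) = {
      val run = heap.dequeue()
      val pair = run.next()
      if (run.hasNext) heap.enqueue(run)   // put the run back if it still has elements
      pair
    }
  }
}

// kWayMerge(Seq(Iterator("a" -> 1, "c" -> 3), Iterator("b" -> 2, "d" -> 4))).toList
// => List((a,1), (b,2), (c,3), (d,4))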
------------------------------------------------------------------------------------------------------------------------------------------------
The shuffleMemoryManager is shared by all tasks (cores) currently running in an Executor; the memory it can hand out is ExecutorHeapMemory * 0.2 * 0.8.
Those two factors can be changed with the following settings:
spark.shuffle.memoryFraction=0.2
spark.shuffle.safetyFraction=0.8
When the PartitionedAppendOnlyMap no longer fits, its contents are first written into a buffer, and the buffer is then flushed to the spill file in one go. The buffer size is determined by fileBufferSize and can be changed with:
spark.shuffle.file.buffer=32k
Serialization and deserialization during data reads also need space, so Spark caps the number of records per batch via serializerBatchSize, which can be changed with:
spark.shuffle.spill.batchSize=10000
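As an illustration, these knobs could be set on a SparkConf like this (the values are simply the defaults quoted in this article; note that spark.shuffle.memoryFraction and spark.shuffle.safetyFraction only apply to the legacy static memory manager in newer Spark releases):

import org.apache.spark.SparkConf

// Example only: values shown are the defaults mentioned above.
val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "32k")                    // per-writer disk buffer
  .set("spark.shuffle.spill.batchSize", "10000")              // records per serialization batch
  .set("spark.shuffle.spill.initialMemoryThreshold", (5 * 1024 * 1024).toString)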
Assume an Executor has C usable cores; the corresponding memory consumption is then roughly:
C * 32 KB + C * 10000 records + C * PartitionedAppendOnlyMap
Seen this way, neither the write buffers nor the serialization batchSize is the problem; they only amount to tens or hundreds of thousands of records. So how large can C * PartitionedAppendOnlyMap get? The short answer first: C * PartitionedAppendOnlyMap < the memory the shuffleMemoryManager can allocate.
The size of a PartitionedAppendOnlyMap is obtained through map.estimateSize(), and since map.estimateSize() is only an approximation, OOM can still occur.
If you give the executor a large amount of memory, the risk is actually higher, because estimateSize does not really measure the collection on every call. It relies on sampling, and the sampling interval is not fixed but grows exponentially: after the first sample the PartitionedAppendOnlyMap has to go through 1.1x as many updates/inserts before the second sample, then 1.1 * 1.1x before the third, and so on. With a large memory budget the map may go through hundreds of thousands of updates before the next sample recomputes its size, and by then the extra memory pressure from those updates may already have overwhelmed GC.
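A rough worked example (numbers purely illustrative): with an 8 GB executor heap, the pool from the formula above is 8 GB * 0.2 * 0.8 ≈ 1.3 GB; with C = 4 running tasks that leaves on the order of 320 MB of PartitionedAppendOnlyMap per task, while the write buffers (4 * 32 KB) and serialization batches (4 * 10000 records) are negligible next to it.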

------------------------------------------------------------------------------------------------------------------------------------------------

Key call chain of ExternalSorter

Note: below, buffer/map both refer to the in-memory object collection.

insertAll【puts records into the map/buffer; note this does not consume Storage memory】
    ---changeValue (method on map)
        ---maybeSpillCollection
            ---maybeSpill
    ---insert (method on buffer)
        ---maybeSpillCollection
            ---maybeSpill
writePartitionedFile【merges the in-memory map/buffer and the spill files into a single file】
    ---destructiveSortedWritablePartitionedIterator (method on map)
        ---partitionedDestructiveSortedIterator (method on map)
            ---destructiveSortedIterator (method on map)
        ---WritablePartitionedIterator (method on map)
    ---partitionedIterator
        ---groupByPartition
            ---destructiveSortedIterator
        ---groupByPartition
            ---destructiveSortedIterator
        ---merge
            ---destructiveSortedIterator
            ---mergeWithAggregation
                ------mergeSort
            ---mergeSort
Reposted from blog.csdn.net/cclucc/article/details/79910996