Spark source code analysis: ExternalSorter

SortShuffleWriter calls two ExternalSorter methods: insertAll and writePartitionedFile.

1】、blockManager
2】、diskBlockManager
3】、serializerManager
4】、fileBufferSize
spark.shuffle.file.buffer=32k
5】、serializerBatchSize
spark.shuffle.spill.batchSize=10000
6】、map(PartitionedAppendOnlyMap)
private var data = new Array[AnyRef](2 * capacity)
Keys and values sit interleaved in this plain on-heap array, i.e. the memory consumed is not Storage memory (a minimal sketch follows after item 9】).
7】、buffer(PartitionedPairBuffer)
8】、forceSpillFiles(ArrayBuffer[SpilledFile])
When the PartitionedAppendOnlyMap no longer fits and has to be spilled, records are not written to disk one by one; they first go through a write buffer, which is then flushed to the spill file in one go. The size of that buffer is determined by fileBufferSize.
9】、spills(ArrayBuffer[SpilledFile])
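A minimal sketch of the idea behind that flat array (class name and details are hypothetical and simplified; the real PartitionedAppendOnlyMap uses open addressing with quadratic probing, automatic growth/rehashing and a (partitionId, key) key type). The point is that keys and values are ordinary JVM heap objects, not blocks in the Storage pool:

class TinyFlatMap[K <: AnyRef, V <: AnyRef](capacity: Int = 64) {
  // key at data(2 * pos), value at data(2 * pos + 1)
  private val data = new Array[AnyRef](2 * capacity)

  def update(key: K, value: V): Unit = {
    var pos = (key.hashCode & Int.MaxValue) % capacity
    var probes = 0
    while (probes < capacity) {
      val cur = data(2 * pos)
      if (cur == null || cur == key) {          // empty slot or same key: write here
        data(2 * pos) = key
        data(2 * pos + 1) = value
        return
      }
      pos = (pos + 1) % capacity                // linear probing in this sketch
      probes += 1
    }
    throw new IllegalStateException("sketch is full (the real map grows and rehashes)")
  }

  def get(key: K): Option[V] = {
    var pos = (key.hashCode & Int.MaxValue) % capacity
    var probes = 0
    while (probes < capacity) {
      val cur = data(2 * pos)
      if (cur == null) return None
      if (cur == key) return Some(data(2 * pos + 1).asInstanceOf[V])
      pos = (pos + 1) % capacity
      probes += 1
    }
    None
  }
}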
10】、insertAll
if (shouldCombine) {
  // Combine values in-memory first using our AppendOnlyMap
  val mergeValue = aggregator.get.mergeValue
  val createCombiner = aggregator.get.createCombiner
  var kv: Product2[K, V] = null
  val update = (hadValue: Boolean, oldValue: C) => {
    if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
  }
  while (records.hasNext) {
    addElementsRead()
    kv = records.next()
    map.changeValue((getPartition(kv._1), kv._1), update)
    maybeSpillCollection(usingMap = true)
  }
} else {
  // Stick values into our buffer
  while (records.hasNext) {
    addElementsRead()
    val kv = records.next()
    buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
    maybeSpillCollection(usingMap = false)
  }
}
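For intuition about the update closure above, here is what the aggregator's two functions might look like for a simple sum (purely illustrative values and types, e.g. what reduceByKey(_ + _) would produce):

// Illustrative only: a sum-style aggregator.
val createCombiner: Int => Int = v => v              // first value seen for a key becomes the combiner
val mergeValue: (Int, Int) => Int = (c, v) => c + v  // later values are merged into the combiner

var kv: (String, Int) = ("a", 3)
val update = (hadValue: Boolean, oldValue: Int) =>
  if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)

update(false, 0)   // key "a" not seen before -> createCombiner(3) = 3
update(true, 3)    // key "a" already maps to 3 -> mergeValue(3, 3) = 6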

override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  val newValue = super.changeValue(key, updateFunc)
  super.afterUpdate()
  newValue
}
After every update of the collection this trait is mixed into, SizeTracker's afterUpdate method is called. afterUpdate checks how many updates have happened and, when needed, uses SizeEstimator's estimate method to estimate the collection's size. Since a call to SizeEstimator is expensive (the comment says several milliseconds), it cannot be invoked frequently. So SizeTracker counts updates and samples at exponentially growing intervals with base 1.1: estimate is called when the update count reaches 1.1, 1.1 * 1.1, 1.1 * 1.1 * 1.1, ...
The exponential curve starts out slowly: 1.1^96 is roughly 10,000 and 1.1^144 roughly 1,000,000. So for 10,000 updates it performs about 96 estimates, for 100,000 updates about 120, for 1,000,000 about 144, and for 10,000,000 about 169.
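A rough sketch of that sampling schedule (class and field names here are illustrative, not the real SizeTracker): every update bumps a counter, and a new expensive estimate is only taken when the counter crosses the next threshold, which grows by a factor of 1.1.

// Illustrative sketch of exponential size sampling.
// estimateOnce() stands in for the expensive SizeEstimator.estimate call.
class SamplingSizeTracker(estimateOnce: () => Long) {
  private val growthRate = 1.1
  private var numUpdates = 0L
  private var nextSampleNum = 1L
  private var lastSample = 0L
  private var bytesPerUpdate = 0.0
  private var updatesAtLastSample = 0L

  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleNum) {
      val newSample = estimateOnce()                               // expensive: a few ms
      val deltaUpdates = math.max(1L, numUpdates - updatesAtLastSample)
      bytesPerUpdate = math.max(0.0, (newSample - lastSample).toDouble / deltaUpdates)
      lastSample = newSample
      updatesAtLastSample = numUpdates
      nextSampleNum = math.ceil(numUpdates * growthRate).toLong    // next sample ~1.1x later
    }
  }

  // Between samples, extrapolate from the last measured size.
  def estimateSize(): Long =
    lastSample + (bytesPerUpdate * (numUpdates - updatesAtLastSample)).toLong
}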

11】、maybeSpillCollection

After every record is inserted, it checks whether the in-memory collection (PartitionedAppendOnlyMap or PartitionedPairBuffer) needs to spill.

var estimatedSize = 0L
if (usingMap) {
  estimatedSize = map.estimateSize()
  if (maybeSpill(map, estimatedSize)) {
    map = new PartitionedAppendOnlyMap[K, C]
  }
} else {
  estimatedSize = buffer.estimateSize()
  if (maybeSpill(buffer, estimatedSize)) {
    buffer = new PartitionedPairBuffer[K, C]
  }
}
if (estimatedSize > _peakMemoryUsedBytes) {
  _peakMemoryUsedBytes = estimatedSize
}
12】、maybeSpill
If the PartitionedAppendOnlyMap's memory footprint were measured precisely after every record, at say 1 ms per check, 10 million records would already cost an unacceptable amount of time. That is why estimateSize is based on sampling (see above).
(1) Only on every 32nd inserted element, and only if currentMemory (the estimatedSize) has reached myMemoryThreshold;
(2) if (1) holds, request 2 * currentMemory - myMemoryThreshold more memory from the shuffleMemoryManager;
(3) spill if the memory granted in (2) is still not enough (currentMemory is still >= myMemoryThreshold), or if the number of records held in memory exceeds numElementsForceSpillThreshold.
Notes:
currentMemory comes from the map's estimateSize
myMemoryThreshold is configured via spark.shuffle.spill.initialMemoryThreshold, default 5 * 1024 * 1024
the shuffleMemoryManager can hand out at most ExecutorHeapMemory * 0.2 * 0.8
numElementsForceSpillThreshold is configured via spark.shuffle.spill.numElementsForceSpillThreshold, default Long.MaxValue
var shouldSpill = false
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
  val amountToRequest = 2 * currentMemory - myMemoryThreshold
  val granted = acquireMemory(amountToRequest)
  myMemoryThreshold += granted
  shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
if (shouldSpill) {
  _spillCount += 1
  logSpillage(currentMemory)
  spill(collection)
  _elementsRead = 0
  _memoryBytesSpilled += currentMemory
  releaseMemory()
}
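A worked example with made-up numbers: with the default myMemoryThreshold of 5 MB, suppose the 32nd insert pushes the estimated size to 6 MB. Then amountToRequest = 2 * 6 - 5 = 7 MB. If the memory manager grants all 7 MB, the threshold rises to 12 MB and no spill happens; if it grants only, say, 0.5 MB, the threshold becomes 5.5 MB, currentMemory (6 MB) is still above it, and the collection is spilled to disk.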
13】、writePartitionedFile

Writes the in-memory data (the map or buffer) together with the spill files to the real output file.

val lengths = new Array[Long](numPartitions)
val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
  context.taskMetrics().shuffleWriteMetrics)
if (spills.isEmpty) {
  // Case where we only have in-memory data
  val collection = if (aggregator.isDefined) map else buffer
  val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
  while (it.hasNext) {
    val partitionId = it.nextPartition()
    while (it.hasNext && it.nextPartition() == partitionId) {
      it.writeNext(writer)
    }
    val segment = writer.commitAndGet()
    lengths(partitionId) = segment.length
  }
} else {
  // We must perform merge-sort; get an iterator by partition and write everything directly.
  for ((id, elements) <- this.partitionedIterator) {
    if (elements.hasNext) {
      for (elem <- elements) {
        writer.write(elem._1, elem._2)
      }
      val segment = writer.commitAndGet()
      lengths(id) = segment.length
    }
  }
}
destructiveSortedWritablePartitionedIterator calls partitionedDestructiveSortedIterator to sort the map. First the comparator is built: if a key comparator is supplied, entries are sorted by partition ID and then by key; otherwise they are sorted by partition ID only. It then calls destructiveSortedIterator, which is the actual sorter: using the given comparator it sorts the map in place, trading away the map's normal structure so that no extra space is needed.
val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
destructiveSortedIterator(comparator)

It returns a WritablePartitionedIterator, whose elements can be written out with a BlockObjectWriter.
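The two comparators mentioned above order (partitionId, key) pairs. Roughly (a simplified sketch, not copied from the Spark source; signatures may differ between versions), partitionComparator sorts by partition ID only, while partitionKeyComparator falls back to the key comparator when the partition IDs are equal:

import java.util.Comparator

// Simplified sketch of the two comparators over (partitionId, key) pairs.
def partitionComparator[K]: Comparator[(Int, K)] = new Comparator[(Int, K)] {
  override def compare(a: (Int, K), b: (Int, K)): Int =
    Integer.compare(a._1, b._1)                 // sort by partition ID only
}

def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition
      else keyComparator.compare(a._2, b._2)    // same partition: fall back to key order
    }
  }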

14】、partitionedIterator
partitionedIterator returns one iterator per partition ID.
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    if (!ordering.isDefined) {
      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
    } else {
      // We do need to sort by both partition ID and key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // Merge spilled and in-memory data
    merge(spills, destructiveIterator(collection.partitionedDestructiveSortedIterator(comparator)))
  }
}
groupByPartition:
  (0 until numPartitions).iterator.map(p => (p, new IteratorForPartition(p, buffered)))
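IteratorForPartition is conceptually simple. A sketch of the idea (hypothetical class name with a Sketch suffix, simplified from the behaviour described here): all partition iterators share the same buffered, partition-sorted source, and each one stops as soon as the head element belongs to a different partition.

class IteratorForPartitionSketch[K, C](
    partitionId: Int,
    data: BufferedIterator[((Int, K), C)])
  extends Iterator[Product2[K, C]] {

  // Only yield elements while the head of the shared iterator still belongs to this partition.
  override def hasNext: Boolean = data.hasNext && data.head._1._1 == partitionId

  override def next(): Product2[K, C] = {
    if (!hasNext) throw new NoSuchElementException
    val ((_, k), c) = data.next()
    (k, c)
  }
}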
15】、merge
Reads the in-memory data (the map) and the data in the spill files partition by partition, merges them, and re-sorts them with the comparator (a heap-based merge sort), returning one iterator per partition ID.

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
16】、mergeWithAggregation
Depending on whether the input iterators carry a total ordering, returns a combined iterator for the partition. If the ordering is only partial (e.g. a hash-based comparator), several distinct keys may compare as equal, so combining at next() requires extra work: all keys the comparator considers equal must be read and compared. If the ordering is total, records with the same key are adjacent, so the next value with the same key can simply be combined directly.
private def mergeWithAggregation(
    iterators: Seq[Iterator[Product2[K, C]]],
    mergeCombiners: (C, C) => C,
    comparator: Comparator[K],
    totalOrder: Boolean)
    : Iterator[Product2[K, C]] =
{
  if (!totalOrder) {
    // We only have a partial ordering, e.g. comparing the keys by hash code, which means that
    // multiple distinct keys might be treated as equal by the ordering. To deal with this, we
    // need to read all keys considered equal by the ordering at once and compare them.
    new Iterator[Iterator[Product2[K, C]]] {
      val sorted = mergeSort(iterators, comparator).buffered
      ........
    }
  } else {
    // We have a total ordering, so the objects with the same key are sequential.
    new Iterator[Product2[K, C]] {
      val sorted = mergeSort(iterators, comparator).buffered
      ............................
    }
  }
}
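For intuition only, here is a minimal, self-contained sketch of the idea behind the totalOrder branch (this is not the elided Spark code): because the input is totally ordered, records with the same key are adjacent and their values can be folded together with mergeCombiners on the fly.

// Sketch: combine adjacent records that share a key in an already-sorted stream.
def combineAdjacent[K, C](sorted: Iterator[(K, C)],
                          mergeCombiners: (C, C) => C): Iterator[(K, C)] =
  new Iterator[(K, C)] {
    private val buffered = sorted.buffered
    override def hasNext: Boolean = buffered.hasNext
    override def next(): (K, C) = {
      val (key, first) = buffered.next()
      var combined = first
      // keep merging while the next record has the same key
      while (buffered.hasNext && buffered.head._1 == key) {
        combined = mergeCombiners(combined, buffered.next()._2)
      }
      (key, combined)
    }
  }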
17】、mergeSort
The concrete implementation that merge-sorts the given iterators into a single iterator according to comparator, using a priority queue.
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    // Use the reverse of comparator.compare because PriorityQueue dequeues the max
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}
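The same priority-queue pattern as a standalone usage sketch on plain (String, Int) pairs (mergeSort itself is private to ExternalSorter, so this is only an illustration):

import scala.collection.mutable

// Standalone illustration of the heap-based k-way merge on plain pairs.
def kWayMerge(runs: Seq[Iterator[(String, Int)]]): Iterator[(String, Int)] = {
  type Run = BufferedIterator[(String, Int)]
  // PriorityQueue dequeues the max, so reverse the key order to get the smallest head first.
  val heap = new mutable.PriorityQueue[Run]()(Ordering.by[Run, String](_.head._1).reverse)
  heap.enqueue(runs.filter(_.hasNext).map(_.buffered): _*)
  new Iterator[(String, Int)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (String, Int) = {
      val run = heap.dequeue()
      val pair = run.next()
      if (run.hasNext) heap.enqueue(run)   // put the run back if it still has elements
      pair
    }
  }
}

// kWayMerge(Seq(Iterator("a" -> 1, "c" -> 3), Iterator("b" -> 2, "d" -> 4))).toList
// => List((a,1), (b,2), (c,3), (d,4))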
------------------------------------------------------------------------------------------------------------------------------------------------
The shuffleMemoryManager is shared by all tasks (cores) currently running in an Executor; the memory it can hand out is ExecutorHeapMemory * 0.2 * 0.8.
Those two factors can be changed with the following settings:
spark.shuffle.memoryFraction=0.2
spark.shuffle.safetyFraction=0.8
When the PartitionedAppendOnlyMap no longer fits, its contents are first written into a buffer, and the buffer is then flushed to the spill file in one go. The buffer size is determined by fileBufferSize and can be changed with:
spark.shuffle.file.buffer=32k
Serialization and deserialization during data reads also need space, so Spark caps the number of records per batch via serializerBatchSize, which can be changed with:
spark.shuffle.spill.batchSize=10000
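As an illustration, these knobs could be set on a SparkConf like this (the values are simply the defaults quoted in this article; note that spark.shuffle.memoryFraction and spark.shuffle.safetyFraction only apply to the legacy static memory manager in newer Spark releases):

import org.apache.spark.SparkConf

// Example only: values shown are the defaults mentioned above.
val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "32k")                    // per-writer disk buffer
  .set("spark.shuffle.spill.batchSize", "10000")              // records per serialization batch
  .set("spark.shuffle.spill.initialMemoryThreshold", (5 * 1024 * 1024).toString)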
Assume an Executor has C usable cores; the corresponding memory consumption is then roughly:
C * 32 KB + C * 10000 records + C * PartitionedAppendOnlyMap
Seen this way, neither the write buffers nor the serialization batchSize is the problem; they only amount to tens or hundreds of thousands of records. So how large can C * PartitionedAppendOnlyMap get? The short answer first: C * PartitionedAppendOnlyMap < the memory the shuffleMemoryManager can allocate.
The size of a PartitionedAppendOnlyMap is obtained through map.estimateSize(), and since map.estimateSize() is only an approximation, OOM can still occur.
If you give the executor a large amount of memory, the risk is actually higher, because estimateSize does not really measure the collection on every call. It relies on sampling, and the sampling interval is not fixed but grows exponentially: after the first sample the PartitionedAppendOnlyMap has to go through 1.1x as many updates/inserts before the second sample, then 1.1 * 1.1x before the third, and so on. With a large memory budget the map may go through hundreds of thousands of updates before the next sample recomputes its size, and by then the extra memory pressure from those updates may already have overwhelmed GC.
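A rough worked example (numbers purely illustrative): with an 8 GB executor heap, the pool from the formula above is 8 GB * 0.2 * 0.8 ≈ 1.3 GB; with C = 4 running tasks that leaves on the order of 320 MB of PartitionedAppendOnlyMap per task, while the write buffers (4 * 32 KB) and serialization batches (4 * 10000 records) are negligible next to it.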

------------------------------------------------------------------------------------------------------------------------------------------------

Key call chain of ExternalSorter

Note: below, buffer/map both refer to the in-memory object collection.

insertAll【puts records into the map/buffer; note this does not consume Storage memory】
    ---changeValue (method on map)
        ---maybeSpillCollection
            ---maybeSpill
    ---insert (method on buffer)
        ---maybeSpillCollection
            ---maybeSpill
writePartitionedFile【merges the in-memory map/buffer and the spill files into a single file】
    ---destructiveSortedWritablePartitionedIterator (method on map)
        ---partitionedDestructiveSortedIterator (method on map)
            ---destructiveSortedIterator (method on map)
        ---WritablePartitionedIterator (method on map)
    ---partitionedIterator
        ---groupByPartition
            ---destructiveSortedIterator
        ---groupByPartition
            ---destructiveSortedIterator
        ---merge
            ---destructiveSortedIterator
            ---mergeWithAggregation
                ------mergeSort
            ---mergeSort
Reposted from blog.csdn.net/cclucc/article/details/79910996