Some not-so-significant data structures in Spark

Spark version 2.4.0

CompactBuffer is an append-only object buffer that serves as a memory-optimized alternative to Scala's ArrayBuffer.

By default, a plain ArrayBuffer allocates a backing array of 16 slots, which is wasteful when the buffer only ever holds one or two elements.

private var element0: T = _
private var element1: T = _

// Number of elements, including our two in the main object
private var curSize = 0

// Array for extra elements
private var otherElements: Array[T] = null

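In CompactBuffer, while the buffer holds at most two elements, only the fields element0 and element1 are used, storing the first two elements. The otherElements array is allocated only once a third element is added, and it holds every element from the third onward. This saves memory for object buffers that stay small.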
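The layout described above can be sketched roughly as follows. This is a simplified illustration, not Spark's CompactBuffer itself: the class name SmallBuffer, the usesArray helper, and the initial array size of 8 are assumptions made for the example.

```scala
import scala.reflect.ClassTag

// Hypothetical simplified sketch of the idea; not Spark's CompactBuffer.
final class SmallBuffer[T: ClassTag] {
  // The first two elements live in plain fields, so no array is needed for them.
  private var element0: T = null.asInstanceOf[T]
  private var element1: T = null.asInstanceOf[T]

  // Number of elements, including the two stored in fields.
  private var curSize = 0

  // Allocated lazily, only when a third element arrives.
  private var otherElements: Array[T] = null

  def size: Int = curSize

  // Illustrative helper: reports whether the overflow array exists yet.
  def usesArray: Boolean = otherElements != null

  def +=(value: T): this.type = {
    if (curSize == 0) {
      element0 = value
    } else if (curSize == 1) {
      element1 = value
    } else {
      val idx = curSize - 2
      if (otherElements == null) {
        otherElements = new Array[T](8) // assumed initial capacity
      } else if (idx == otherElements.length) {
        // Double the overflow array when it is full.
        val grown = new Array[T](otherElements.length * 2)
        System.arraycopy(otherElements, 0, grown, 0, otherElements.length)
        otherElements = grown
      }
      otherElements(idx) = value
    }
    curSize += 1
    this
  }

  def apply(i: Int): T = {
    if (i < 0 || i >= curSize) throw new IndexOutOfBoundsException(i.toString)
    if (i == 0) element0
    else if (i == 1) element1
    else otherElements(i - 2)
  }
}
```

With this sketch, appending two values leaves usesArray false, and the array only appears once a third value is appended.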

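MedianHeap quickly yields the median of a group of numbers: inserting a number costs O(log n) and reading the median costs O(1).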

/**
 * Stores all the numbers less than the current median in a smallerHalf,
 * i.e median is the maximum, at the root.
 */
private[this] var smallerHalf = PriorityQueue.empty[Double](ord)

/**
 * Stores all the numbers greater than the current median in a largerHalf,
 * i.e median is the minimum, at the root.
 */
private[this] var largerHalf = PriorityQueue.empty[Double](ord.reverse)

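MedianHeap keeps two priority queues: smallerHalf holds the numbers below the current median with the largest at its head, and largerHalf holds the numbers above the median with the smallest at its head.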

def median: Double = {
  if (isEmpty) {
    throw new NoSuchElementException("MedianHeap is empty.")
  }
  if (largerHalf.size == smallerHalf.size) {
    (largerHalf.head + smallerHalf.head) / 2.0
  } else if (largerHalf.size > smallerHalf.size) {
    largerHalf.head
  } else {
    smallerHalf.head
  }
}

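When the two queues have the same size, the median is the average of the two heads, that is, the largest value in smallerHalf and the smallest value in largerHalf. Otherwise the median is the head of the queue that holds more elements.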

private[this] def rebalance(): Unit = {
  if (largerHalf.size - smallerHalf.size > 1) {
    smallerHalf.enqueue(largerHalf.dequeue())
  }
  if (smallerHalf.size - largerHalf.size > 1) {
    largerHalf.enqueue(smallerHalf.dequeue)
  }
}

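Because both halves are PriorityQueue instances, each one always exposes its extreme value at the head. The rebalance() method therefore only needs to move the head of the larger queue into the other queue whenever their sizes differ by more than 1. Since it runs after every insertion, a single move is enough to bring the size difference back to at most 1.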
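To see how the two queues, median, and rebalance() fit together, here is a minimal self-contained sketch. It is not Spark's MedianHeap: the class name MedianTracker is made up for the example, and the ordering is built with Ordering.fromLessThan as an assumption to keep the snippet portable across Scala versions.

```scala
import scala.collection.mutable.PriorityQueue

// Hypothetical simplified sketch; not Spark's MedianHeap.
final class MedianTracker {
  private val ord: Ordering[Double] = Ordering.fromLessThan[Double](_ < _)

  // Max-heap: the head is the largest value of the smaller half.
  private val smallerHalf = PriorityQueue.empty[Double](ord)

  // Min-heap: the head is the smallest value of the larger half.
  private val largerHalf = PriorityQueue.empty[Double](ord.reverse)

  def isEmpty: Boolean = smallerHalf.isEmpty && largerHalf.isEmpty

  def size: Int = smallerHalf.size + largerHalf.size

  def insert(x: Double): Unit = {
    // The first value goes to largerHalf; later values are routed by the current median.
    if (isEmpty || x > median) largerHalf.enqueue(x) else smallerHalf.enqueue(x)
    rebalance()
  }

  def median: Double = {
    if (isEmpty) throw new NoSuchElementException("MedianTracker is empty.")
    if (largerHalf.size == smallerHalf.size) {
      (largerHalf.head + smallerHalf.head) / 2.0
    } else if (largerHalf.size > smallerHalf.size) {
      largerHalf.head
    } else {
      smallerHalf.head
    }
  }

  // Move one head across whenever the sizes differ by more than 1.
  private def rebalance(): Unit = {
    if (largerHalf.size - smallerHalf.size > 1) {
      smallerHalf.enqueue(largerHalf.dequeue())
    }
    if (smallerHalf.size - largerHalf.size > 1) {
      largerHalf.enqueue(smallerHalf.dequeue())
    }
  }
}
```

Inserting 1, 2, 3, 4 in that order triggers two rebalances and leaves 1 and 2 in smallerHalf and 3 and 4 in largerHalf, so the median is the average of the heads 2 and 3.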


TimeStampedHashMap is a Map that stores a timestamp with every key-value pair, so that pairs that have not been touched for a long time can be removed.

override def += (kv: (A, B)): this.type = {
  kv match { case (a, b) => internalMap.put(a, TimeStampedValue(b, currentTime)) }
  this
}

private[spark] case class TimeStampedValue[V](value: V, timestamp: Long)

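The keys are ordinary. Each value, however, is stored as a TimeStampedValue case class that pairs the value with a timestamp, here the time of the put. To remove stale pairs, traverse the map once and drop every entry whose timestamp is earlier than the target time.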
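The single-pass eviction described above might look like the following sketch. It is an illustration under assumptions, not Spark's TimeStampedHashMap: the class name ExpiringMap, the injectable clock parameter, and the use of ConcurrentHashMap as the backing map are choices made for this example.

```scala
import java.util.concurrent.ConcurrentHashMap

// Mirrors the case class quoted above: a value paired with its timestamp.
final case class TimeStampedValue[V](value: V, timestamp: Long)

// Hypothetical simplified sketch; the clock parameter exists only to make the example testable.
final class ExpiringMap[K, V](clock: () => Long = () => System.currentTimeMillis()) {
  private val internalMap = new ConcurrentHashMap[K, TimeStampedValue[V]]()

  // Every write records the current time next to the value.
  def put(key: K, value: V): Unit = {
    internalMap.put(key, TimeStampedValue(value, clock()))
  }

  def get(key: K): Option[V] = Option(internalMap.get(key)).map(_.value)

  def size: Int = internalMap.size()

  // One traversal: drop every entry whose timestamp is earlier than threshTime.
  def clearOldValues(threshTime: Long): Unit = {
    val it = internalMap.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      if (entry.getValue.timestamp < threshTime) {
        it.remove()
      }
    }
  }
}
```

A caller would typically invoke clearOldValues periodically with a threshold such as the current time minus the allowed age.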

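BoundedPriorityQueue is a thin wrapper around PriorityQueue with a fixed maximum size. When the queue is full and a new element arrives, the element is compared with the smallest element in the queue and replaces it only if it is larger. The queue therefore always holds the largest few elements of everything that was added.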

private def maybeReplaceLowest(a: A): Boolean = {
  val head = underlying.peek()
  if (head != null && ord.gt(a, head)) {
    underlying.poll()
    underlying.offer(a)
  } else {
    false
  }
}
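For context, here is a minimal sketch of how such a bounded queue could be assembled around maybeReplaceLowest. It is not Spark's BoundedPriorityQueue: the class name TopK and the toSortedList helper are invented for the example, and it assumes a java.util.PriorityQueue as the underlying min-heap, which the peek/poll/offer calls above suggest.

```scala
import java.util.{PriorityQueue => JPriorityQueue}

// Hypothetical simplified sketch; not Spark's BoundedPriorityQueue.
final class TopK[A](maxSize: Int)(implicit ord: Ordering[A]) {
  require(maxSize > 0, "maxSize must be positive")

  // java.util.PriorityQueue is a min-heap, so peek() returns the smallest retained element.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  def size: Int = underlying.size()

  def +=(elem: A): this.type = {
    if (underlying.size() < maxSize) {
      underlying.offer(elem)
    } else {
      maybeReplaceLowest(elem)
    }
    this
  }

  // Replace the current minimum only when the new element is larger.
  private def maybeReplaceLowest(a: A): Boolean = {
    val head = underlying.peek()
    if (head != null && ord.gt(a, head)) {
      underlying.poll()
      underlying.offer(a)
    } else {
      false
    }
  }

  // Illustrative helper: the retained elements in ascending order.
  def toSortedList: List[A] = {
    var out = List.empty[A]
    val it = underlying.iterator()
    while (it.hasNext) {
      out = it.next() :: out
    }
    out.sorted(ord)
  }
}
```

Adding 5, 1, 9, 3, 7 to a TopK of size 3 evicts 1 and then 3, leaving 5, 7, and 9.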

tydhot
