Spark 2.3 RDD: filter Source Code Walkthrough

Spark's filter source:
 /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

Here context, pid, and iter stand for the TaskContext, the partition index, and the iterator over that partition's elements.
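
Since the per-partition function just calls iter.filter(cleanF) and partitioning is preserved, filter is effectively a special case of mapPartitions. A minimal sketch of that equivalence through the public API (the helper name is made up for illustration; the real filter additionally closure-cleans f via sc.clean):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark: the same behavior as rdd.filter(f),
// expressed through the public mapPartitions API.
def filterViaMapPartitions[T: ClassTag](rdd: RDD[T], f: T => Boolean): RDD[T] =
  rdd.mapPartitions(iter => iter.filter(f), preservesPartitioning = true)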

Scala's Iterator.filter source:

  /** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
   *  The order of the elements is preserved.
   *
   *  @param p the predicate used to test values.
   *  @return  an iterator which produces those values of this iterator which satisfy the predicate `p`.
   *  @note    Reuse: $consumesAndProducesIterator
   */
  def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
    // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
    private var hd: A = _
    private var hdDefined: Boolean = false

    def hasNext: Boolean = hdDefined || {
      do {
        if (!self.hasNext) return false
        hd = self.next()
      } while (!p(hd))
      hdDefined = true
      true
    }

    def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
  }

The lookahead loop above simply pulls the elements that satisfy the predicate p into a new iterator (the order of the elements is unchanged) and discards the ones that do not. These per-partition iterators then make up the new RDD.
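
Note that the filtered iterator is lazy: elements are pulled from the underlying iterator only when hasNext/next() is called, with hd holding a single lookahead element. A small plain-Scala sketch (illustrative, not from the original post):

val source = Iterator(1, 2, 3, 5, 8, 9).map { x =>
  println(s"pulling $x") // side effect to show when elements are actually consumed
  x
}
val odds = source.filter(_ % 2 == 1)

println(odds.next()) // prints "pulling 1", then 1
println(odds.next()) // prints "pulling 2" and "pulling 3", then 3 (2 is discarded)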

Example:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test extends App {

  val sparkConf = new SparkConf()
    .setAppName("Test")
    .setMaster("local[6]")

  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  // 6 elements spread across 3 partitions
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  // filter preserves partitioning, so this prints 3
  println(value.filter(_ != 2).getNumPartitions)

  spark.stop()
}
The partitioning is not changed: the program still prints 3.
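
To see that elements also stay in their original partitions, glom() can be used to inspect each partition's contents. A sketch that could be added to the Test object above (before spark.stop()):

// glom() turns each partition into an array of its elements.
// With 6 elements in 3 partitions, parallelize lays them out as
// [1,2] [3,5] [8,9]; after filtering out 2 this becomes [1] [3,5] [8,9].
println(value.glom().collect().map(_.mkString("[", ",", "]")).mkString(" "))
println(value.filter(_ != 2).glom().collect().map(_.mkString("[", ",", "]")).mkString(" "))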


Reposted from blog.csdn.net/dpnice/article/details/80094747