Spark 2.3 RDD filter source code analysis

Spark filter source code:
/**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

Here context, pid, and iter stand for the TaskContext, the partition index, and the partition's iterator, respectively.
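Conceptually, filter is just a per-partition iterator transformation. The same behavior can be sketched with the public mapPartitionsWithIndex API, which exposes the same partition index and iterator that MapPartitionsRDD passes to the function above (a minimal sketch with the illustrative name filterLike, not the actual Spark implementation):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: filter expressed through mapPartitionsWithIndex, mirroring the
// (context, pid, iter) => iter.filter(cleanF) function handed to MapPartitionsRDD.
// preservesPartitioning = true because filtering never moves elements between partitions.
def filterLike[T: ClassTag](rdd: RDD[T])(p: T => Boolean): RDD[T] =
  rdd.mapPartitionsWithIndex(
    (pid, iter) => iter.filter(p),
    preservesPartitioning = true)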

Scala Iterator.filter source code:

  /** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
   *  The order of the elements is preserved.
   *
   *  @param p the predicate used to test values.
   *  @return  an iterator which produces those values of this iterator which satisfy the predicate `p`.
   *  @note    Reuse: $consumesAndProducesIterator
   */
  def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
    // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
    private var hd: A = _
    private var hdDefined: Boolean = false

    def hasNext: Boolean = hdDefined || {
      do {
        if (!self.hasNext) return false
        hd = self.next()
      } while (!p(hd))
      hdDefined = true
      true
    }

    def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
  }

The hasNext/next implementation above keeps only the elements that satisfy the predicate p, producing a new iterator (the order of the elements does not change) and discarding the rest. These per-partition iterators together form the new RDD.
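Note that Iterator.filter is lazy: the predicate runs only when an element is actually pulled through hasNext/next. A quick illustration in plain Scala (names are just for the example):

// Wrapping the source iterator does not evaluate anything yet.
val it = Iterator(1, 2, 3, 5, 8, 9).filter { x => println(s"testing $x"); x != 2 }
println(it.next())   // prints "testing 1", then 1 -- only the first element has been tested
println(it.toList)   // drains the rest: List(3, 5, 8, 9), original order preserved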

Example:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test extends App {

  val sparkConf = new SparkConf()
    .setAppName("Test")
    .setMaster("local[6]")

  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  // Parallelize six integers into 3 partitions, filter out 2,
  // and check the number of partitions of the resulting RDD.
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.filter(_ != 2).getNumPartitions)   // prints 3

}
The number of partitions is unchanged: the output is still 3.
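Because preservesPartitioning = true, filter also keeps an existing partitioner on a key-value RDD, so a later reduceByKey or join on the same keys can avoid a shuffle. A small sketch (assumes the same spark session as above; the names pairs and filtered are illustrative):

import org.apache.spark.HashPartitioner

// Hash-partition a pair RDD into 3 partitions, then filter it.
val pairs = spark.sparkContext
  .parallelize(List((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(3))
val filtered = pairs.filter { case (k, _) => k != 2 }
println(filtered.partitioner)   // Some(org.apache.spark.HashPartitioner@...): the parent's partitioner is retained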
