Spark filter source code:
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
Here context, pid, and iter stand for the TaskContext, the partition index, and the partition's iterator, respectively.
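As a hedged sketch (not the actual Spark source), the same behavior can be expressed through the public mapPartitions API; myFilter is a hypothetical helper name used only for illustration:

// Hedged sketch: filter expressed via mapPartitions.
// myFilter is a hypothetical helper, not a Spark API.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def myFilter[T: ClassTag](rdd: RDD[T])(f: T => Boolean): RDD[T] =
  rdd.mapPartitions(iter => iter.filter(f), preservesPartitioning = true)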
Scala filter source code:
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
 *  The order of the elements is preserved.
 *
 *  @param p the predicate used to test values.
 *  @return an iterator which produces those values of this iterator which satisfy the predicate `p`.
 *  @note Reuse: $consumesAndProducesIterator
 */
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
  // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
  private var hd: A = _
  private var hdDefined: Boolean = false

  def hasNext: Boolean = hdDefined || {
    do {
      if (!self.hasNext) return false
      hd = self.next()
    } while (!p(hd))
    hdDefined = true
    true
  }

  def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}
The hasNext/next pair above pulls elements from the underlying iterator one at a time, buffers the first element that satisfies p in hd (so element order is unchanged), and discards the ones that do not. These filtered per-partition iterators then make up the new RDD.
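The laziness is easy to observe in plain Scala, with no Spark needed; the println inside the predicate is only there to show when p runs:

// Iterator.filter is lazy: the predicate runs only when elements are pulled.
val it = Iterator(1, 2, 3, 5, 8, 9).filter { x =>
  println(s"testing $x")  // side effect to see when p is applied
  x != 2
}
// Nothing has been printed yet; hasNext triggers the hd/hdDefined lookahead.
println(it.hasNext)  // prints "testing 1", then true
println(it.next())   // 1 (already buffered in hd, not re-tested)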
Example:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test extends App {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[6]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.filter(_ != 2).getNumPartitions)  // prints 3
}

The number of partitions is unchanged: the RDD was created with 3 partitions, and the filtered RDD still reports 3.
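Because the MapPartitionsRDD is built with preservesPartitioning = true, a pair RDD's partitioner also survives a filter. A hedged sketch under the same SparkSession (the keys and HashPartitioner below are illustrative):

// Hedged sketch: filter keeps the parent's partitioner, so a later
// reduceByKey/join on the result can avoid a shuffle.
import org.apache.spark.HashPartitioner

val pairs = spark.sparkContext
  .parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
  .partitionBy(new HashPartitioner(3))
val kept = pairs.filter { case (k, _) => k != 2 }
println(pairs.partitioner == kept.partitioner)  // true: the HashPartitioner is kept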