Source of `distinct`:
```scala
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}
```
This deduplication operator is fairly straightforward: it is just `map` and `reduceByKey` wrapped together. Each element is mapped to a key-value pair `(x, null)`, `reduceByKey` keeps a single value per key, and the final `map(_._1)` drops the dummy value. `distinct` also takes an optional `numPartitions` parameter, which is the number of partitions you want the result to have; the no-argument overload simply reuses the parent RDD's partition count.
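To make the `map` → `reduceByKey` → `map` pipeline concrete without a Spark cluster, here is a minimal sketch of the same logic on a plain Scala `List`. The `reduceByKey` helper below is hypothetical, introduced only to mirror the RDD method on a local collection:

```scala
object DistinctSketch {
  // Hypothetical local stand-in for RDD.reduceByKey: group pairs by key
  // and fold each key's values with the given function.
  def reduceByKey[K, V](pairs: List[(K, V)])(f: (V, V) => V): List[(K, V)] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(f)) }.toList

  // Same shape as distinct's implementation:
  // pair each element with null, keep one value per key, drop the value.
  def distinctLike[T](xs: List[T]): List[T] =
    reduceByKey(xs.map(x => (x, null)))((x, _) => x).map(_._1)

  def main(args: Array[String]): Unit = {
    println(distinctLike(List(1, 2, 2, 3, 3, 3)))
  }
}
```

The key insight is that deduplication is just a degenerate aggregation: the reduce function `(x, y) => x` ignores every value after the first, so only one pair per distinct key survives the shuffle.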
Example:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctTest extends App {
  val sparkConf = new SparkConf()
    .setAppName("DistinctTest")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  // Parallelize into 3 partitions, then deduplicate down to 1.
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.distinct(1).getNumPartitions)
}
```

Although the source RDD was created with 3 partitions, the result's partition count is reset to 1: `numPartitions` is passed straight through to `reduceByKey`, so the shuffle it triggers produces exactly that many partitions.