distinct source code:
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}
This deduplication operator is quite intuitive: it is just a composition of map and reduceByKey. Each element is mapped to an (element, null) pair, reduceByKey collapses duplicate keys (keeping the first value), and a final map extracts the keys back out. distinct also takes an optional numPartitions parameter, the number of partitions you expect in the result RDD.
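To see the mechanism without a Spark cluster, here is a minimal sketch in plain Scala collections that mimics the same map / reduceByKey / map pipeline (groupBy stands in for the shuffle that reduceByKey performs; the helper name distinctViaReduceByKey is made up for illustration):

```scala
// Sketch of distinct built from map + reduceByKey, using plain collections.
def distinctViaReduceByKey[T](xs: Seq[T]): Seq[T] =
  xs.map(x => (x, null))                                  // map(x => (x, null))
    .groupBy(_._1)                                        // group by key, as the shuffle would
    .map { case (_, pairs) => pairs.reduce((a, b) => a) } // reduceByKey((x, y) => x)
    .map(_._1)                                            // map(_._1): keep only the key
    .toSeq

println(distinctViaReduceByKey(List(1, 2, 3, 5, 8, 9, 2, 3)).sorted)
```

The reduce function (x, y) => x deliberately discards one of the two colliding values; since both values are null and both keys are equal, it does not matter which one survives.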
example:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctTest extends App {
  val sparkConf = new SparkConf()
    .setAppName("DistinctTest")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.distinct(1).getNumPartitions)
}

The output is 1: although the source RDD has 3 partitions, the result's partition count is reset to the numPartitions argument.