A look at the `distinct` source code of the Spark 2.3 RDD

distinct source code:

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}

This deduplication operator is fairly intuitive: under the hood it is just a combination of map and reduceByKey. Each element is mapped to an (element, null) pair, reduceByKey collapses duplicate keys by keeping a single value per key, and a final map extracts the keys. distinct takes an optional numPartitions parameter, which is the number of partitions you expect in the result RDD.
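The same (x, null) pairing trick can be sketched with plain Scala collections, no Spark required. This is only an illustrative simulation: `groupMapReduce` (Scala 2.13+) stands in for reduceByKey's shuffle-and-combine step, and the object and method names here are made up for the example.

```scala
// A minimal sketch of the distinct pipeline using plain Scala collections.
// groupMapReduce plays the role of reduceByKey; names are illustrative only.
object DistinctSketch {
  def distinctLike[T](elems: Seq[T]): Seq[T] =
    elems
      .map(x => (x, null))                      // step 1: pair every element with null
      .groupMapReduce(_._1)(_._2)((x, _) => x)  // step 2: keep one value per key, like reduceByKey((x, y) => x)
      .keys                                     // step 3: extract the now-distinct elements
      .toList

  def main(args: Array[String]): Unit = {
    println(distinctLike(Seq(1, 2, 2, 3, 3, 3)).sorted)  // prints List(1, 2, 3)
  }
}
```

Because reduceByKey only ever sees one value per key after the combine, which value survives is irrelevant; the keys themselves carry the deduplicated data.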

example:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctTest extends App {

  val sparkConf = new SparkConf()
    .setAppName("TreeAggregateTest")
    .setMaster("local[6]")

  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.distinct(1).getNumPartitions)
}
Running this prints 1: the result RDD's partition count has been reset to 1, even though the source RDD was created with 3 partitions.
