A look at the Spark RDD repartition/coalesce source code

repartition:

  /**
    * Return a new RDD that has exactly numPartitions partitions.
    *
    * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
    * a shuffle to redistribute data.
    *
    * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
    * which can avoid performing a shuffle.
    *
    * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
    */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
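
repartition simply delegates to coalesce with shuffle = true, so it can both grow and shrink the partition count, while coalesce without shuffle can only shrink it. The following is a minimal sketch of the difference, not taken from the original post; it assumes a running SparkContext named sc, and the variable names are illustrative:

// Sketch: comparing repartition and coalesce, assuming an existing SparkContext `sc`
val rdd = sc.parallelize(1 to 100, 4)

// repartition always shuffles, so it can grow or shrink the partition count
println(rdd.repartition(8).getNumPartitions)   // 8
println(rdd.repartition(2).getNumPartitions)   // 2

// coalesce without shuffle can only shrink; asking for more partitions
// than currently exist leaves the count unchanged
println(rdd.coalesce(2).getNumPartitions)                  // 2
println(rdd.coalesce(8).getNumPartitions)                  // still 4
println(rdd.coalesce(8, shuffle = true).getNumPartitions)  // 8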

coalesce:

  /**
    * Return a new RDD that is reduced into `numPartitions` partitions.
    *
    * This results in a narrow dependency, e.g. if you go from 1000 partitions
    * to 100 partitions, there will not be a shuffle, instead each of the 100
    * new partitions will claim 10 of the current partitions. If a larger number
    * of partitions is requested, it will stay at the current number of partitions.
    *
    * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
    * this may result in your computation taking place on fewer nodes than
    * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
    * you can pass shuffle = true. This will add a shuffle step, but means the
    * current upstream partitions will be executed in parallel (per whatever
    * the current partitioning is).
    *
    * My understanding: when many partitions are coalesced into a single partition, setting
    * shuffle = true avoids having the one executor that computes that partition pull all of
    * the parent RDD's data over Netty by itself. With shuffle = true, the tasks that compute
    * the parent RDD's partitions still run in parallel on their own executors and shuffle-write
    * their data, which the single downstream partition then reads.
    *
    * @note With shuffle = true, you can actually coalesce to a larger number
    *       of partitions. This is useful if you have a small number of partitions,
    *       say 100, potentially with a few partitions being abnormally large. Calling
    *       coalesce(1000, shuffle = true) will result in 1000 partitions with the
    *       data distributed using a hash partitioner. The optional partition coalescer
    *       passed in must be serializable.
    *
    * In other words, abnormally large partitions can be spread over more partitions, with the data redistributed by a hash partitioner.
    */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
  : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position: Int = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      }: Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      // i.e. the computation of the upstream partitions still runs in parallel
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
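
When shuffle = true, each input partition tags its elements with a round-robin integer key starting from a random position, and the HashPartitioner then spreads the keyed records over numPartitions output partitions. A small sketch of the effect, not from the original post, assuming an existing SparkContext named sc:

// Sketch: growing a single skewed partition with shuffle = true, assuming `sc`
val skewed = sc.parallelize(1 to 12, 1)          // all data in one partition

val spread = skewed.coalesce(3, shuffle = true)  // grow to 3 partitions via a shuffle
// glom() turns each partition into an array so the distribution is visible
spread.glom().collect().foreach(p => println(p.mkString(", ")))
// Expect roughly 4 elements per partition; the exact split depends on the
// random starting position chosen for each input partition.
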
CoalescedRDD:
/**
  * Represents a coalesced RDD that has fewer partitions than its parent RDD
  * This class uses the PartitionCoalescer class to find a good partitioning of the parent RDD
  * so that each new partition has roughly the same number of parent partitions and that
  * the preferred location of each new partition overlaps with as many preferred locations of its
  * parent partitions
  * Author's note: the coalescer tries to pack co-located parent partitions into the same new
  * partition. For example, if 10 parent partitions are shrunk into one and parent partitions
  * 1-6 live on the same executor, the new partition's preferred location will be that executor,
  * and the data of parent partitions 7-10 will be fetched over Netty to that location. Since
  * nothing is written to disk along the way, the disk I/O of a shuffle is avoided, which is
  * why this path is still much faster than shuffling.
  *
  * @param prev               RDD to be coalesced
  * @param maxPartitions      number of desired partitions in the coalesced RDD (must be positive)
  * @param partitionCoalescer [[PartitionCoalescer]] implementation to use for coalescing
  */
private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies

  require(maxPartitions > 0 || maxPartitions == prev.partitions.length,
    s"Number of partitions ($maxPartitions) must be positive.")
  if (partitionCoalescer.isDefined) {
    require(partitionCoalescer.get.isInstanceOf[Serializable],
      "The partition coalescer passed in must be serializable.")
  }

  override def getPartitions: Array[Partition] = {
    val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())

    pc.coalesce(maxPartitions, prev).zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.partitions.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // flatMap over the parent partitions' iterators, stitching their elements together into this single coalesced partition
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

  override def getDependencies: Seq[Dependency[_]] = {
    Seq(new NarrowDependency(prev) {
      def getParents(id: Int): Seq[Int] =
        partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
    })
  }

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
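
Because the non-shuffle path builds a CoalescedRDD on top of a NarrowDependency, no extra shuffle stage appears in the lineage. One way to confirm this is to compare the debug strings; a sketch, not from the original post, assuming an existing SparkContext named sc:

// Sketch: comparing lineages, assuming an existing SparkContext `sc`
val rdd = sc.parallelize(1 to 100, 10)

// Non-shuffle coalesce: a CoalescedRDD with a narrow dependency,
// so the debug string shows no additional shuffle stage
println(rdd.coalesce(2).toDebugString)

// repartition (= coalesce with shuffle = true): a ShuffledRDD appears
// in the lineage, i.e. an extra stage boundary
println(rdd.repartition(2).toDebugString)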

In practical applications, coalesce can be used flexibly according to your needs:

If the parallelism is insufficient, you can increase the number of partitions (repartition, or coalesce with shuffle = true);

When there are too many small tasks, you can shrink and merge partitions to reduce the partition count and the task-scheduling overhead; because this avoids a shuffle, it is much more efficient than RDD.repartition;

If a partition's data is skewed, repartition can also be used: increasing the number of partitions redistributes the data and alleviates the skew. A sketch of these three patterns follows below.
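
A rough sketch of the three patterns, not from the original post, assuming an existing RDD named rdd (the partition counts are illustrative only):

// 1. Not enough parallelism: grow the partition count (this shuffles)
val wider = rdd.repartition(200)

// 2. Too many small tasks: merge partitions without a shuffle
val narrower = rdd.coalesce(20)

// 3. Skewed partitions: shuffle to more partitions so the hash partitioner
//    spreads the data more evenly
val rebalanced = rdd.repartition(400)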

example:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceTest extends App {

  val sparkConf = new SparkConf()
    .setAppName("CoalesceTest")
    .setMaster("local[6]")

  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val value: RDD[Int] = spark.sparkContext.parallelize(List(9, 2, 3, 5, 8, 1), 3)

  // A custom Ordering, passed explicitly to fill coalesce's implicit `ord` parameter
  val s = new Ordering[Int] {
    override def compare(x: Int, y: Int): Int = x.compareTo(y)
  }
  val coalesceValue: RDD[Int] = value.coalesce(2)(s)

  coalesceValue.foreach(println(_))

}
coalesce has an optional implicit parameter (implicit ord: Ordering[T] = null). In this example, passing a custom Ordering does not cause any sorting: after the partitions are shrunk, the elements keep their original order within each coalesced partition.
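
To see that the order is preserved, the partition contents can be inspected with glom(); these lines are a sketch that would go inside CoalesceTest after coalesceValue is defined:

// Sketch: inspecting partition contents before and after coalescing
value.glom().collect().foreach(p => println("before: " + p.mkString(", ")))
coalesceValue.glom().collect().foreach(p => println("after:  " + p.mkString(", ")))
// The parent partitions are merged, but the elements are not reordered.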

