repartition:
/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 *
 * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
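As the body shows, repartition is only a thin wrapper that forwards to coalesce with shuffle = true. A minimal sketch of this equivalence (the object name RepartitionDemo and the local master setting are my own illustration, not from the source):

import org.apache.spark.sql.SparkSession

object RepartitionDemo extends App {
  val spark = SparkSession.builder()
    .appName("RepartitionDemo")
    .master("local[4]")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 100, 4)

  // Both calls trigger a shuffle and end up with 8 partitions;
  // repartition(8) is just shorthand for coalesce(8, shuffle = true).
  println(rdd.repartition(8).getNumPartitions)              // 8
  println(rdd.coalesce(8, shuffle = true).getNumPartitions) // 8

  spark.stop()
}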
coalesce:
/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions. If a larger number
 * of partitions is requested, it will stay at the current number of partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * My understanding: when many partitions are to be collapsed into a single partition,
 * you can set shuffle = true. Otherwise one executor has to pull the data of all of the
 * parent RDD's partitions over Netty; with shuffle = true and numPartitions = 1, the
 * executors that hold the parent partitions instead send their data to the single
 * partition on one executor.
 *
 * @note With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner. The optional partition coalescer
 * passed in must be serializable.
 *
 * That is, abnormally large partitions can be redistributed to a larger number of
 * partitions using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position: Int = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    }: Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed,
    // i.e. everything before this step still runs in parallel
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
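A minimal sketch of the branches described in the Scaladoc above (the local session and the object name CoalesceBranches are mine, purely for illustration):

import org.apache.spark.sql.SparkSession

object CoalesceBranches extends App {
  val spark = SparkSession.builder()
    .appName("CoalesceBranches")
    .master("local[4]")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 1000, 100)

  // Narrow dependency: 100 -> 10, each new partition claims ~10 parent partitions, no shuffle.
  println(rdd.coalesce(10).getNumPartitions)                   // 10

  // Without shuffle, asking for MORE partitions has no effect: stays at 100.
  println(rdd.coalesce(1000).getNumPartitions)                 // 100

  // With shuffle = true the data is redistributed by a HashPartitioner, so 1000 is honored.
  println(rdd.coalesce(1000, shuffle = true).getNumPartitions) // 1000

  spark.stop()
}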
CoalescedRDD:
/**
 * Represents a coalesced RDD that has fewer partitions than its parent RDD.
 * This class uses the PartitionCoalescer class to find a good partitioning of the parent RDD
 * so that each new partition has roughly the same number of parent partitions and that
 * the preferred location of each new partition overlaps with as many preferred locations of its
 * parent partitions as possible.
 *
 * In other words, parent partitions are grouped into the same new partition as much as possible.
 * For example, if a parent with 10 partitions is shrunk to 1 partition and parent partitions 1-6
 * live on the same executor, the new partition is placed where partitions 1-6 are; the data of
 * parent partitions 7-10 is then fetched over Netty to that location. Since nothing is written to
 * disk, disk I/O is avoided, which makes this noticeably faster than a shuffle.
 *
 * @param prev RDD to be coalesced
 * @param maxPartitions number of desired partitions in the coalesced RDD (must be positive)
 * @param partitionCoalescer [[PartitionCoalescer]] implementation to use for coalescing
 */
private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) {  // Nil since we implement getDependencies

  require(maxPartitions > 0 || maxPartitions == prev.partitions.length,
    s"Number of partitions ($maxPartitions) must be positive.")
  if (partitionCoalescer.isDefined) {
    require(partitionCoalescer.get.isInstanceOf[Serializable],
      "The partition coalescer passed in must be serializable.")
  }

  override def getPartitions: Array[Partition] = {
    val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())

    pc.coalesce(maxPartitions, prev).zipWithIndex.map { case (pg, i) =>
      val ids = pg.partitions.map(_.index).toArray
      new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // flatMap the iterators of the assigned parent partitions into this single partition
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

  override def getDependencies: Seq[Dependency[_]] = {
    Seq(new NarrowDependency(prev) {
      def getParents(id: Int): Seq[Int] =
        partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
    })
  }

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
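Since compute simply chains the iterators of the parent partitions assigned to each coalesced partition, the grouping can be observed directly with glom. A small sketch (object name and data are my own illustration, assuming a local session):

import org.apache.spark.sql.SparkSession

object CoalescedGroupingDemo extends App {
  val spark = SparkSession.builder()
    .appName("CoalescedGroupingDemo")
    .master("local[4]")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 12, 6) // 6 parent partitions

  val coalesced = rdd.coalesce(2) // narrow dependency, no shuffle

  // Each coalesced partition is the concatenation of roughly 3 parent partitions.
  coalesced.glom().collect().zipWithIndex.foreach { case (elems, i) =>
    println(s"partition $i: ${elems.mkString(", ")}")
  }

  spark.stop()
}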
In practice you can use coalesce flexibly according to your needs:
When parallelism is insufficient, you can increase the number of partitions.
When there are too many small tasks, you can shrink and merge partitions to reduce the partition count; this lowers task-scheduling overhead and avoids a shuffle, making it considerably more efficient than RDD.repartition.
When a particular partition suffers from data skew, repartitioning also helps: raising the partition count with repartition can alleviate the skew (see the sketch below).
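As an illustration of the skew case (a sketch with made-up data; SkewRepartitionDemo and the numbers are my own, not from the original post), one heavily loaded partition can be spread out by repartition at the cost of a shuffle:

import org.apache.spark.sql.SparkSession

object SkewRepartitionDemo extends App {
  val spark = SparkSession.builder()
    .appName("SkewRepartitionDemo")
    .master("local[4]")
    .getOrCreate()

  val sc = spark.sparkContext

  // Build a deliberately skewed RDD: partition 0 gets 10000 records, partitions 1-3 get 10 each.
  val skewed = sc.parallelize(Seq(0, 1, 2, 3), 4).flatMap { p =>
    val n = if (p == 0) 10000 else 10
    (1 to n).map(i => (p, i))
  }

  // Record counts per partition before: heavily skewed, e.g. 10000, 10, 10, 10
  println(skewed.glom().map(_.length).collect().mkString(", "))

  // After repartition the records are spread (roughly) evenly across 8 partitions.
  println(skewed.repartition(8).glom().map(_.length).collect().mkString(", "))

  spark.stop()
}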
Example:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceTest extends App {
  val sparkConf = new SparkConf()
    .setAppName("CoalesceTest")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val value: RDD[Int] = spark.sparkContext.parallelize(List(9, 2, 3, 5, 8, 1), 3)

  val s = new Ordering[Int] {
    // plain ascending comparison; coalesce does not use this ordering anyway
    override def compare(x: Int, y: Int): Int = x.compareTo(y)
  }
  val coalesceValue: RDD[Int] = value.coalesce(2)(s)
  coalesceValue.foreach(println(_))
}

coalesce takes an optional implicit parameter (implicit ord: Ordering[T] = null), but passing a custom Ordering here does not sort anything: after the partitions are coalesced, the elements keep their original positions.
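To check that claim, the partition contents before and after can be printed with glom. These lines are my own addition and can be appended inside CoalesceTest after coalesceValue is defined (the exact grouping chosen by the DefaultPartitionCoalescer may vary):

// Show which elements sit in which partition, before and after coalescing.
val before = value.glom().collect().map(_.mkString("[", ", ", "]"))
val after  = coalesceValue.glom().collect().map(_.mkString("[", ", ", "]"))
println(before.mkString(" ")) // [9, 2] [3, 5] [8, 1]
println(after.mkString(" "))  // e.g. [9, 2] [3, 5, 8, 1] -- no sorting, order within parents preserved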
https://blog.csdn.net/u012684933/article/details/51028707