Spark RDD中repartition和coalesce的区别

1、repartition

repartition会根据用户传入的分区数重新通过网络分区所有数据,它会产生shuffle过程,所以是一个重型操作。

    val kv1: RDD[(String, Int)] = sc.parallelize(List(
      ("zhangsan", 11),
      ("zhangsan", 12),
      ("lisi", 13),
      ("wangwu", 14)
    ))
    val kv2: RDD[(String, Int)] = sc.parallelize(List(
      ("zhangsan", 21),
      ("zhangsan", 12),
      ("zhangsan", 22),
      ("lisi", 23),
      ("zhaoliu", 28)
    ))
    
    val value1: RDD[(String, Int)] = kv1.repartition(3)  ##结果:3
    println(value1.partitions.length)

2、coalesce

coalesce同样对用户传入的分区数进行分区,但是它不会产生shuffle过程。我们知道,DAGScheduler创建Task的数量取决于Stage的最后一个RDD的分区数,如果不进行shuffle,那么coalesce根本无法精准控制分区数。

    val kv1: RDD[(String, Int)] = sc.parallelize(List(
      ("zhangsan", 11),
      ("zhangsan", 12),
      ("lisi", 13),
      ("wangwu", 14)
    ))
    val kv2: RDD[(String, Int)] = sc.parallelize(List(
      ("zhangsan", 21),
      ("zhangsan", 12),
      ("zhangsan", 22),
      ("lisi", 23),
      ("zhaoliu", 28)
    ))
    val value: RDD[(String, Int)] = kv1.coalesce(5)
    println(value.partitions.length)  ##结果:1

在这里插入图片描述
在这里插入图片描述

猜你喜欢

转载自blog.csdn.net/qq_37163925/article/details/106225249