Hard-to-remember Spark operators

glom(): gathers the elements of each partition of the RDD into a single array, producing a new RDD of type RDD[Array[T]]

scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> rdd.glom().collect()
res7: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
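
A handy use of glom() is per-partition aggregation, for example taking the maximum of each partition before a global reduce. A minimal sketch continuing with the same rdd (the res numbers will differ per session):

scala> rdd.glom().map(_.max).collect()
res10: Array[Int] = Array(4, 8, 12, 16)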

coalesce(numPartitions): shrinks the number of partitions. Useful after filtering a large dataset, so that the resulting small dataset runs more efficiently. With the default shuffle = false it can only decrease the partition count, not increase it.

scala> val rdd = sc.parallelize(1 to 16,4)

scala> rdd.coalesce(3)
res8: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[12] at coalesce at <console>:27

scala> res8.partitions.size
res9: Int = 3
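
Note that with the default shuffle = false, asking coalesce for more partitions than the RDD already has is silently ignored; passing shuffle = true lifts that restriction. A quick sketch on the same 4-partition rdd (res numbers will vary by session):

scala> rdd.coalesce(8).partitions.size
res10: Int = 4

scala> rdd.coalesce(8, shuffle = true).partitions.size
res11: Int = 8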


repartition: calls coalesce under the hood; the difference is that repartition always performs a shuffle, so the partition count can be increased as well as decreased.
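
For the record, repartition(n) is defined in Spark's RDD.scala simply as coalesce(n, shuffle = true). A short sketch showing it moving the partition count in both directions (res numbers will vary by session):

scala> rdd.repartition(8).partitions.size
res12: Int = 8

scala> rdd.repartition(2).partitions.size
res13: Int = 2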

Reposted from blog.csdn.net/weixin_43548518/article/details/103403252