Key-Value RDD Transformation Operators, Part 2: sortByKey, mapValues, join & cogroup

Continued from the previous post.

7. sortByKey(true/false,[numTasks])

8. mapValues

9. join (inner join)

10. cogroup (outer-join-style grouping)


7. sortByKey(true/false,[numTasks])

Called on a pair RDD, sortByKey sorts the elements by key, in ascending order (true) or descending order (false). The default is ascending.

scala> val rdd7 = sc.makeRDD(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd7: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[4] at makeRDD at <console>:24

scala> rdd7.sortByKey(true).collect
res4: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))

scala> rdd7.sortByKey(false).collect
res5: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
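
The optional second argument from the heading, [numTasks], was not used above. The sketch below shows one way to pass it, assuming a spark-shell session where sc is predefined. The names pairs, sortedDesc, descKeys and partCount are made up for this illustration, and the parameter names ascending and numPartitions are those of recent Spark releases, which is an assumption about the version in use.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val pairs = sc.makeRDD(Array((3, "aa"), (6, "cc"), (2, "bb"), (1, "dd")))

// Sort by key in descending order and ask for 2 output partitions.
val sortedDesc = pairs.sortByKey(ascending = false, numPartitions = 2)

// Keys come back in descending order once collected to the driver.
val descKeys = sortedDesc.keys.collect()

// Number of partitions of the sorted RDD (no more than the 2 requested).
val partCount = sortedDesc.getNumPartitions
```

The sort is a range-partitioned shuffle, so the second argument controls how many tasks write the sorted output, not the order of the result.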

8. mapValues

For data in (K,V) pair form, mapValues leaves the key untouched and applies the function to the value V only.

Example: create a pair RDD and append "~" to each value.

scala> val rdd8 = sc.makeRDD(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd8: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[11] at makeRDD at <console>:24

scala> rdd8.mapValues(_+"~").collect
res6: Array[(Int, String)] = Array((1,a~), (1,d~), (2,b~), (3,c~))
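
Because mapValues cannot change the key, Spark can keep the RDD's partitioner, while a plain map that rebuilds the tuple drops it. The sketch below compares the two, assuming a spark-shell session where sc is predefined. The names byKey, viaMapValues, viaMap and sameData are illustrative only.

```scala
import org.apache.spark.HashPartitioner

// Assumes spark-shell, where `sc` already exists.
// Hash-partition the pairs by key so the RDD carries a partitioner.
val byKey = sc.makeRDD(Array((1, "a"), (1, "d"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(2))

// Same transformation written two ways.
val viaMapValues = byKey.mapValues(_ + "~")
val viaMap = byKey.map { case (k, v) => (k, v + "~") }

// Both produce the same elements...
val sameData =
  viaMapValues.collect().sorted.sameElements(viaMap.collect().sorted)

// ...but only mapValues keeps the partitioner.
val keptPartitioner = viaMapValues.partitioner // Some(...)
val lostPartitioner = viaMap.partitioner       // None
```

Keeping the partitioner matters when a join or reduceByKey follows, since already-partitioned data may not need another shuffle.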

9. join (inner join)

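Given two RDDs of type (K,V) and (K,W), join returns an RDD of type (K,(V,W)) that pairs up the values sharing the same key. It is an inner join: the result covers the intersection of the two key sets.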

scala> val rddA = sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c")))
rddA: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at makeRDD at <console>:24

scala> val rddB = sc.makeRDD(Array((1,6),(2,7),(3,8),(4,9),(5,10)))
rddB: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:24

scala> rddA.join(rddB).collect
res3: Array[(Int, (String, Int))] = Array((1,(a,6)), (2,(b,7)), (3,(c,8)))

Only keys present in both RDDs are returned. Keys 4 and 5 exist only in rddB, so they do not appear in the result.
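
In the example above every key of rddA also exists in rddB, so the intersection is easy to miss. The sketch below uses a left side with a duplicate key and an unmatched key, and contrasts join with leftOuterJoin. It assumes a spark-shell session where sc is predefined. The names left, right, inner and unmatched are illustrative.

```scala
// Assumes spark-shell, where `sc` already exists.
val left  = sc.makeRDD(Array((1, "a"), (1, "x"), (2, "b"), (9, "z")))
val right = sc.makeRDD(Array((1, 6), (2, 7), (4, 9)))

// Inner join: key 9 (left only) and key 4 (right only) are dropped;
// key 1 appears twice on the left, so it yields two result rows.
val inner = left.join(right).collect().sorted
// (1,(a,6)), (1,(x,6)), (2,(b,7))

// Left outer join keeps every left key; a missing right value is None.
val unmatched = left.leftOuterJoin(right)
  .filter { case (_, (_, w)) => w.isEmpty }
  .collect()
// (9,(z,None))
```

rightOuterJoin and fullOuterJoin follow the same pattern, wrapping the side that may be missing in an Option.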

10. cogroup (outer-join-style grouping)

Given two RDDs of type (K,V) and (K,W), cogroup returns an RDD of type (K,(Iterable<V>,Iterable<W>)).

Example: create two pair RDDs and, for each key, gather the values from each RDD into its own iterable.

scala> val rdd1 = sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[9] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array((1,2),(2,3),(3,4),(4,5),(5,6)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at makeRDD at <console>:24

scala> rdd1.cogroup(rdd2).collect
res2: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(2))), (2,(CompactBuffer(b),CompactBuffer(3))), (3,(CompactBuffer(c),CompactBuffer(4))), (4,(CompactBuffer(),CompactBuffer(5))), (5,(CompactBuffer(),CompactBuffer(6))))

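Every key from either RDD appears in the result. When a key has no values on one side (here keys 4 and 5 are missing from rdd1), that side is shown as an empty CompactBuffer().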
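
The next sketch adds a duplicate key and a key missing on each side, then rebuilds an inner join from the cogroup result to show how the two operators relate. It assumes a spark-shell session where sc is predefined. The names names, scores, grouped, asMap and rebuiltJoin are illustrative.

```scala
// Assumes spark-shell, where `sc` already exists.
val names  = sc.makeRDD(Array((1, "a"), (1, "d"), (2, "b")))
val scores = sc.makeRDD(Array((1, 2), (3, 4)))

val grouped = names.cogroup(scores)

// Turn the iterables into sorted Lists and pull them to the driver
// as a Map, which is easier to inspect than CompactBuffer output.
val asMap = grouped
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collectAsMap()
// asMap(1) is (List(a, d), List(2))
// asMap(2) is (List(b), List())   key 2 has no score
// asMap(3) is (List(), List(4))   key 3 has no name

// An inner join can be expressed on top of cogroup: pair every v with
// every w; keys with an empty side produce no rows.
val rebuiltJoin = grouped.flatMapValues { case (vs, ws) =>
  for (v <- vs; w <- ws) yield (v, w)
}
```

So cogroup keeps all keys, and the join variants differ only in how they treat keys with an empty side.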

Reposted from blog.csdn.net/wx1528159409/article/details/87427412