Transformation operators on two value-type RDDs, i.e., the operators for combining two value-type RDDs
1. union
Returns a new RDD containing the union of the two RDDs (duplicates are not removed)
scala> val rdd1 = sc.makeRDD(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:24
scala> rdd1.union(rdd2).collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
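Because union keeps duplicates (4 and 5 appear twice above), a `distinct` call can be chained when true set-union semantics are wanted. A minimal sketch, assuming the same `rdd1`/`rdd2` from the session above:

```scala
// union keeps duplicates; chaining distinct removes them.
// Sorting only makes the driver-side output deterministic for display.
val deduped = rdd1.union(rdd2).distinct()
deduped.collect().sorted  // Array(1, 2, 3, 4, 5, 6, 7, 8)
```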
2. subtract
The difference of two RDDs: starting from the current RDD, remove the elements that also appear in the other RDD and keep the rest
scala> val rdd1 = sc.makeRDD(3 to 8)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(1 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at <console>:24
scala> rdd1.subtract(rdd2).collect
res2: Array[Int] = Array(8, 6, 7)
scala> rdd2.subtract(rdd1).collect
res3: Array[Int] = Array(1, 2)
As shown, rdd1.subtract(rdd2) removes the shared elements 3/4/5 and keeps the remaining 6/7/8,
while rdd2.subtract(rdd1) removes the shared elements 3/4/5 and keeps the remaining 1/2.
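One nuance worth noting: unlike SQL's EXCEPT, `subtract` does not deduplicate the left-hand RDD; duplicates that have no match on the right are kept. A hypothetical snippet (the names `left`/`right` are illustrative, not from the session above):

```scala
// subtract removes every occurrence of elements found in the other RDD,
// but duplicates unique to the left RDD are preserved
val left  = sc.makeRDD(Seq(1, 1, 2, 3))
val right = sc.makeRDD(Seq(3))
left.subtract(right).collect()  // contains 1, 1, 2 (order may vary)
```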
3. intersection
Computes the intersection of two RDDs
scala> val rdd1 = sc.makeRDD(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at makeRDD at <console>:24
scala> rdd1.intersection(rdd2).collect
res4: Array[Int] = Array(5, 6, 7)
scala> rdd2.intersection(rdd1).collect
res5: Array[Int] = Array(5, 6, 7)
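Unlike union, `intersection` does deduplicate its result, even when both inputs contain repeated elements; it also triggers a shuffle. A quick sketch (the names `a`/`b` are illustrative):

```scala
// intersection removes duplicates from the result,
// even though both RDDs contain 1 twice
val a = sc.makeRDD(Seq(1, 1, 2, 3))
val b = sc.makeRDD(Seq(1, 1, 3, 4))
a.intersection(b).collect().sorted  // Array(1, 3)
```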
4. cartesian
Computes the Cartesian product of two RDDs; best avoided in general, because the result grows to |rdd1| × |rdd2| elements
scala> val rdd1 = sc.makeRDD(1 to 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(2 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> rdd1.cartesian(rdd2).collect
res6: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))
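The size blow-up is easy to confirm in the same session, since the result count is the product of the two input counts:

```scala
// 3 elements x 4 elements = 12 pairs, as in the collect output above;
// on large RDDs this product becomes prohibitively expensive
val pairs = rdd1.cartesian(rdd2)
pairs.count()  // 12
```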
5. zip
Zips the elements of the two RDDs pairwise into key-value pairs. The two RDDs must have the same number of partitions and the same number of elements in each corresponding partition; otherwise zip fails with a runtime error
scala> val rdd1 = sc.makeRDD(Array(1,2,3),4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array("a","b","c"),4)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at makeRDD at <console>:24
scala> rdd1.zip(rdd2).collect
res7: Array[(Int, String)] = Array((1,a), (2,b), (3,c))
scala> rdd1.zip(rdd2).glom.collect
res8: Array[Array[(Int, String)]] = Array(Array(), Array((1,a)), Array((2,b)), Array((3,c)))
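To illustrate the failure mode, a sketch of a mismatched zip (the names `r1`/`r2` are hypothetical): the mismatch is only detected at runtime, when an action forces evaluation.

```scala
// Same partition count, but different element counts:
// the failure surfaces only when collect() runs the job
val r1 = sc.makeRDD(Seq(1, 2, 3), 2)
val r2 = sc.makeRDD(Seq("a", "b"), 2)
// r1.zip(r2).collect()  // throws SparkException: element counts per partition differ
```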