Key-Value RDD Transformation Operators, Part 1 — partitionBy, groupByKey & reduceByKey, aggregateByKey & foldByKey & combineByKey

A key-value RDD, also called a pair RDD, is an RDD whose elements are no longer single values: each element has the form (key, value).

e.g. a value-type RDD: sc.makeRDD(1 to 10)

        a key-value RDD: sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c")))

1. partitionBy

2. groupByKey

3. reduceByKey(func, [numTasks])

    ps: the difference between groupByKey and reduceByKey

4. aggregateByKey(zeroValue)(seqOp,combOp)

5. foldByKey

6. combineByKey

    ps: how aggregateByKey, foldByKey, and combineByKey differ


1. partitionBy

Repartitions a key-value RDD; if the number of partitions changes, a shuffle is triggered and the data is redistributed.

It is comparable to the coalesce(num) / repartition(num) operations on value-type RDDs, but the distribution logic differs:

partitionBy applies only to pair RDDs and places records with the supplied Partitioner (typically a HashPartitioner), so records with the same key end up in the same partition;

repartition, by contrast, internally attaches a random key to each record before shuffling, so records are spread out more or less arbitrarily.

scala> val rdd1 = sc.makeRDD(Array((1,"a"),(1,"b"),(2,"b"),(3,"c"),(4,"d")),4)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[51] at makeRDD at <console>:24

scala> rdd1.glom.collect
res16: Array[Array[(Int, String)]] = Array(Array((1,a)), Array((1,b)), Array((2,b)), Array((3,c), (4,d)))

scala> rdd1.partitionBy(new org.apache.spark.HashPartitioner(2)).glom.collect
res17: Array[Array[(Int, String)]] = Array(Array((2,b), (4,d)), Array((1,a), (1,b), (3,c)))
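partitionBy accepts any Partitioner, not only HashPartitioner. Below is a minimal sketch (not from the original post) of a hypothetical custom Partitioner that sends keys <= 2 to partition 0 and everything else to partition 1, reusing rdd1 from above:

import org.apache.spark.Partitioner

// hypothetical partitioner, for illustration only:
// keys <= 2 go to partition 0, all other keys to partition 1
class SmallKeyPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key match {
    case k: Int if k <= 2 => 0
    case _                => 1
  }
}

rdd1.partitionBy(new SmallKeyPartitioner).glom.collect
// expected: Array(Array((1,a), (1,b), (2,b)), Array((3,c), (4,d)))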

 

2. groupByKey

Collects all values with the same key into a single sequence.

e.g. first build a pair RDD:

scala> val words = Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)

scala> val rdd2 = sc.makeRDD(words).map(x => (x,1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[56] at map at <console>:26

scala> rdd2.collect
res18: Array[(String, Int)] = Array((one,1), (two,1), (two,1), (three,1), (three,1), (three,1))

Then group the identical keys and sum the values for each key:

scala> val group = rdd2.groupByKey()
group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[57] at groupByKey at <console>:28

scala> group.collect
res19: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))

scala> group.map(t=>(t._1,t._2.sum)).collect
res21: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

 

3. reduceByKey(func, [numTasks])

Similar to groupByKey, it aggregates the values that share a key; the number of reduce tasks can be set with the optional second parameter.

e.g. create a pair RDD and compute the sum of the values for each key:

scala> val rdd2 = sc.makeRDD(List(("female",1),("male",5),("female",7),("male",3)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[60] at makeRDD at <console>:24

scala> val reduce = rdd2.reduceByKey((x,y)=>x+y)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[61] at reduceByKey at <console>:26

scala> reduce.collect
res22: Array[(String, Int)] = Array((female,8), (male,8))
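The optional second argument sets the number of reduce-side partitions. A small sketch (not a captured shell session), reusing rdd2 from above; the partition count 3 is just an example:

val reduce3 = rdd2.reduceByKey(_ + _, 3)   // 3 reduce (shuffle) partitions
reduce3.getNumPartitions                   // expected: 3
reduce3.collect                            // expected: Array((female,8), (male,8)), order may vary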

 

ps: the difference between groupByKey and reduceByKey

Both shuffle records by key and operate on the values that share a key.

reduceByKey performs a combine (map-side pre-aggregation) before the shuffle, which reduces the amount of shuffle IO; groupByKey shuffles every record directly.

So in day-to-day work, as long as the business logic allows it, prefer reduceByKey when you only need an aggregated result rather than the grouped values themselves.

When aggregation rather than grouping is the goal, foldByKey and combineByKey are likewise preferable to groupByKey; a minimal comparison sketch follows.
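Here is a short sketch (not from the original post) computing the same word count both ways; the results match, but reduceByKey combines within each partition before shuffling:

val words = sc.makeRDD(Array("one", "two", "two", "three", "three", "three")).map(x => (x, 1))

val viaReduce = words.reduceByKey(_ + _)                        // map-side combine, then shuffle
val viaGroup  = words.groupByKey().map(t => (t._1, t._2.sum))   // shuffles every (word, 1) pair

viaReduce.collect   // expected: Array((two,2), (one,1), (three,3)), order may vary
viaGroup.collect    // expected: the same pairs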


4. aggregateByKey(zeroValue)(seqOp,combOp)

Aggregates in two stages: first, within each partition, the values that share a key are merged by the seqOp function, which receives the value and the initial value zeroValue as its arguments; then the per-partition results for each key are merged across partitions by the combOp function.

Zero value: a virtual value zeroValue assigned to each key as a starting point; it is then combined with the key's actual values by seqOp (comparison, addition, and so on).

  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }

Within-partition aggregation: the seqOp function;

Across-partition aggregation: the combOp function.

e.g. create a pair RDD, take the maximum value per key within each partition, then sum those per-partition maxima across partitions:

scala> val rdd4 = sc.makeRDD(List(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[62] at makeRDD at <console>:24

scala> rdd4.glom.collect
res23: Array[Array[(String, Int)]] = Array(Array((a,3), (a,2), (c,4)), Array((b,3), (c,6), (c,8)))

scala> rdd4.aggregateByKey(0)(math.max(_,_),_+_).collect
res25: Array[(String, Int)] = Array((b,3), (a,3), (c,12))

Two partitions were created and the zero value was set to 0. Taking the max within each partition gives (a,3), (c,4) in the first partition and (b,3), (c,8) in the second;

merging the same keys across partitions with sum then produces (a,3), (b,3), (c,12).
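One point the example does not show: the zero value's type U need not match the value type V. A sketch (an assumption, not from the original post) that collects the values for each key into a Set, reusing rdd4 from above:

rdd4.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,    // seqOp: add each value to the per-partition set
  (s1, s2) => s1 ++ s2    // combOp: union the per-partition sets
).collect
// expected: Array((b,Set(3)), (a,Set(3, 2)), (c,Set(4, 6, 8))), order may vary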

5. foldByKey

A simplified form of aggregateByKey in which seqOp and combOp are the same function;

an excerpt from the foldByKey source (the zero-value setup is elided) shows that both the within-partition and across-partition steps use the same cleanedFunc:

  def foldByKey(
      zeroValue: V,
      partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // ... zero-value serialization setup elided ...
    val cleanedFunc = self.context.clean(func)
    combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
      cleanedFunc, cleanedFunc, partitioner)
  }

e.g. create a pair RDD and compute the cumulative sum of the values for each key:

scala> val rdd5 = sc.makeRDD(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd5: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[65] at makeRDD at <console>:24

scala> rdd5.glom.collect
res26: Array[Array[(Int, Int)]] = Array(Array((1,3), (1,2)), Array((1,4), (2,3)), Array((3,6), (3,8)))

scala> rdd5.foldByKey(0)(_+_).glom.collect
res27: Array[Array[(Int, Int)]] = Array(Array((3,14)), Array((1,9)), Array((2,3)))

scala> rdd5.foldByKey(0)(_+_).collect
res28: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))

scala> rdd5.aggregateByKey(0)(_+_,_+_).collect
res29: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))

e.g. with a different zero value it becomes clear that the zero value takes part in the actual computation. For key 1, the values in Array((1,3),(1,2)) and Array((1,4),(2,3)) are added: within the first partition 10+3+2=15, within the second 10+4=14, and merging across partitions gives 15+14=29.

Note the order: within-partition aggregation happens first, then across-partition aggregation.


scala> rdd5.foldByKey(10)(_+_).collect
res30: Array[(Int, Int)] = Array((3,24), (1,29), (2,13))
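A small follow-up sketch (not from the original post): with an identity zero value, foldByKey gives the same result as reduceByKey; only a non-identity zero, as above, changes the outcome, because it is applied once per key per partition:

val wordPairs = sc.makeRDD(Array("one", "two", "two", "three", "three", "three")).map(x => (x, 1))
wordPairs.foldByKey(0)(_ + _).collect   // expected: Array((two,2), (one,1), (three,3))
wordPairs.reduceByKey(_ + _).collect    // expected: the same result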

 

6. combineByKey

Merges the values that share a key using a user-defined combiner.

combineByKey source:

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }

createCombiner: V => C — creates the accumulator, i.e. an initial value that can carry its own logic. The first time a key is seen within a partition, createCombiner turns that key's value V into a new value C used as the initial accumulator (contrast with foldByKey, whose zero value is a fixed constant).

mergeValue: (C, V) => C — from the second occurrence of a key within a partition onward, mergeValue combines the key's current accumulator C with the current value V into a new C.

mergeCombiners: (C, C) => C — across partitions, merges two accumulators C for the same key into one new C.

ps: like foldByKey, it takes three pieces (an initial value, a within-partition function, an across-partition function) and aggregates values by key. The differences:

(1) combineByKey's initial value is produced by a function and can contain logic; foldByKey's zero value is fixed;

(2) combineByKey's within-partition and across-partition functions must declare their parameter types explicitly, since they cannot be inferred; foldByKey's do not.

e.g. create a pair RDD and compute the mean of the values for each key.

Analysis: the first time a key is seen, its value becomes the initial accumulator: value => (value, 1); for example (a,88) produces (88,1).

Within a partition, every further occurrence of the key updates the accumulator: (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1); for example seeing (a,8) next turns it into (96,2).

Across partitions, the accumulators are merged: (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2); for example merging with (2,1) turns it into (98,3).

scala> val rdd6 = sc.makeRDD(Array(("a",88),("b",95),("a",91),("b",93),("a",95),("b",98)),3)
rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> val combine = rdd6.combineByKey((_,1),(acc:(Int,Int),v)=>(acc._1+v,acc._2+1),(acc1:(Int,Int),acc2:(Int,Int))=>(acc1._1+acc2._1,acc1._2+acc2._2))
combine: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[2] at combineByKey at <console>:26

scala> combine.collect
res1: Array[(String, (Int, Int))] = Array((a,(274,3)), (b,(286,3)))

scala> val result = combine.map{case(key,value)=>(key,value._1/value._2.toDouble)}
result: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[3] at map at <console>:28

scala> result.collect
res2: Array[(String, Double)] = Array((a,91.33333333333333), (b,95.33333333333333))
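A small variation (an assumption, not in the original post): since only the values change, mapValues is an equally valid way to finish the average, and it also preserves the partitioning:

val result2 = combine.mapValues { case (sum, cnt) => sum / cnt.toDouble }
result2.collect   // expected: Array((a,91.33333333333333), (b,95.33333333333333))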

ps: how aggregateByKey, foldByKey, and combineByKey differ:

(1) all three take (an initial value, a within-partition function, an across-partition function);

(2) with a fixed initial value, use aggregateByKey or foldByKey: aggregateByKey lets the within-partition and across-partition functions differ, while foldByKey uses the same function for both;

(3) when the initial value needs to carry logic of its own, use combineByKey. The sketch below shows the three side by side.
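To make point (1) concrete, here is a minimal sketch (not from the original post) writing the same per-key sum with all three operators; the data and partition count are arbitrary:

val pairs = sc.makeRDD(List(("a", 3), ("a", 2), ("c", 4), ("b", 3), ("c", 6), ("c", 8)), 2)

pairs.foldByKey(0)(_ + _)                          // one function serves both within and across partitions
pairs.aggregateByKey(0)(_ + _, _ + _)              // separate seqOp and combOp (identical here)
pairs.combineByKey(
  (v: Int) => v,                                   // createCombiner: may carry its own logic
  (c: Int, v: Int) => c + v,                       // mergeValue: within a partition
  (c1: Int, c2: Int) => c1 + c2)                   // mergeCombiners: across partitions
// each yields Array((b,3), (a,5), (c,18)) when collected, order may vary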


Reposted from blog.csdn.net/wx1528159409/article/details/87427075