Spark combineByKey

Looking at the source code, you will find that combineByKey is defined as follows:

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
    : RDD[(K, C)] = {
    combineByKey(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
  }

Example: grouped average calculation in Spark

import org.apache.spark.{SparkConf, SparkContext}

object ColumnValueAvg extends App {
  /**
    * ID,Name,ADDRESS,AGE
    * 001,zhangsan,chaoyang,20
    * 002,zhangsa,chaoyang,27
    * 003,zhangjie,chaoyang,35
    * 004,lisi,haidian,24
    * 005,lier,haidian,40
    * 006,wangwu,chaoyang,90
    * 007,wangchao,haidian,80
    */
  val conf = new SparkConf().setAppName("test column value sum and avg").setMaster("local[1]")
  val sc = new SparkContext(conf)

  val textRdd = sc.textFile(args(0))

  // the toInt here is necessary: without the cast, + would concatenate the age strings instead of summing them
  val addressAgeMap = textRdd.map(x => (x.split(",")(2), x.split(",")(3).toInt))

  // total age per address
  val sumAgeResult = addressAgeMap.reduceByKey(_ + _)
  sumAgeResult.collect().foreach(println)

  // average age per address: the combiner is a (sum, count) pair
  val avgAgeResult = addressAgeMap.combineByKey(
    (v) => (v, 1),
    (accu: (Int, Int), v) => (accu._1 + v, accu._2 + 1),
    (accu1: (Int, Int), accu2: (Int, Int)) => (accu1._1 + accu2._1, accu1._2 + accu2._2)
  ).mapValues(x => (x._1 / x._2).toDouble)
  avgAgeResult.collect().foreach(println)

  println("Sum and Avg calculate successfuly")

  sc.stop()

}

The combineByKey function takes three functions as parameters, namely createCombiner, mergeValue, and mergeCombiners. You need to understand what each of these three functions does.

In terms of the data, combineByKey groups the elements by key; the three parameters all describe operations on the values.

1> The first parameter createCombiner, as defined in the code is: (v) => (v, 1)

Here, a combiner is created. The function is to generate a (v,1) combiner when traversing the partition of rdd and encounter the key value that appears for the first time. For example, the key here is address. When the first one is encountered

When chaoyang,20, the v in (v,1) is the value of age 20, 1 is the number of times the address appears
 
2> The second parameter is mergeValue, as the name implies, it is the merge value, as defined in the code: (accu: (Int, Int), v) => (accu._1 + v, accu._2 + 1)
The function here is that when processing the current partition, if you encounter a key that has already appeared, then merge the value in the combiner, pay attention here accu: (Int, Int) corresponds to the combiner that appears in the first parameter, ie (v, 1). Note that the type must be consistent,
then (accu._1 + v, accu._2 + 1) is easy to understand, accu. _1 Even if the value of age needs to be merged, and acc._2 is the number of occurrences of the key value that needs to be merged, add 1 once it occurs
 
3> The third parameter is mergeCombiners, which is used to merge the accumulators on each partition, because each partition After running the first two functions respectively, you need to finally merge the partition results.
 
OK, run the code. The results are as follows, the average age calculated per address:

(haidian, 48.0)
(chaoyang, 43.0)
 
Because combineByKey is so highly abstracted, you can plug in your own functions as the computation factors and flexibly build richer computations on top of it. Both reduceByKey and groupByKey are implemented in terms of combineByKey.
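As a rough illustration of that last point (a sketch only, not Spark's actual source code; assume sc is an active SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey(_ + _) behaves like a combineByKey whose combiner is just a value
val reduced = pairs.combineByKey(
  (v: Int) => v,                   // createCombiner: the first value becomes the combiner
  (c: Int, v: Int) => c + v,       // mergeValue: fold later values in within a partition
  (c1: Int, c2: Int) => c1 + c2    // mergeCombiners: merge the per-partition results
)

// groupByKey() behaves like a combineByKey whose combiner is a buffer of values
val grouped = pairs.combineByKey(
  (v: Int) => List(v),
  (c: List[Int], v: Int) => v :: c,
  (c1: List[Int], c2: List[Int]) => c1 ::: c2
)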

combineByKey

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K, C)]

1. The first parameter, createCombiner: V => C. When combineByKey encounters a given key k for the first time, it calls createCombiner to convert that key's value V into a combiner C. (This step is similar to an initialization.)
2. The second parameter, mergeValue: (C, V) => C. When combineByKey encounters a key k that it has already seen, it calls mergeValue to fold the value v into the combiner c. (This operation is performed within each partition.)
3. The third parameter, mergeCombiners: (C, C) => C. This merges two C values into a single C. (This operation is performed across different partitions.)
4. The return value of the operator is finally of type RDD[(K, C)], meaning that for each key k the values have been converted from the original type V to type C.
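The numPartitions and partitioner overloads listed above only change how the resulting RDD is partitioned; the combiner logic is unchanged. A sketch (reusing the combiner functions from the averaging example, with a hypothetical 4-partition result):

import org.apache.spark.HashPartitioner

val avgWithExplicitPartitioner = addressAgeMap.combineByKey(
  (v: Int) => (v, 1),
  (accu: (Int, Int), v: Int) => (accu._1 + v, accu._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2),
  new HashPartitioner(4)   // or pass an Int via the numPartitions overload
).mapValues(x => x._1.toDouble / x._2)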

val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)  // zip the two RDDs together into a single RDD of (Int, String) pairs

val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
// In this combineByKey, the first value seen for a key is wrapped into a List.
// The second function, for a key already seen within the partition, prepends each new value to that List.
// The last function is the merge: it concatenates the Lists built for the same key on different partitions.
 
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0))  
val d1 = sc.parallelize(initialScores)  
type MVType = (Int, Double) // tuple type alias (subject counter, score); everywhere in this code (Int, Double) can be written as MVType
d1.combineByKey(
  score => (1, score),
  // score => (1, score): wrap the score in the combiner tuple. Taking "Fred" as an example, his first score 88.0 becomes (1, 88.0); the 1 is the subject counter, since so far there is only one subject
  (c1: MVType, newScore) => (c1._1 + 1, c1._2 + newScore),
  // c1 here is the (1, 88.0) produced by createCombiner. Within the same partition we meet another score for "Fred", 91.0; add it to the running total (c1._2 + newScore) and increment the subject counter (c1._1 + 1)

  (c1: MVType, c2: MVType) => (c1._1 + c2._1, c1._2 + c2._2)
  // "Fred" may be a top student whose many subjects are spread across different partitions. After every partition has run mergeValue, the per-partition results are merged: subject counts are added together and scores are added together, giving the total number of subjects and the total score
).map {
  case (name, (num, score)) => (name, score / num)
}.collect
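With the scores above, this works out to Fred: (88.0 + 95.0 + 91.0) / 3 ≈ 91.33 and Wilma: (93.0 + 95.0 + 98.0) / 3 ≈ 95.33.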

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

reduceByKey acts on key-value pair data and aggregates records that share the same key. You pass in a function that takes the values from two pairs with the same key and combines them into a single value.

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x)) //生成一个键值对类型的数据,键为字符串长度,值为字符串。
b.reduceByKey(_ + _).collect  //对于有相同的键的元祖进行累加,由于所有的数据的长度都是3,所以最后得到了如下的结果
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))
 
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x)) //同样的,将数据变为元祖。
b.reduceByKey(_ + _).collect //长度为3的数据有dog,cat,长度为4的数据有lion。长度为5的有tiger和eagle。长度为7的有一个panther。

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]  // group a key-value RDD by key; the value in the result is an iterable containing the values of every pair whose key is K
 
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length) // the keyBy operator uses _.length as the key; after groupByKey the values for each key come back as an ArrayBuffer
b.groupByKey.collect
 
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

groupByKey does not merge data before the shuffle (no map-side combine), so of the three operators its performance is the lowest.
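As a sketch of why (reusing the addressAgeMap pair RDD from the averaging example above): computing the average with groupByKey ships every individual (address, age) value across the network before anything is reduced, whereas the combineByKey version already builds a (sum, count) combiner inside each partition, so far less data is shuffled.

// average age per address via groupByKey: all values are shuffled first,
// and the sum/count only happen after each key's full group has been assembled
val avgViaGroupByKey = addressAgeMap
  .groupByKey()
  .mapValues(ages => ages.sum.toDouble / ages.size)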
