Spark Key-Value RDD Operators

1) groupByKey case
1. Function: groupByKey groups the values of each key, producing one sequence per key.
2. Requirement: create a pairRDD, aggregate the values of the same key into a sequence, then compute the sum of each key's values.

// Create an RDD, specifying 2 partitions
val ListRDD: RDD[String] = sc.makeRDD(List("Abo", "Spark", "Hadoop",
  "Python", "Python", "Scala", "Spark", "Spark"), 2)
// Map each element of ListRDD into a tuple of the form ("Abo", 1)
val MapRDD: RDD[(String, Int)] = ListRDD.map((_, 1))
// Group the elements of MapRDD by key (the first element of each tuple)
val GroupByKeyRDD: RDD[(String, Iterable[Int])] = MapRDD.groupByKey()
GroupByKeyRDD.collect().foreach(println)

Return result:
(Python,CompactBuffer(1, 1))
(Scala,CompactBuffer(1))
(Abo,CompactBuffer(1))
(Spark,CompactBuffer(1, 1, 1))
(Hadoop,CompactBuffer(1))
Next, sum the values of each key:

 val groupBySum: RDD[(String, Int)] = GroupByKeyRDD.map(x => (x._1, x._2.sum))
 groupBySum.collect().foreach(println)

(Python, 2)
(Scala, 1)
(Abo, 1)
(Spark, 3)
(Hadoop, 1)

2) reduceByKey(func, [numTasks]) case
1. Function: called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function. The number of reduce tasks can be set with the optional numTasks parameter.
2. Requirement: create a pairRDD and compute the sum of the values of each key.

val ListRDD: RDD[(String, Int)] = sc.makeRDD(List(("Abo", 1), ("Spark", 1), ("Hadoop", 1),
  ("Python", 1), ("Python", 1), ("Scala", 1), ("Spark", 1), ("Spark", 1)), 2)
// Compute the sum of the values of each key
val reduceByKeyRDD: RDD[(String, Int)] = ListRDD.reduceByKey(_ + _)
reduceByKeyRDD.collect().foreach(println)

Return result:
(Python,2)
(Scala,1)
(Abo,1)
(Spark,3)
(Hadoop,1)
The difference between reduceByKey and groupByKey:
1. reduceByKey: aggregates by key, with a combine (pre-aggregation) step before the shuffle; the return result is RDD[K, V].
2. groupByKey: groups by key and shuffles directly, without pre-aggregation.
3. Therefore, reduceByKey is recommended in development.
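The difference can be sketched with plain Scala collections (no SparkContext needed; this is a sketch of the two operators' semantics, not Spark's actual shuffle machinery):

```scala
val pairs = List(("Abo", 1), ("Spark", 1), ("Hadoop", 1),
  ("Python", 1), ("Python", 1), ("Scala", 1), ("Spark", 1), ("Spark", 1))

// groupByKey-style: collect every value per key first, then sum the whole sequence
val viaGroup: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// reduceByKey-style: fold each value into a running partial sum per key
// (this pre-aggregation is what reduces shuffle traffic in Spark)
val viaReduce: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

println(viaGroup)   // e.g. Spark -> 3, Python -> 2, Abo -> 1, ...
println(viaReduce)  // same totals, built incrementally
```

Both approaches give identical totals; the difference in Spark is only where the summing happens (before vs. after the shuffle).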
3) aggregateByKey case
Parameters: (zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U)
1. Function: in an RDD of (K, V) pairs, the values are grouped by key. Within each partition, each value is combined with the initial value by the seq function, and the result becomes the new accumulator for that key; then the per-partition results of the same key are merged by the combine function (the first two results are combined, that result is combined with the next one, and so on). Each key and its final result are output as a new (K, V) pair.
2. Parameter description:
(1) zeroValue: the initial value given to each key in each partition
(2) seqOp: the function used to iteratively combine the values with the initial value within each partition
(3) combOp: the function used to merge the results of each partition
3. Requirement: create a pairRDD, take the maximum value of each key within each partition, then add the per-partition maxima together.

val ListRDD: RDD[(String, Int)] = sc.makeRDD(Array(("a", 3), ("c", 3), ("c", 6), ("a", 4), ("a", 7), ("b", 10)), 2)
ListRDD.glom().collect().foreach(data => println(data.mkString(",")))
println("-----------------------")
// Within each partition take the maximum value of each key, then sum across partitions
// def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
//     combOp: (U, U) => U): RDD[(K, U)] = self.withScope {

// zeroValue: initial value; seqOp: (U, V) => U: within a partition; combOp: (U, U) => U: across partitions
val aggRDD: RDD[(String, Int)] = ListRDD.aggregateByKey(0)(math.max(_, _), _ + _)
aggRDD.glom().collect().foreach(data => println(data.mkString(",")))

Running result:
(a,3),(c,3),(c,6)
(a,4),(a,7),(b,10)
-----------------------
(b,10)
(a,10),(c,6)
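The two-step flow can be traced with plain Scala collections (a sketch of the semantics, not Spark's implementation; the two partitions are hard-coded to match the glom output above):

```scala
// The two partitions as printed by glom above
val part0 = List(("a", 3), ("c", 3), ("c", 6))
val part1 = List(("a", 4), ("a", 7), ("b", 10))

def seqOp(u: Int, v: Int): Int = math.max(u, v)  // in-partition: running max against zeroValue
def combOp(u1: Int, u2: Int): Int = u1 + u2      // cross-partition: sum the per-partition maxima

// Step 1: inside each partition, fold every value of a key into zeroValue (0) with seqOp
def inPartition(part: List[(String, Int)]): Map[String, Int] =
  part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).foldLeft(0)(seqOp)) }

// Step 2: merge the per-partition results of the same key with combOp
val merged: Map[String, Int] =
  (inPartition(part0).toList ++ inPartition(part1).toList)
    .groupBy(_._1).map { case (k, us) => (k, us.map(_._2).reduce(combOp)) }

println(merged)  // a -> 10 (max 3 + max 7), b -> 10, c -> 6
```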

4) foldByKey case
Parameters: (zeroValue: V)(func: (V, V) => V)
foldByKey is a simplification of aggregateByKey in which the two function parameters (seqOp: (U, V) => U, combOp: (U, U) => U) are collapsed into a single func; that is, only one operation can be performed, and the in-partition operation is the same as the cross-partition merge.

val foldRDD: RDD[(String, Int)] = ListRDD.foldByKey(0)(_ + _)
foldRDD.glom().collect().foreach(data => println(data.mkString(",")))

Return result:
(a,3),(c,3),(c,6)
(a,4),(a,7),(b,10)


(b,10)
(a,14),(c,9)
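Because foldByKey(0)(_ + _) uses the same function within and across partitions, the final sums do not depend on how the data is partitioned. A plain-Scala sketch of the semantics:

```scala
val pairs = List(("a", 3), ("c", 3), ("c", 6), ("a", 4), ("a", 7), ("b", 10))

// foldByKey-style: the single function (_ + _) both folds values into
// zeroValue (0) inside a partition and merges partial results across partitions,
// so folding all values per key in one pass gives the same totals
val folded: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).foldLeft(0)(_ + _)) }

println(folded)  // a -> 14, b -> 10, c -> 9
```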

5) combineByKey case
This operator is more complex. Parameters:
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C
(1) createCombiner: combiner creation function, turns a V into a C. The input is a V from RDD[K, V] and the output is a C (called once for the first occurrence of each key in a partition).
(2) mergeValue: merge-value function, merges a C and a V into a C. The inputs are (C, V) and the output is C (merges the result of createCombiner with the remaining values of the same key within the partition).
(3) mergeCombiners: merge-combiners function, merges two C values into one C. The inputs are (C, C) and the output is C (aggregates the per-partition results of the same key across partitions).

Requirement: create a pairRDD and compute the average value of each key (first compute the number of occurrences of each key and the sum of its values, then divide).

val ListRDD: RDD[(String, Int)] = sc.makeRDD(Array(("c", 10), ("c", 20), ("a", 4), ("c", 10), ("a", 7), ("b", 20), ("a", 28)), 2)
ListRDD.glom().collect().foreach(data => println(data.mkString(",")))
println("*" * 30)
// Compute the number of occurrences of each key in ListRDD and the sum of its values
val combineRDD: RDD[(String, (Int, Int))] = ListRDD.combineByKey((_, 1),
  (sum1: (Int, Int), v) => (sum1._1 + v, sum1._2 + 1),
  (sum2: (Int, Int), sum3: (Int, Int)) => (sum2._1 + sum3._1, sum2._2 + sum3._2))
val resultRDD: RDD[(String, Double)] = combineRDD.map(data => (data._1, data._2._1 / data._2._2.toDouble))
resultRDD.glom().collect().foreach(data => println(data.mkString(",")))

createCombiner: V => C, i.e. (_, 1): maps the first value of each key into a tuple, e.g. ("c",10) => (10,1)
mergeValue: (C, V) => C, i.e. (sum1: (Int, Int), v) => (sum1._1 + v, sum1._2 + 1): adds v to the first element of the tuple and increments the second element by 1, e.g. (10,1) with 20 => (30,2)
mergeCombiners: (C, C) => C, i.e. (sum2: (Int, Int), sum3: (Int, Int)) => (sum2._1 + sum3._1, sum2._2 + sum3._2): adds the first elements and the second elements of the two tuples respectively, e.g. (30,2) and (10,1) => (40,3)
Running result:
(c,10),(c,20),(a,4)
(c,10),(a,7),(b,20),(a,28)
******************************
(b,20.0)
(a,13.0),(c,13.333333333333334)
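The (sum, count) accumulation can be traced with plain Scala collections (a sketch of the semantics, not Spark's implementation; the two partitions are hard-coded to match the glom output above):

```scala
// The two partitions as printed by glom above
val part0 = List(("c", 10), ("c", 20), ("a", 4))
val part1 = List(("c", 10), ("a", 7), ("b", 20), ("a", 28))

// createCombiner: the first value of a key becomes a (sum, count) tuple
def createCombiner(v: Int): (Int, Int) = (v, 1)
// mergeValue: fold another value of the same key into the running (sum, count)
def mergeValue(c: (Int, Int), v: Int): (Int, Int) = (c._1 + v, c._2 + 1)
// mergeCombiners: merge two (sum, count) tuples from different partitions
def mergeCombiners(c1: (Int, Int), c2: (Int, Int)): (Int, Int) = (c1._1 + c2._1, c1._2 + c2._2)

// Within a partition: createCombiner on a key's first value, mergeValue afterwards
def perPartition(part: List[(String, Int)]): Map[String, (Int, Int)] =
  part.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k) match {
      case Some(c) => mergeValue(c, v)
      case None    => createCombiner(v)
    })
  }

// Across partitions: merge the per-partition (sum, count) tuples of each key
val merged: Map[String, (Int, Int)] =
  (perPartition(part0).toList ++ perPartition(part1).toList)
    .groupBy(_._1).map { case (k, cs) => (k, cs.map(_._2).reduce(mergeCombiners)) }

val averages: Map[String, Double] =
  merged.map { case (k, (sum, cnt)) => (k, sum.toDouble / cnt) }

println(averages)  // a -> 13.0, b -> 20.0, c -> 13.333...
```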

Origin blog.csdn.net/changshupx/article/details/108509295