First, note that all of the byKey operators run on key-value (pair) RDDs.
The aggregateByKey operator is defined as below and takes two functions, seqOp and combOp.
seqOp folds each key's values within a partition, starting from the initial value (zeroValue); combOp then merges the per-partition results for each key.
Note that both functions operate per key.
aggregateByKey[U](zeroValue: U, [partitioner: Partitioner])(seqOp: (U, V) => U, combOp: (U, U) => U)
Example:
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[13] at aggregateByKey at <console>:26
scala> agg.collect()
res7: Array[(Int, Int)] = Array((3,8), (1,7), (2,3))
scala> agg.partitions.size
res8: Int = 3
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),1)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_).collect()
agg: Array[(Int, Int)] = Array((1,4), (3,8), (2,3))
[Diagram of the example]
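To make the 3-partition result above concrete, here is a plain-Scala sketch (no Spark required) of what `aggregateByKey(0)(math.max(_,_), _+_)` computes. The partition boundaries shown are an assumption about how `parallelize` splits the 6-element list evenly across 3 partitions:

```scala
// Assumed even split of List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)) into 3 partitions
val partitions = List(
  List((1, 3), (1, 2)),   // partition 0
  List((1, 4), (2, 3)),   // partition 1
  List((3, 6), (3, 8))    // partition 2
)

val zeroValue = 0
val seqOp: (Int, Int) => Int = math.max    // within a partition: max value per key
val combOp: (Int, Int) => Int = _ + _      // across partitions: sum the partial results

// Step 1: fold each key's values within each partition, starting from zeroValue
val perPartition: List[Map[Int, Int]] = partitions.map { part =>
  part.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp)
  }
}
// perPartition: List(Map(1 -> 3), Map(1 -> 4, 2 -> 3), Map(3 -> 8))

// Step 2: merge the per-partition results per key with combOp
val result: Map[Int, Int] = perPartition.flatten.groupBy(_._1).map {
  case (k, kvs) => k -> kvs.map(_._2).reduce(combOp)
}
// result: Map(1 -> 7, 2 -> 3, 3 -> 8)
```

This is why key 1 yields 7 with 3 partitions (max 3 in one partition, max 4 in another, then 3 + 4 = 7), but 4 with a single partition (only one max, nothing to sum).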
foldByKey is a simplified form of aggregateByKey in which seqOp and combOp are the same function.
foldByKey(0)(_ + _): both seqOp and combOp are the _+_ operation.
Both aggregateByKey and foldByKey first operate within each partition, then combine the results across partitions.
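Because seqOp and combOp are both `_+_`, `foldByKey(0)(_ + _)` simply sums each key's values. A plain-Scala sketch of the equivalent computation on the same data (assuming addition, where zeroValue is a harmless identity):

```scala
val data = List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8))

// foldByKey(0)(_ + _) reduces each key's values with +, starting from 0:
val sums: Map[Int, Int] = data.groupBy(_._1).map {
  case (k, kvs) => k -> kvs.map(_._2).foldLeft(0)(_ + _)
}
// sums: Map(1 -> 9, 2 -> 3, 3 -> 14)
```

On the RDD above, `rdd.foldByKey(0)(_ + _)` and `rdd.aggregateByKey(0)(_ + _, _ + _)` would produce the same per-key sums, regardless of the number of partitions.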