Key-value pair operators
combineByKey()
//Parameters:
//createCombiner: if key K has no combiner yet, create one from its value V; if it already has one, mergeValue is called instead
//mergeValue: within a partition, folds each further V of the same K into that key's combiner
//mergeCombiners: across partitions, merges the combiners of the same K
def combineByKey[C](createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C) : RDD[(K, C)] = {
combineByKey(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
}
//numSplits: the number of partitions of the RDD produced after aggregation
def combineByKey[C](createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numSplits: Int): RDD[(K, C)] = {
combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numSplits))
}
def combineByKey[C](createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner): RDD[(K, C)] = {
val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
new ShuffledRDD(self, aggregator, partitioner)
}
As you can see, combineByKey() has three overloads; the first two ultimately call the third, which creates a ShuffledRDD object. The Partitioner defaults to HashPartitioner; for the partitioning source code see "Spark Source Code, Part 1: RDD".
mergeValue and mergeCombiners are quite similar to the two functions passed to aggregate(): one processes the data inside a single partition, the other merges the per-partition results (for aggregate() see "RDD Operators, Part 1: actions that return a single result"). Most of Spark's aggregation operators call combineByKey(), so let's first look at how other operators define createCombiner, mergeValue and mergeCombiners; this makes combineByKey() easier to understand.
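To see how the three functions cooperate, here is a minimal plain-Scala sketch (no Spark; the two "partitions" and the (count, sum) combiner are assumptions for illustration) that simulates combineByKey's two phases:

```scala
// Plain-Scala simulation of combineByKey's three functions (assumed data, no Spark).
// Combiner type C = (count, sum) per key.
val partitions = Seq(
  Seq(("tom", 1), ("tom", 3)),    // hypothetical partition 1
  Seq(("jerry", 1), ("tom", 5))   // hypothetical partition 2
)

val createCombiner: Int => (Int, Int) = v => (1, v)        // first value of a key in a partition
val mergeValue: ((Int, Int), Int) => (Int, Int) =
  (c, v) => (c._1 + 1, c._2 + v)                           // fold another value into the combiner
val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
  (a, b) => (a._1 + b._1, a._2 + b._2)                     // merge per-partition combiners

// Phase 1: aggregate inside each partition with createCombiner/mergeValue.
val perPartition = partitions.map { part =>
  part.foldLeft(Map.empty[String, (Int, Int)]) { case (m, (k, v)) =>
    m.updated(k, m.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
  }
}
// Phase 2: merge the per-partition combiners with mergeCombiners.
val result = perPartition.flatten.groupMapReduce(_._1)(_._2)(mergeCombiners)
println(result)  // Map(tom -> (3,9), jerry -> (1,1))
```

createCombiner runs only the first time a key is seen within a partition; every later value of that key goes through mergeValue, and mergeCombiners only ever sees already-built combiners.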
reduceByKey()
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
reduceByKey(defaultPartitioner(self), func)
}
def reduceByKey(func: (V, V) => V, numSplits: Int): RDD[(K, V)] = {
//the new RDD's partition count changes
reduceByKey(new HashPartitioner(numSplits), func)
}
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
combineByKey[V]((v: V) => v, func, func, partitioner)
}
As you can see, reduceByKey() also has three overloads: one takes just a function, one takes a function plus the new RDD's partition count, and one takes a Partitioner plus a function. All of them end up calling combineByKey(), with both mergeValue and mergeCombiners set to the passed-in function.
val w=sc.parallelize(List(("tom",1),("tom",3),("jerry",1),("jerry",7),("tom",5)),3)
val s=w.reduceByKey(_*_)//multiply the values of identical keys
println(s.partitions.length)//print the new RDD's partition count
s.foreach(println)//print the new RDD's elements
Output: 3 (jerry,7) (tom,15)
Computation process, assuming the three partitions hold:
Partition 1: ("tom",1),("tom",3) — a combiner (tom,1) is created for tom; since tom then already has one, the next value gives (tom,1*3), so this partition yields (tom,3)
Partition 2: ("jerry",1),("jerry",7) — same as above, yielding (jerry,7)
Partition 3: ("tom",5) — yields (tom,5)
Finally the per-partition results (tom,3),(jerry,7),(tom,5) are merged, giving (tom,3*5) and (jerry,7), i.e. the final result (tom,15),(jerry,7)
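The walkthrough above can be reproduced with plain Scala collections (a sketch, not Spark; the partition contents are the ones assumed above — for reduceByKey both phases use the same function):

```scala
// Simulate reduceByKey(_ * _) over the three assumed partitions.
val parts = Seq(
  Seq(("tom", 1), ("tom", 3)),
  Seq(("jerry", 1), ("jerry", 7)),
  Seq(("tom", 5))
)
val mul: (Int, Int) => Int = _ * _
// Phase 1: per-partition aggregation (mergeValue); Phase 2: cross-partition merge (mergeCombiners).
val perPart = parts.map(_.groupMapReduce(_._1)(_._2)(mul))
val result  = perPart.flatten.groupMapReduce(_._1)(_._2)(mul)
println(result)  // Map(tom -> 15, jerry -> 7)
```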
Now let's implement the same thing with combineByKey():
val w=sc.parallelize(List(("tom",1),("tom",3),("jerry",1),("jerry",7),("tom",5)),3)
w.combineByKey(
x => (1, x),
(c1: (Int,Int), y) => (c1._1 + 1, c1._2*y),//how values are combined within a partition
(c1: (Int,Int), c2: (Int,Int)) => (c1._1 + c2._1, c1._2 * c2._2)//how per-partition results are combined
).foreach(println)
Output: (jerry,(2,7)) (tom,(3,15))
The difference between combineByKey() and reduceByKey() is similar to that between aggregate() and fold(): fold() and reduceByKey() take a single function that is applied both to the data within each partition and to the per-partition results, while aggregate() and combineByKey() take separate functions, one for the data within a partition and another for the per-partition results.
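One practical consequence of this difference: because the combiner type C can differ from the value type V, combineByKey can compute things like a per-key average via a (count, sum) pair, which reduceByKey's single (V, V) => V function cannot express. A plain-collections sketch (example data assumed):

```scala
// Per-key average needs combiner type (count, sum) != value type Int,
// i.e. the combineByKey shape rather than the reduceByKey shape.
val data = Seq(("tom", 1), ("tom", 3), ("jerry", 7))
val combined = data.groupMapReduce(_._1)(v => (1, v._2)) {
  (a, b) => (a._1 + b._1, a._2 + b._2)   // plays the role of mergeCombiners
}
val averages = combined.map { case (k, (n, s)) => k -> s.toDouble / n }
println(averages)  // Map(tom -> 2.0, jerry -> 7.0)
```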
groupBy()&&groupByKey()
groupBy()
def groupBy[K: ClassManifest](f: T => K): RDD[(K, Seq[T])] = groupBy[K](f, sc.defaultParallelism)
//parameters: a function + the new RDD's partition count
def groupBy[K: ClassManifest](f: T => K, numSplits: Int): RDD[(K, Seq[T])] = {
val cleanF = sc.clean(f)
this.map(t => (cleanF(t), t)).groupByKey(numSplits)
}
As you can see, groupBy() uses the result of applying the function to each element as the key and the element itself as the value, then calls groupByKey().
val w=sc.parallelize(0 to 9,3)
w.groupBy(x=>x%3).foreach(println)
Output: (0,CompactBuffer(0, 3, 6, 9)) (1,CompactBuffer(1, 4, 7)) (2,CompactBuffer(2, 5, 8))
Elements with the same remainder when divided by 3 end up in the same collection.
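The standard Scala collections' groupBy behaves the same way, which makes the semantics easy to check locally without Spark:

```scala
// Plain-collections analogue of the RDD example above.
val grouped = (0 to 9).groupBy(_ % 3)
println(grouped)  // keys 0, 1, 2 each map to the elements with that remainder
```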
groupByKey()
def groupByKey(): RDD[(K, Seq[V])] = {
groupByKey(defaultPartitioner(self))
}
def groupByKey(numSplits: Int): RDD[(K, Seq[V])] = {
groupByKey(new HashPartitioner(numSplits))
}
def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
def createCombiner(v: V) = ArrayBuffer(v)
def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v//append the value to the growable array
def mergeCombiners(b1: ArrayBuffer[V], b2: ArrayBuffer[V]) = b1 ++= b2//concatenate the two arrays
val bufs = combineByKey[ArrayBuffer[V]](
createCombiner _, mergeValue _, mergeCombiners _, partitioner)
bufs.asInstanceOf[RDD[(K, Seq[V])]]
}
As you can see, it ultimately calls combineByKey(): within each partition the values of the same key are appended to an array, and finally the arrays of the same key across partitions are concatenated into one.
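The ArrayBuffer-based combiner logic can be sketched in plain Scala (no Spark; the two-partition layout is assumed for illustration):

```scala
import scala.collection.mutable.ArrayBuffer

// Simulate groupByKey via combineByKey's ArrayBuffer combiners (assumed 2 partitions).
val parts = Seq(
  Seq(("tom", 1), ("tom", 3), ("jerry", 1)),
  Seq(("jerry", 7), ("tom", 5))
)
// Phase 1: per partition, append each value to its key's buffer (createCombiner/mergeValue).
val perPart = parts.map { p =>
  p.foldLeft(Map.empty[String, ArrayBuffer[Int]]) { case (m, (k, v)) =>
    m.updated(k, m.getOrElse(k, ArrayBuffer.empty[Int]) += v)
  }
}
// Phase 2: concatenate buffers of the same key across partitions (mergeCombiners).
val result = perPart.flatten.groupMapReduce(_._1)(_._2)(_ ++= _)
println(result)  // tom -> ArrayBuffer(1, 3, 5), jerry -> ArrayBuffer(1, 7)
```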
val w=sc.parallelize(List(("tom",1),("tom",3),("jerry",1),("jerry",7),("tom",5)),2)
w.groupByKey().foreach(println)
println(w.groupByKey(5).partitions.length)//change the partition count
Output: (tom,CompactBuffer(1, 3, 5)) (jerry,CompactBuffer(1, 7)) 5 (the partition count)
That covers the operators built on combineByKey(); now let's return to combineByKey() itself:
def combineByKey[C](createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner): RDD[(K, C)] = {
val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
new ShuffledRDD(self, aggregator, partitioner)
}
It creates an Aggregator object and then a ShuffledRDD object.
The ShuffledRDD class
class ShuffledRDD[K, V, C](
parent: RDD[(K, V)],
aggregator: Aggregator[K, V, C],
part : Partitioner)
extends RDD[(K, C)](parent.context) {
override val partitioner = Some(part)
@transient
val splits_ = Array.tabulate[Split](part.numPartitions)(i => new ShuffledRDDSplit(i))
//an array whose length is the partition count, holding each split
override def splits = splits_
override def preferredLocations(split: Split) = Nil
//create the shuffle dependency and register a new shuffleId
val dep = new ShuffleDependency(context.newShuffleId, parent, aggregator, part)
override val dependencies = List(dep)
override def compute(split: Split): Iterator[(K, C)] = {
val combiners = new JHashMap[K, C]//map storing each key and its aggregated value
def mergePair(k: K, c: C) {
val oldC = combiners.get(k)
if (oldC == null) {//check whether this key has been seen before
combiners.put(k, c)//if not, put (k, c) into the map
} else {
//otherwise, merge this combiner with the one accumulated so far
combiners.put(k, aggregator.mergeCombiners(oldC, c))
}
}
val fetcher = SparkEnv.get.shuffleFetcher//fetch this split's shuffled data
fetcher.fetch[K, C](dep.shuffleId, split.index, mergePair)
return new Iterator[(K, C)] {
var iter = combiners.entrySet().iterator()
def hasNext(): Boolean = iter.hasNext()
def next(): (K, C) = {
val entry = iter.next()
(entry.getKey, entry.getValue)
}
}
}
}
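The merging that compute() performs can be sketched with a mutable map in plain Scala (no shuffle fetcher; the fetched (key, combiner) pairs and the multiply combiner are assumptions matching the earlier reduceByKey(_*_) example):

```scala
import scala.collection.mutable

// Sketch of ShuffledRDD.compute's mergePair: fold fetched (K, C) pairs into one map.
// Assumed fetched data: per-partition combiners arriving from the map side.
val fetched = Seq(("tom", 3), ("jerry", 7), ("tom", 5))
val mergeCombiners: (Int, Int) => Int = _ * _   // e.g. reduceByKey(_ * _)

val combiners = mutable.HashMap.empty[String, Int]
def mergePair(k: String, c: Int): Unit =
  combiners.get(k) match {
    case None      => combiners(k) = c                       // first combiner seen for k
    case Some(old) => combiners(k) = mergeCombiners(old, c)  // merge with the accumulated one
  }
fetched.foreach { case (k, c) => mergePair(k, c) }
println(combiners)  // tom -> 15, jerry -> 7
```

This mirrors the reduce side of the shuffle: the fetcher pushes each incoming pair through mergePair, and the final iterator simply walks the resulting map.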