Copyright notice: this is an original article by the author; reproduction without permission is prohibited. https://blog.csdn.net/sdut406/article/details/88673241
An explanation of JavaPairRDD's combineByKey method
Official documentation
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns a JavaPairRDD[(K, V)] into a result of type JavaPairRDD[(K, C)], for a
* "combined type" C.
*
* Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, the serializer that is used
* for the shuffle, and whether to perform map-side aggregation (if a mapper can produce multiple
* items with the same key).
*
* @note V and C can be different -- for example, one might group an RDD of type (Int, Int) into
* an RDD of type (Int, List[Int]).
*/
Explanation
A generic function that combines the elements for each key using a custom set of aggregation functions. It turns a JavaPairRDD[(K, V)] into a JavaPairRDD[(K, C)], where C is the "combined type".
createCombiner: V => C — takes the first value seen for a key, optionally transforms it (e.g. a type conversion), and returns it as the initial combiner (this step is akin to initialization).
mergeValue: (C, V) => C — merges a value V into an existing combiner C (the one produced by createCombiner); this happens within each partition.
mergeCombiners: (C, C) => C — merges two combiners C into one; this happens across partitions.
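The interplay of the three functions can be seen without Spark. The sketch below is a plain-Java, in-memory stand-in (the `combineByKey` helper and its `partitions` parameter are hypothetical, not Spark API): within each "partition" the first value for a key goes through createCombiner and later values through mergeValue, then per-partition results are merged with mergeCombiners.

```java
import java.util.*;
import java.util.function.*;

public class CombineSketch {
    // Spark-free sketch of combineByKey's semantics; "partitions" is a hypothetical
    // stand-in for an RDD's partitions. The three functions play the same roles as
    // in the Spark API.
    public static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        List<Map<K, C>> perPartition = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> combined = new HashMap<>();  // map-side aggregation, per partition
            for (Map.Entry<K, V> e : partition) {
                K k = e.getKey();
                V v = e.getValue();
                if (combined.containsKey(k)) {
                    combined.put(k, mergeValue.apply(combined.get(k), v)); // merge V into existing C
                } else {
                    combined.put(k, createCombiner.apply(v));              // first V for this key
                }
            }
            perPartition.add(combined);
        }
        Map<K, C> result = new HashMap<>();
        for (Map<K, C> part : perPartition) {
            for (Map.Entry<K, C> e : part.entrySet()) {
                result.merge(e.getKey(), e.getValue(), mergeCombiners);    // merge C's across partitions
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data and functions as the first example below: string concatenation, 2 partitions.
        Map<String, String> r = combineByKey(
                List.of(
                        List.of(Map.entry("cat", "11"), Map.entry("dog", "22"), Map.entry("cat", "33")),
                        List.of(Map.entry("pig", "44"), Map.entry("duck", "55"), Map.entry("cat", "66"))),
                v -> v, (c, v) -> c + v, (c1, c2) -> c1 + c2);
        System.out.println(r.get("cat")); // "11"+"33" within partition 0, then +"66" across partitions -> 113366
    }
}
```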
Method signatures
//scala
// Simplified version of combineByKey that hash-partitions the resulting RDD using the existing partitioner/parallelism level, with map-side aggregation.
def combineByKey[C](createCombiner: Function[V, C], mergeValue: Function2[C, V, C], mergeCombiners: Function2[C, C, C]): JavaPairRDD[K, C]
// Simplified version of combineByKey that hash-partitions the output RDD into numPartitions partitions, with map-side aggregation.
def combineByKey[C](createCombiner: Function[V, C], mergeValue: Function2[C, V, C], mergeCombiners: Function2[C, C, C], numPartitions: Int): JavaPairRDD[K, C]
// Generic function to combine the elements for each key using a custom set of aggregation functions.
def combineByKey[C](createCombiner: Function[V, C], mergeValue: Function2[C, V, C], mergeCombiners: Function2[C, C, C], partitioner: Partitioner): JavaPairRDD[K, C]
def combineByKey[C](createCombiner: Function[V, C], mergeValue: Function2[C, V, C], mergeCombiners: Function2[C, C, C], partitioner: Partitioner, mapSideCombine: Boolean, serializer: Serializer): JavaPairRDD[K, C]
//java
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners)
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners,
int numPartitions)
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners,
Partitioner partitioner)
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners,
Partitioner partitioner,
boolean mapSideCombine,
Serializer serializer)
Notes
The first two overloads are implemented in terms of the last one, using a HashPartitioner and a null Serializer; with the Partitioner overloads we can control the partitioning ourselves and supply a Serializer if needed. combineByKey is an important function: familiar operations such as aggregateByKey, foldByKey, and reduceByKey are all implemented on top of it. By default, combining is performed on the map side.
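To see why reduceByKey is a special case, note that reduceByKey(f) amounts to combineByKey with the identity as createCombiner and f as both mergeValue and mergeCombiners (so C == V). A minimal Spark-free illustration in plain Java (the `reduceByKey` helper here is hypothetical, shown with a single partition):

```java
import java.util.*;
import java.util.function.*;

public class ReduceViaCombine {
    // reduceByKey(f) == combineByKey(v -> v, f, f): when C == V, createCombiner is
    // the identity and both merge steps just apply f.
    public static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> f) {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, V> e : pairs) {
            // Map.merge keeps the first value as-is (identity createCombiner)
            // and applies f for every later value with the same key.
            out.merge(e.getKey(), e.getValue(), f);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> sums = reduceByKey(
                List.of(Map.entry("cat", 11), Map.entry("cat", 33), Map.entry("dog", 22)),
                Integer::sum);
        System.out.println(sums.get("cat")); // 11 + 33 = 44
    }
}
```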
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class CombineByKey {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // Example 1: trace the three functions (2 partitions)
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "33"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "66")), 2);
        JavaPairRDD<String, String> javaPairRDD = javaPairRDD1.combineByKey(new Function<String, String>() {
            public String call(String s) throws Exception {
                System.out.println("createCombiner :" + s);
                return s;
            }
        }, new Function2<String, String, String>() {
            public String call(String s, String s2) throws Exception {
                System.out.println("mergeValue :" + s + "-->" + s2);
                return s + s2;
            }
        }, new Function2<String, String, String>() {
            public String call(String s, String s2) throws Exception {
                System.out.println("mergeCombiners :" + s + "-->" + s2);
                return s + s2;
            }
        });
        // Print the resulting RDD
        javaPairRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
            public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
                System.out.println(stringStringTuple2);
            }
        });
        // Example 2: per-key average
        JavaPairRDD<String, Integer> javaPairRDD2 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, Integer>("cat", 11), new Tuple2<String, Integer>("dog", 22),
                new Tuple2<String, Integer>("cat", 12), new Tuple2<String, Integer>("duck", 22),
                new Tuple2<String, Integer>("cat", 33), new Tuple2<String, Integer>("pig", 44),
                new Tuple2<String, Integer>("duck", 55), new Tuple2<String, Integer>("dog", 66)), 2);
        // Build the intermediate (count, sum) combiners
        JavaPairRDD<String, Tuple2<Integer, Integer>> pairRDD = javaPairRDD2.combineByKey(new Function<Integer, Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
                return new Tuple2<Integer, Integer>(1, integer);
            }
        }, new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> tup1, Integer integer) throws Exception {
                return new Tuple2<Integer, Integer>(tup1._1 + 1, tup1._2 + integer);
            }
        }, new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> tup1, Tuple2<Integer, Integer> tup2) throws Exception {
                return new Tuple2<Integer, Integer>(tup1._1 + tup2._1, tup1._2 + tup2._2);
            }
        });
        // Print the intermediate (count, sum) results
        pairRDD.foreach(new VoidFunction<Tuple2<String, Tuple2<Integer, Integer>>>() {
            public void call(Tuple2<String, Tuple2<Integer, Integer>> stringTuple2Tuple2) throws Exception {
                System.out.println(stringTuple2Tuple2);
            }
        });
        // Compute the average as sum / count
        JavaPairRDD<String, Double> avgRDD = pairRDD.mapToPair(new PairFunction<Tuple2<String, Tuple2<Integer, Integer>>, String, Double>() {
            public Tuple2<String, Double> call(Tuple2<String, Tuple2<Integer, Integer>> stt) throws Exception {
                String key = stt._1;
                double avg = stt._2._2 * 1.0 / stt._2._1;
                System.out.println("key:" + key + ",value:" + avg);
                return new Tuple2<String, Double>(key, avg);
            }
        });
        avgRDD.foreach(new VoidFunction<Tuple2<String, Double>>() {
            public void call(Tuple2<String, Double> stringDoubleTuple2) throws Exception {
                System.out.println(stringDoubleTuple2);
            }
        });
    }
}
Output
// createCombiner calls
createCombiner :11
createCombiner :22
createCombiner :44
createCombiner :55
createCombiner :66
// mergeValue call
mergeValue :11-->33
// mergeCombiners call
mergeCombiners :1133-->66
// final result of example 1
(dog,22)
(pig,44)
(cat,113366)
(duck,55)
// average example
// intermediate (count, sum) results
(dog,(2,88))
(pig,(1,44))
(cat,(3,56))
(duck,(2,77))
// average computation
key:dog,value:44.0
key:pig,value:44.0
key:cat,value:18.666666666666668
key:duck,value:38.5
// final averages
(dog,44.0)
(pig,44.0)
(cat,18.666666666666668)
(duck,38.5)
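The per-key averages above can be double-checked without Spark. The sketch below (a hypothetical `averages` helper in plain Java, with `int[]{count, sum}` standing in for the `Tuple2<Integer, Integer>` combiner) replays the same accumulation over the sample data:

```java
import java.util.*;

public class AverageCheck {
    // Replays the example's combiner logic in memory: the combiner for each key
    // is int[]{count, sum}, and the final average is sum / count.
    public static Map<String, Double> averages(List<Map.Entry<String, Integer>> pairs) {
        Map<String, int[]> acc = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            // createCombiner: (1, v) for the first value; mergeValue: (count+1, sum+v) afterwards.
            // With a single in-memory "partition", mergeCombiners never fires.
            int[] c = acc.computeIfAbsent(e.getKey(), k -> new int[2]);
            c[0] += 1;
            c[1] += e.getValue();
        }
        Map<String, Double> avg = new HashMap<>();
        for (Map.Entry<String, int[]> e : acc.entrySet()) {
            avg.put(e.getKey(), e.getValue()[1] * 1.0 / e.getValue()[0]); // sum / count
        }
        return avg;
    }

    public static void main(String[] args) {
        Map<String, Double> avg = averages(List.of(
                Map.entry("cat", 11), Map.entry("dog", 22), Map.entry("cat", 12), Map.entry("duck", 22),
                Map.entry("cat", 33), Map.entry("pig", 44), Map.entry("duck", 55), Map.entry("dog", 66)));
        System.out.println(avg.get("dog"));  // (22 + 66) / 2 = 44.0
    }
}
```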