Many scenarios require linking IDs together, for example linking the different IDs that belong to the same user.
Abstractly, each piece of data carries one or more key-value pairs of the form (id_type, id), and any records that share at least one such pair must be merged.
For example:
data1: Array(('type1', 'id1'), ('type2', 'id2'))
data2: Array(('type1', 'id1'), ('type3', 'id3'))
data3: Array(('type2', 'id2'), ('type4', 'id4'))
Here data1 and data2 are linked through ('type1', 'id1'), and data1 and data3 are linked through ('type2', 'id2'), so data1, data2, and data3 are finally merged into a single record:
data_union: Array(('type1', 'id1'), ('type2', 'id2'), ('type3', 'id3'), ('type4', 'id4'))
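To make the expected result concrete, here is a minimal sketch in plain Scala collections (no Spark) that reproduces the example above. The names `IdUnionExample`, `IdPair`, and `unionAll` are illustrative only, and string ID types are used to match the example rather than the `Byte` keys used in the Spark code.

```scala
object IdUnionExample {
  type IdPair = (String, String) // (id_type, id)

  // Fold each record into the accumulated groups: every existing group that
  // shares at least one pair with the record is absorbed into it, so the
  // groups always stay pairwise disjoint.
  def unionAll(records: Seq[Set[IdPair]]): List[Set[IdPair]] =
    records.foldLeft(List.empty[Set[IdPair]]) { (groups, record) =>
      val (overlapping, disjoint) = groups.partition(g => g.exists(record.contains))
      overlapping.foldLeft(record)(_ ++ _) :: disjoint
    }

  val data1: Set[IdPair] = Set(("type1", "id1"), ("type2", "id2"))
  val data2: Set[IdPair] = Set(("type1", "id1"), ("type3", "id3"))
  val data3: Set[IdPair] = Set(("type2", "id2"), ("type4", "id4"))

  // A single group containing all four pairs
  val dataUnion: List[Set[IdPair]] = unionAll(Seq(data1, data2, data3))
}
```

This only works when all records fit in memory on one machine; the Spark versions in this post address the distributed case.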
Define the base class and methods:
class Data { def getId : String = "" }
def merge(dataArr : Array[(Map[Byte, String], Data)]) : (Map[Byte, String], Data) = dataArr.head
def generateUUID : String = ""
where:
1) Data is the abstraction of a piece of data; each instance has an ID.
2) Map[Byte, String] holds the key-value pairs of a record, i.e. Map[id_type, id].
3) merge combines several linked records into a single record.
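The `merge` stub above simply returns the first element. As a hedged sketch of what a real implementation might do, the version below unions the key maps of all records and keeps the first `Data` object. Both choices are assumptions made for illustration: when two records carry different IDs for the same ID type, the later one wins here, and the input array is assumed to be non-empty (which holds for groups produced by `groupByKey`).

```scala
object MergeSketch {
  class Data { def getId: String = "" }

  // Hypothetical merge policy: union every record's (id_type -> id) map and
  // keep the Data of the first record. On a key conflict the later map wins.
  def merge(dataArr: Array[(Map[Byte, String], Data)]): (Map[Byte, String], Data) =
    (dataArr.map(_._1).reduce(_ ++ _), dataArr.head._2)
}
```

A real policy would also need to decide how the `Data` payloads themselves are combined, which depends on the business logic.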
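Start with the simplest implementation, which regroups and merges the data on one ID type at a time:
import org.apache.spark.rdd.RDD

def unionDataRDD1(rdd : RDD[(Map[Byte, String], Data)]) : RDD[(Map[Byte, String], Data)] = {
  // First merge the records that share the same data ID
  var result = rdd.keyBy(_._2.getId).groupByKey.map(item => merge(item._2.toArray)).cache
  // Array[id_type]
  val idTypes = result.flatMap(item => item._1.keys).distinct.collect
  // For each ID type, merge the records that share a value of that type
  // and keep the records without that type unchanged
  idTypes.foreach(idType =>
    result = result.filter(_._1.contains(idType))
      .keyBy(_._1(idType))
      .groupByKey
      .map(group => merge(group._2.toArray))
      .union(result.filter(!_._1.contains(idType))))
  result
}
Its performance is poor, because every ID type triggers another round of grouping. An optimized version that avoids the repeated passes looks like this: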
def unionDataRDD2(rdd : RDD[(Map[Byte, String], Data)]) : RDD[(Map[Byte, String], Data)] = {
  val result = rdd.keyBy(_._2.getId).groupByKey.map(item => merge(item._2.toArray)).cache
  // ((id_type, id), group)
  val idGroupRDD = result.flatMap(item => {
    val uuid = generateUUID
    item._1.toArray.map(entry => (entry, uuid))
  }).cache
  // Array(Array(group))
  val unionMap = idGroupRDD.groupByKey.map(_._2.toArray.distinct).filter(_.length > 1).collect
    // Map(group -> union_group)
    .foldLeft(Map[String, String]())((resultUnion, arr) => {
      val existingGroupMap = arr.collect({case group : String if resultUnion.contains(group) => (group, resultUnion.get(group).get)}).toMap
      if (existingGroupMap == null || existingGroupMap.isEmpty)
        resultUnion ++ arr.collect({case group : String => (group -> arr.head)}).toMap
      else if (existingGroupMap.size == 1)
        resultUnion ++ arr.collect({case group : String => (group -> existingGroupMap.head._2)}).toMap
      else {
        val newUnionMap = existingGroupMap.map(_._2).collect({case group : String => (group -> existingGroupMap.head._2)}).toMap
        resultUnion.collect({case entry : (String, String) => if (newUnionMap.contains(entry._2)) (entry._1, newUnionMap.get(entry._2).get) else entry}) ++ arr.collect({case group : String => (group -> newUnionMap.head._2)}).toMap
      }
    })
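The fold that builds Map(group -> union_group) is in effect computing connected components over group IDs. As an alternative way to read that step, here is a small union-find sketch on plain Scala collections. It is not the code used above: the names `GroupUnionSketch`, `find`, and `unionGroups` are hypothetical, and each input array is assumed to be non-empty, as guaranteed by the `filter(_.length > 1)` in the listing.

```scala
object GroupUnionSketch {
  // Follow parent links until reaching a group that is its own representative.
  // Groups not yet in the map are treated as their own representative.
  @annotation.tailrec
  def find(parent: Map[String, String], group: String): String =
    parent.get(group) match {
      case Some(p) if p != group => find(parent, p)
      case _ => group
    }

  // Each array holds group IDs that share one (id_type, id) pair and must
  // therefore end up with the same representative.
  def unionGroups(linked: Seq[Array[String]]): Map[String, String] = {
    val parent = linked.foldLeft(Map.empty[String, String]) { (acc, arr) =>
      val root = find(acc, arr.head)
      // Point the representative of every group in the array at `root`.
      arr.foldLeft(acc + (root -> root)) { (m, g) => m + (find(m, g) -> root) }
    }
    // Flatten so that every group maps directly to its final representative.
    parent.keys.map(g => g -> find(parent, g)).toMap
  }
}
```

For example, `GroupUnionSketch.unionGroups(Seq(Array("g1", "g2"), Array("g3", "g4"), Array("g2", "g4")))` maps all four groups to the same representative, because the third array links the first two components.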