[Original] Uncle Case Studies (5): ID Linking

Many scenarios require ID linking, for example linking the different IDs that belong to the same user.

Abstractly, every record carries one or more kv pairs of the form (id_type, id), and any records that share a kv pair must be merged.

For example:

data1: Array(('type1', 'id1'), ('type2', 'id2'))

data2: Array(('type1', 'id1'), ('type3', 'id3'))

data3: Array(('type2', 'id2'), ('type4', 'id4'))

Here data1 and data2 are linked through ('type1', 'id1'), and data1 and data3 are linked through ('type2', 'id2'), so data1, data2 and data3 are finally merged into a single record:

data_union: Array(('type1', 'id1'), ('type2', 'id2'), ('type3', 'id3'), ('type4', 'id4'))
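
The example can be reproduced on plain Scala collections. This is only a minimal sketch and not the Spark code of this post: the object `LinkExample` and the helper `linkAll` are hypothetical names, and each record is modelled simply as a set of (id_type, id) pairs.

```scala
object LinkExample {
  type Pair = (String, String) // (id_type, id)

  // Hypothetical helper: folds the records in one at a time. Every existing
  // group that shares a pair with the new record is collapsed into one group
  // together with that record, so the groups always stay pairwise disjoint.
  def linkAll(records: List[Set[Pair]]): List[Set[Pair]] =
    records.foldLeft(List.empty[Set[Pair]]) { (groups, rec) =>
      val (hit, miss) = groups.partition(g => g.intersect(rec).nonEmpty)
      hit.foldLeft(rec)((acc, g) => acc.union(g)) :: miss
    }

  val data1: Set[Pair] = Set(("type1", "id1"), ("type2", "id2"))
  val data2: Set[Pair] = Set(("type1", "id1"), ("type3", "id3"))
  val data3: Set[Pair] = Set(("type2", "id2"), ("type4", "id4"))

  // All three records end up in a single group.
  val dataUnion: List[Set[Pair]] = linkAll(List(data1, data2, data3))
}
```

This works for small in-memory data only; the Spark versions in this post address the distributed case.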

Define the base class and methods:

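  // A record; getId returns its ID (stub body)
  class Data {
    def getId : String = ""
  }

  // Merges several linked records into one (stub: returns the first record)
  def merge(dataArr : Array[(Map[Byte, String], Data)]) : (Map[Byte, String], Data) = dataArr.head
  // Generates a unique group ID (stub body)
  def generateUUID : String = ""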

where:

1) Data is the abstraction of a record; every Data has an ID;

2) Map[Byte, String] holds the kv pairs of a record, i.e. Map[id_type, id];

3) merge combines several linked records into one.
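
The stubs leave the body of merge open. One possible shape, as a sketch under assumptions: the id maps are unioned and the Data of the first record is kept. The object name `MergeSketch`, the constructor argument of `Data` and the conflict rule (a later id overwrites an earlier one of the same id_type) are all assumptions made for illustration.

```scala
object MergeSketch {
  // Stand-in for the Data class; the constructor argument is an assumption.
  class Data(id: String) { def getId: String = id }

  // Hypothetical merge: unions all id maps and keeps the first record's Data.
  // If two records carry different ids for the same id_type, the later one wins.
  def merge(dataArr: Array[(Map[Byte, String], Data)]): (Map[Byte, String], Data) = {
    val ids = dataArr.map(_._1).foldLeft(Map.empty[Byte, String])(_ ++ _)
    (ids, dataArr.head._2)
  }
}
```

A real merge would also have to decide how the Data payloads themselves are combined.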

Start with the simplest implementation, which loops over the id types and re-groups the data once per type:

  import org.apache.spark.rdd.RDD

  def unionDataRDD1(rdd : RDD[(Map[Byte, String], Data)]) : RDD[(Map[Byte, String], Data)] = {
    // First merge the records that share the same Data ID
    var result = rdd.keyBy(_._2.getId).groupByKey.map(item => merge(item._2.toArray)).cache
    //Array[id_type]
    val idTypes = result.flatMap(item => item._1.keys).distinct.collect
    // For each id type, merge the records that share an id of that type and keep the others unchanged
    idTypes.foreach(idType => {
      val merged = result.filter(_._1.contains(idType)).keyBy(_._1(idType)).groupByKey.map(item => merge(item._2.toArray))
      result = merged.union(result.filter(!_._1.contains(idType)))
    })
    result
  }

Its performance is poor, because every id type adds another groupByKey over the data. Here is an optimized version that avoids the repeated passes:

  def unionDataRDD2(rdd : RDD[(Map[Byte, String], Data)]) : RDD[(Map[Byte, String], Data)] = {
    val result = rdd.keyBy(_._2.getId).groupByKey.map(item => merge(item._2.toArray)).cache

    //((id_type, id), group): every kv pair is tagged with the group UUID of its record
    val idGroupRDD = result.flatMap(item => {val uuid = generateUUID; item._1.toArray.map(entry => (entry, uuid))}).cache
    //Array(Array(group)): groups that share a kv pair and therefore must be merged
    val unionMap = idGroupRDD.groupByKey.map(_._2.toArray.distinct).filter(_.length > 1).collect
      //Map(group -> union_group)
      .foldLeft(Map[String, String]())((resultUnion, arr) => {
      // groups in arr that already have a union group
      val existingGroupMap = arr.filter(group => resultUnion.contains(group)).map(group => (group, resultUnion(group))).toMap
      // none assigned yet: the first group becomes the union group
      if (existingGroupMap.isEmpty) resultUnion ++ arr.map(group => group -> arr.head).toMap
      // one group already assigned: the others join its union group
      else if (existingGroupMap.size == 1) resultUnion ++ arr.map(group => group -> existingGroupMap.head._2).toMap
      else {
        // several assigned: redirect all of their union groups to a single one
        val target = existingGroupMap.head._2
        val newUnionMap = existingGroupMap.values.map(group => group -> target).toMap
        resultUnion.map(entry => if (newUnionMap.contains(entry._2)) (entry._1, target) else entry) ++ arr.map(group => group -> target).toMap
      }
    })
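
The fold above maintains the Map(group -> union_group) by hand. A union-find (disjoint-set) structure is a common alternative for this step. The sketch below is not the code of this post; `GroupUnion` and `unionGroups` are hypothetical names, and the input is assumed to be the already collected arrays of groups that share a kv pair.

```scala
object GroupUnion {
  // Hypothetical helper: given arrays of group ids that must end up together,
  // returns a map from every group id to the representative of its merged set.
  def unionGroups(linked: Seq[Array[String]]): Map[String, String] = {
    val parent = scala.collection.mutable.Map.empty[String, String]

    // Follows parent links to the root and compresses the path on the way back.
    def find(g: String): String = {
      val p = parent.getOrElseUpdate(g, g)
      if (p == g) g
      else {
        val root = find(p)
        parent(g) = root
        root
      }
    }

    // Attaches the root of b's set to the root of a's set.
    def union(a: String, b: String): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(rb) = ra
    }

    // Every group in an array is united with the first group of that array.
    for (arr <- linked; g <- arr) union(arr.head, g)
    parent.keys.toList.map(g => g -> find(g)).toMap
  }
}
```

For example, `GroupUnion.unionGroups(Seq(Array("g1", "g2"), Array("g3", "g4"), Array("g2", "g3")))` maps all four groups to the same representative, because the third array connects the first two sets.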



Origin www.cnblogs.com/barneywill/p/10987452.html