Spark Code Readability and Performance Optimization - Example 8 (One Business Logic, Multiple Solutions)

1. Preface

  • At the end of Example 7, a requirement was raised: "for every field in a table, count the total number of field values and the number of distinct field values, where the field value must not be empty." If you have read Example 7, you should already have an idea of how to solve it.
  • The purpose of this article is to:
    • Describe the business requirement in detail, to avoid misunderstanding
    • Analyze the problems hidden in the requirement
    • Show a variety of example solutions

2. The requirement

  • There is an existing table tb_express; sample rows are shown below:

    | name     | address              | trade       | fix   | express    |
    |----------|----------------------|-------------|-------|------------|
    | Wang Wu  | Chengdu, Sichuan     | d72_network | ty002 | zto        |
    | null     | Yubei                | z03_locker  | bk213 | sf-express |
    | Li Lei   | Changsha, Hunan      | null        | null  | sf-express |
    | ...      | ...                  | ...         | ...   | ...        |
    | John Doe | Guangzhou, Guangdong | t92_locker  | tu87  | table      |
  • Table description
    • The full table has 50 fields; only 5 are shown here (and, for convenience, the rest of this article uses only these 5 as examples)
    • The full table contains about 1 billion rows
    • Due to data source issues, many field values are null. (According to the latest statistics: the fix field has about 950 million non-null values, while each of the other fields has roughly 200-300 million non-null values)
  • Business requirement:
    • For every field, count the total number of field values and the number of distinct field values, where the field value must not be empty
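  • For example, looking only at the four concrete sample rows above (and ignoring the elided rows): the express field has 4 non-null values but only 3 distinct values (zto, sf-express, table), while the name field has 3 non-null values, all of them distinct.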

3. Analysis

3.1 Problem 1 (low-performance SQL)

  • The requirement itself is actually very simple. The first thing that comes to mind is probably to handle it with SQL, for example:
    SELECT count(name), count(distinct name)
    FROM tb_express
    WHERE name IS NOT NULL AND name != '';
    
  • However, since each field has its own null values, a single SQL statement like this cannot cover all the fields at once; with 50 fields, the query has to be run 50 times (see the sketch below). The downsides are obvious: heavy cluster resource consumption and a long running time.
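  • For illustration, a minimal sketch of what "running the query once per field" looks like when driven from Spark code (the spark session and the loop are assumptions for this sketch; the field list stands in for all 50 real fields):
    val fields = Seq("name", "address", "trade", "fix", "express")
    
    fields.foreach { field =>
      spark.sql(
        s"SELECT count($field) AS totalCount, count(DISTINCT $field) AS distinctCount " +
        s"FROM tb_express WHERE $field IS NOT NULL AND $field != ''"
      ).show()
    }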

3.2 Problem 2 (data skew)

  • So you might try writing Spark code instead. A typical first attempt looks like this:
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    /**
      * Description: counting the total number of field values and the number of distinct field values (flawed example)
      * <br/>
      * Date: 2019/12/2 16:57
      *
      * @author ALion
      */
    object CountDemo {
    
      val expressSchema: StructType = StructType(Array(
        StructField("name", StringType),
        StructField("address", StringType),
        StructField("trade", StringType),
        StructField("fix", StringType),
        StructField("express", StringType)
      ))
    
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("CountDemo")
        val spark = SparkSession.builder()
          .config(conf)
          .getOrCreate()
    
        // Read the Hive table directly; its schema is assumed to match expressSchema above
        val expressDF = spark.table("tb_express")
    
        val resultRDD = expressDF.rdd
    //      .flatMap { row =>
    //        val name = row.get(row.fieldIndex("name"))
    //        val address = row.get(row.fieldIndex("address"))
    //        val trade = row.get(row.fieldIndex("trade"))
    //        val fix = row.get(row.fieldIndex("fix"))
    //        val express = row.get(row.fieldIndex("express"))
    //
    //        val buffer = ArrayBuffer[(String, String)]()
    //        // Filter out null values
    //        // Use the field name as the key and the field value as the value
    //        if (name != null) buffer.append(("name", name.toString))
    //        if (address != null) buffer.append(("address", address.toString))
    //        if (trade != null) buffer.append(("trade", trade.toString))
    //        if (fix != null) buffer.append(("fix", fix.toString))
    //        if (express != null) buffer.append(("express", express.toString))
    //
    //        buffer
    //      }
          .flatMap { row =>
            // A more functional and more concise version
            Array("name", "address", "trade", "fix", "express")
              .flatMap { name =>
                Option(row.get(row.fieldIndex(name))) match {
                  case Some(v) => Some((name, v.toString))
                  case None => None
                }
              }
          }.groupByKey()
          .mapValues { iter => (iter.size, iter.toSet.size) } // compute the total count and the distinct count of field values here
    
        // Collect the data and print the results
        resultRDD.collect()
          .foreach { case (fieldName, (count, distinctCount)) =>
            println(s"field = $fieldName, total count = $count, distinct count = $distinctCount")
          }
    
        spark.stop()
      }
    
    }
    
  • When writing code, this is usually the first approach that comes to mind: group by field name and then compute the statistics over each field's values.
  • However, the groupByKey in this code causes data skew. As mentioned earlier, "the fix field has about 950 million non-null values, while each of the other fields has only 200-300 million non-null values." With the field name as the key, groupByKey therefore sends far more data to one shuffle node than to the others.

3.3 Problem 3 (data skew within the skewed data)

  • At this point you might think: 'What if I add the field value to the key along with the field name, so that the data no longer skews?' An example follows:
        val resultRDD = expressDF.rdd
          .flatMap { row =>
            // A more functional and more concise version
            Array("name", "address", "trade", "fix", "express")
              .flatMap { name =>
                Option(row.get(row.fieldIndex(name))) match {
                  case Some(v) => Some(((name, v.toString), 1))
                  case None => None
                }
              }
          }.groupByKey()
          .map { case ((name, value), iter) => (name, (value, iter.size)) }
          .groupByKey() // Although the second groupByKey still uses the field name as the key, the data volume is now very small, so it finishes quickly
          .map {case (name, iter) =>
            // (field name, total count of values, distinct count of values)
            (name, iter.map(_._2).sum, iter.size)
          }
    
  • However, the earlier description hid a detail: "the fix field has about 950 million non-null values, of which 80 percent are the same value, and the business requirement does not allow excluding that value." So once the job runs, the data still skews and gets stuck on a single task. (By now you have probably thought of solving this with reduceByKey.)
  • In addition, you could use a number (Int) in place of the field name (String) to reduce memory consumption; a small sketch of the idea follows. For readability, the later examples keep using the field names.
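  • A minimal sketch of the Int-for-field-name idea (illustrative only, not part of the original code):
    // Map each field name to a small Int id and aggregate on the id instead of the String
    val fieldNames = Array("name", "address", "trade", "fix", "express")
    val fieldIdOf: Map[String, Int] = fieldNames.zipWithIndex.toMap // e.g. "fix" -> 3
    
    // Emit (fieldIdOf(name), value) pairs when building the key-value pairs, and translate
    // the id back via fieldNames(id) only when printing the small final result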

4. A variety of example solutions

  • From the analysis above it is clear that the problem is data skew, and we need a way around it. The standard solution is reduceByKey, but there are other approaches as well; several examples are given below for reference.

4.1 Add a random number to the key to solve the data skew

import scala.util.Random

val resultRDD = expressDF.rdd
  .flatMap { row =>
    val random = new Random()
    // A more functional and more concise version
    Array("name", "address", "trade", "fix", "express")
      .flatMap { name =>
        Option(row.get(row.fieldIndex(name))) match {
          // A random range of 100 eventually splits each field's data into 100 buckets; tune this value to the number of nodes in the cluster for better results
          case Some(v) => Some((name + "_" + random.nextInt(100), v.toString))
          case None => None
        }
      }
  }.groupByKey()
  .map { case (k, v) =>
    // Finish this round of aggregation and strip off the random suffix
    (k.split("_")(0), (v.size, v.toSet))
  }.groupByKey() // You could also use reduceByKey here, but at this point it makes almost no difference to efficiency
    .mapValues { iter =>
      // (total count of values, distinct count of values)
      (iter.map(_._1).sum, iter.map(_._2).reduce(_ ++ _).size)
    }

4.2 Use reduceByKey: change the key's structure and adjust the subsequent processing

val resultRDD = expressDF.rdd
  .flatMap { row =>
    // A more functional and more concise version
    Array("name", "address", "trade", "fix", "express")
      .flatMap { name =>
        Option(row.get(row.fieldIndex(name))) match {
          case Some(v) => Some(((name, v.toString), 1))
          case None => None
        }
      }
  }.reduceByKey(_ + _) // Replacing the groupByKey from the last example in the analysis with reduceByKey solves the skew
  .map { case ((name, value), count) => (name, (value, count)) }
  // Although the second groupByKey still uses the field name as the key,
  // the data volume is now very small, so it finishes quickly.
  // Of course, you could also use reduceByKey here.
  .groupByKey()  
  .map { case (name, iter) =>
    // (field name, total count of values, distinct count of values)
    (name, iter.map(_._2).sum, iter.size)
  }
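  • As in the flawed example in section 3.2, the final result here is tiny (one entry per field), so it can simply be collected and printed on the driver. A minimal sketch:
    resultRDD.collect()
      .foreach { case (fieldName, count, distinctCount) =>
        println(s"field = $fieldName, total count = $count, distinct count = $distinctCount")
      }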

4.3 Keep the key structure unchanged and write an aggregator

  • Aggregator class CountAggregator
    import scala.collection.mutable

    // Must be serializable so that Spark can ship aggregator instances between executors
    class CountAggregator(var count: Int, var countSet: mutable.HashSet[String]) extends Serializable {
      
      def +=(element: (Int, String)) : CountAggregator = {
        this.count += element._1
        this.countSet += element._2
    
        this
      }
      
      def ++=(that: CountAggregator): CountAggregator = {
        this.count += that.count
        this.countSet ++= that.countSet
    
        this
      }
    
    }
    
    object CountAggregator {
    
      def apply(): CountAggregator =
        new CountAggregator(0, mutable.HashSet[String]())
    
    }
    
  • Main Spark code
    val resultRDD = expressDF.rdd
      .flatMap { row =>
        // A more functional and more concise version
        Array("name", "address", "trade", "fix", "express")
          .flatMap { name =>
            Option(row.get(row.fieldIndex(name))) match {
              case Some(v) => Some((name, (1, v.toString)))
              case None => None
            }
          }
      }.aggregateByKey(CountAggregator())(
        (agg, v) => agg += v,
        (agg1, agg2) => agg1 ++= agg2
      ).mapValues { aggregator =>
        // (total count of values, distinct count of values)
        (aggregator.count, aggregator.countSet.size)
      }
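  • Unlike groupByKey, aggregateByKey combines values on the map side first, so what gets shuffled for each field is roughly one partially filled CountAggregator per partition rather than every single (1, value) pair, which keeps the shuffle volume for the fix field far smaller even though the key is still just the field name.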
    

4.4 Keep the key structure unchanged, store a Map<field value, count> as the value

 val resultRDD = expressDF.rdd
   .flatMap { row =>
     // A more functional and more concise version
     Array("name", "address", "trade", "fix", "express")
       .flatMap { name =>
         Option(row.get(row.fieldIndex(name))) match {
           case Some(v) => Some((name, (v.toString, 1)))
           case None => None
         }
       }
   }.aggregateByKey(mutable.HashMap[String, Int]())(
     (map, kv) => map += (kv._1 -> (map.getOrElse(kv._1, 0) + kv._2)),
     (map1, map2) => {
       for ((k, v) <- map2) {
         map1 += (k -> (map1.getOrElse(k, 0) + v))
       }
       map1
     }
   ).mapValues { map =>
     // (total count of values, distinct count of values)
     (map.values.sum, map.keySet.size)
   }

4.5 Keep the key structure unchanged, store a (Set<field value>, count) pair as the value

 val resultRDD = expressDF.rdd
   .flatMap { row =>
     // A more functional and more concise version
     Array("name", "address", "trade", "fix", "express")
       .flatMap { name =>
         Option(row.get(row.fieldIndex(name))) match {
           case Some(v) => Some((name, (v.toString, 1)))
           case None => None
         }
       }
   }.aggregateByKey((mutable.HashSet[String](), 0))(
     (agg, v) => (agg._1 += v._1, agg._2 + v._2),
     (agg1, agg2) => (agg1._1 ++= agg2._1, agg1._2 + agg2._2)
   )
   .mapValues { case (set, count) =>
     // (total count of values, distinct count of values)
     (count, set.size)
   }

4.6 Another approach: treeAggregate

  • Sample code is shown below (for this particular business logic, treeAggregate is not recommended)
  • Aggregator
    import scala.collection.mutable
    
    case class Counter(var count: Int, set: mutable.HashSet[String])
    
    // Must be serializable so that Spark can ship aggregator instances between executors
    class CountAggregator2(var counter1: Counter,
                           var counter2: Counter,
                           var counter3: Counter,
                           var counter4: Counter,
                           var counter5: Counter) extends Serializable {
    
      def +=(element: (Any, Any, Any, Any, Any)): CountAggregator2 = {
        def countFunc(counter: Counter, e: Any): Unit = {
          if (e != null) {
            counter.count += 1
            counter.set += e.toString
          }
        }
    
        countFunc(counter1, element._1)
        countFunc(counter2, element._2)
        countFunc(counter3, element._3)
        countFunc(counter4, element._4)
        countFunc(counter5, element._5)
    
        this
      }
    
      def ++=(that: CountAggregator2): CountAggregator2 = {
        this.counter1.count += that.counter1.count
        this.counter1.set ++= that.counter1.set
        this.counter2.count += that.counter2.count
        this.counter2.set ++= that.counter2.set
        this.counter3.count += that.counter3.count
        this.counter3.set ++= that.counter3.set
        this.counter4.count += that.counter4.count
        this.counter4.set ++= that.counter4.set
        this.counter5.count += that.counter5.count
        this.counter5.set ++= that.counter5.set
    
        this
      }
    
    }
    
    
    object CountAggregator2 {
    
      def apply(): CountAggregator2 =
        new CountAggregator2(
          Counter(0, mutable.HashSet[String]()),
          Counter(0, mutable.HashSet[String]()),
          Counter(0, mutable.HashSet[String]()),
          Counter(0, mutable.HashSet[String]()),
          Counter(0, mutable.HashSet[String]())
        )
    
    }
    
  • Main Spark code
     val result = expressDF.rdd
       .map { row =>
         val rowAny = (name: String) => row.get(row.fieldIndex(name))
         
         (rowAny("name"), rowAny("address"), rowAny("trade"), rowAny("fix"), rowAny("express"))
       }
       // Adjust the depth of treeAggregate according to the business and the data volume; the default depth is 2
       .treeAggregate(CountAggregator2())(
         (agg, v) => agg += v,
         (agg1, agg2) => agg1 ++= agg2
       )
    
  • Advantages
    • treeAggregate computes in parallel: the reduce is performed on each node first, and the partial results are only merged onto a single node at the end
    • Unlike the previous examples, it does not generate a field-name key for every field of every row (roughly rows × fields of them) to aggregate on, which reduces the memory footprint
  • Shortcomings
    • The final reduced result needs to be small, for example a few basic values (a maximum, a count) or a small collection (a top 10 after sorting)
    • In this example, each field ends up with a deduplication Set after aggregation; if a Set holds a large amount of data, memory usage grows and may bring down the driver

Origin blog.csdn.net/alionsss/article/details/103349983