Accumulator analysis in production Spark

Symptom

val acc = sc.accumulator(0, "Error Accumulator")
val data = sc.parallelize(1 to 10)
val newData = data.map { x =>
  if (x % 2 == 0) {
    acc += 1 // one increment per even element (5 of them)
  }
}
newData.count            // first action
acc.value                // 5
newData.foreach(println) // second action
acc.value                // 10

Running the above, acc.value ends up as 10 after the second action rather than the expected 5.

Cause Analysis

Spark chains a series of transformations into a long task lineage that is executed only when an action triggers it (transformations are lazy). Accumulators behave the same way: their updates take effect only when an action runs.

  • Before any action has run, calling value shows no change (the accumulator is still 0).
  • After the first action (count), value shows 5.
  • After the second action (foreach), value shows 10.

The reason is that the second action re-executes the map transformation, so the accumulator is incremented by another 5 on top of the original 5 and becomes 10.
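This laziness and re-execution can be observed directly. Below is a minimal sketch, assuming the same SparkContext sc as above and Spark 2.x's longAccumulator API (the variable and accumulator names are illustrative):

val lazyAcc = sc.longAccumulator("Lazy Demo")
val rdd = sc.parallelize(1 to 10).map { x => lazyAcc.add(1); x }
println(lazyAcc.value) // 0: no action has run yet, so the map has not executed
rdd.count()
println(lazyAcc.value) // 10: the first action executed the map once
rdd.count()
println(lazyAcc.value) // 20: the second action re-executed the map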

Solution

From the description above, the solution is clear: trigger only one action, or, when multiple actions are needed, cut the dependency between them with cache or persist. Once the RDD is cached after its first computation, subsequent actions read the cached data instead of re-running the preceding transformations, so the accumulator is not updated a second time (see the sketch below).
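Applied to the first example, a minimal sketch of the fix, again assuming the same sc and illustrative names:

val acc2 = sc.longAccumulator("Error Accumulator")
val evens = sc.parallelize(1 to 10).map { x =>
  if (x % 2 == 0) acc2.add(1)
  x
}
evens.cache()            // cache the result of the first computation
evens.count()            // first action: acc2.value is now 5
evens.foreach(println)   // second action reads the cache; the map is not re-run
println(acc2.value)      // still 5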

Related Case

  • Requirement

    Use accumulators to count how many NULL (empty) fields appear in the emp table and how many normal rows there are, and print the normal rows.

  • Data

    The classic emp table (columns: EMPNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO), tab-separated; missing MGR/COMM values appear as empty fields.

    7369  SMITH   CLERK   7902    1980-12-17  800.00      20
    7499  ALLEN   SALESMAN    7698    1981-2-20   1600.00 300.00  30
    7521  WARD    SALESMAN    7698    1981-2-22   1250.00 500.00  30
    7566  JONES   MANAGER 7839    1981-4-2    2975.00     20
    7654  MARTIN  SALESMAN    7698    1981-9-28   1250.00 1400.00 30
    7698  BLAKE   MANAGER 7839    1981-5-1    2850.00     30
    7782  CLARK   MANAGER 7839    1981-6-9    2450.00     10
    7788  SCOTT   ANALYST 7566    1987-4-19   3000.00     20
    7839  KING    PRESIDENT       1981-11-17  5000.00     10
    7844  TURNER  SALESMAN    7698    1981-9-8    1500.00 0.00    30
    7876  ADAMS   CLERK   7788    1987-5-23   1100.00     20
    7900  JAMES   CLERK   7698    1981-12-3   950.00      30
    7902  FORD    ANALYST 7566    1981-12-3   3000.00     20
    7934  MILLER  CLERK   7782    1982-1-23   1300.00     10
  • Pit & solutions encountered

    Symptom & Analysis:

    As explained earlier, a series of transformations in Spark forms a long task chain that is executed only when an action triggers it, and accumulator updates likewise take effect only when an action runs. Before any action, calling the accumulator's value method shows no change. After the first action on normalData (the foreach), the accumulator's value becomes 11. After the second action (the count), the filter transformation in front of it is executed again, so the accumulator is incremented by another 11 and becomes 22.

    Solution:

    From this analysis, an accumulator's value is guaranteed to be accurate only if a single action is used. When more than one action is needed, the way out is to cut the dependency between them: calling cache() on normalData caches the RDD the first time it is computed, so the same computation triggered later reads the cache instead of being recomputed.

    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Use Spark accumulators for job-level counting:
     * count the NULL occurrences in the emp table and the number of normal rows,
     * and print the normal rows.
     */
    object AccumulatorsApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("AccumulatorsApp")
        val sc = new SparkContext(conf)
        val lines = sc.textFile("E:/emp.txt")
        // long accumulator for counting NULL (empty) fields
        val nullNum = sc.longAccumulator("NullNumber")
        val normalData = lines.filter(line => {
          var flag = true
          val splitLines = line.split("\t")
          for (splitLine <- splitLines) {
            if ("".equals(splitLine)) {
              flag = false
              nullNum.add(1)
            }
          }
          flag
        })
        // cache the RDD after its first computation so the second action
        // does not re-run the filter and double-count the accumulator
        normalData.cache()
        // print each normal row
        normalData.foreach(println)
        // print the number of normal rows
        println("NORMAL DATA NUMBER: " + normalData.count())
        // print the number of NULL occurrences in the emp table
        println("NULL: " + nullNum.value)
        sc.stop()
      }
    }
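    Assuming the fields in E:/emp.txt are separated by single tabs (as the split("\t") implies), only the ALLEN, WARD, MARTIN, and TURNER rows have no empty field, so the expected output is those 4 rows followed by NORMAL DATA NUMBER: 4 and NULL: 11, matching the analysis above.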

Source: blog.51cto.com/14309075/2414001