Accumulator operation

Introduction to Accumulator

An Accumulator is a shared variable provided by Spark. As the name suggests, tasks can only add to it.
Only the driver can read the value of an Accumulator (using the value method); a task can only add to it (using +=). You can also give an Accumulator a name (not supported in Python); named accumulators show up in the Spark web UI and help you understand what the program is doing.

Accumulator use

Example of use

Here is an example of the simplest use of accumulator:

// Define the accumulator on the driver
val accum = sc.accumulator(0, "Example Accumulator")
// Add to it inside the tasks
sc.parallelize(1 to 10).foreach(x => accum += 1)

// Read the value on the driver
accum.value
// The result is 10
res: 10
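
The snippet above uses sc.accumulator, which has been deprecated since Spark 2.0 in favor of the AccumulatorV2 family. As a minimal sketch, the same count could be written with sc.longAccumulator as shown here. The object name, the app name, and the local[*] master are assumptions made only so the example can run without a cluster.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  // Counts the elements of 1..10 with a named LongAccumulator.
  def countElements(spark: SparkSession): Long = {
    val sc = spark.sparkContext
    // Created on the driver; the name is what appears in the web UI.
    val accum = sc.longAccumulator("Example Accumulator")
    // Updated inside the tasks; foreach is an action.
    sc.parallelize(1 to 10).foreach(_ => accum.add(1L))
    // Read back on the driver.
    accum.sum
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("accumulator-example")
      .getOrCreate()
    try println(countElements(spark)) // prints 10
    finally spark.stop()
  }
}
```

Because the update happens inside an action, each task's contribution is applied once even if a task is restarted.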

Incorrect use of accumulators

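val accum = sc.accumulator(0, "Error Accumulator")
val data = sc.parallelize(1 to 10)
// Use the accumulator to count the even numbers; even numbers map to 0, odd numbers to 1
val newData = data.map { x =>
  if (x % 2 == 0) {
    accum += 1
    0
  } else 1
}
// Trigger execution with an action
newData.count
// accum is now 5, which is the result we want
accum.value

// Continue and print the transformed data; foreach is also an action
newData.foreach(println)
// The previous step did not touch the accumulator explicitly, yet its value is now 10
// This is not the result we want
accum.value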

Cause Analysis

The official explanation for the problem is as follows:

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.

As we know, a series of transformations in Spark only builds up a chain of lazy operations, and nothing runs until an action triggers it. Accumulator updates inside a transformation are no exception: if you call the value method before any action has run, the accumulator will not have changed.

So after the first count (an action), the value of the accumulator is 5, which is the answer we want.

After that, a foreach (another action) is performed on newData. Because newData is not cached, Spark recomputes it from its lineage, so the map transformation is executed a second time and the accumulator is increased by 5 again. The final value is therefore 10.
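
The recomputation described above can be made visible by reading the accumulator after each action. The following is a minimal sketch that uses the Spark 2.x longAccumulator API rather than the deprecated sc.accumulator; the object name, accumulator name, and local[*] master are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RecomputationDemo {
  // Returns the accumulator value after the first and after the second action.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val evens = sc.longAccumulator("Even Counter")
    // The accumulator is updated inside a transformation (map).
    val newData = sc.parallelize(1 to 10).map { x =>
      if (x % 2 == 0) { evens.add(1L); 0 } else 1
    }
    newData.count()            // first action: the map runs once
    val afterCount = evens.sum
    newData.foreach(_ => ())   // second action: the map runs again
    val afterForeach = evens.sum
    (afterCount, afterForeach)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("recomputation-demo")
      .getOrCreate()
    try println(run(spark)) // (5,10) in a local run with no task retries
    finally spark.stop()
  }
}
```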


Solution

From the analysis above, it may seem that an accumulator is only accurate if exactly one action is run on the RDD that updates it.

In fact there is another way out: keep later actions from re-running the transformation that updates the accumulator. The methods that do this are cache and persist. Once the RDD has been materialized by the first action, later actions read the stored partitions instead of re-executing the map, so the accumulator is not updated again. Note that cache and persist do not actually remove the lineage; if a cached partition is evicted or lost, Spark recomputes it and the accumulator can be incremented again.


val accum = sc.accumulator(0, "Error Accumulator")
val data = sc.parallelize(1 to 10)

// Same code as in the example above
val newData = data.map{x => {...}}
// Cache the data so that later actions reuse it instead of re-running the map
newData.cache.count
// accum is now 5
accum.value

newData.foreach(println)
// accum is still 5
accum.value
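
For reference, here is a complete, hedged version of the cached variant, again written against the Spark 2.x longAccumulator API. The object name and local[*] master are assumptions; the result also assumes that the ten cached elements stay in memory, which should hold for a dataset this small.

```scala
import org.apache.spark.sql.SparkSession

object CachedAccumulatorDemo {
  // Returns the accumulator value after the first and after the second action.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val evens = sc.longAccumulator("Even Counter")
    val newData = sc.parallelize(1 to 10).map { x =>
      if (x % 2 == 0) { evens.add(1L); 0 } else 1
    }
    // Mark the RDD for caching; the first action materializes it.
    newData.cache()
    newData.count()
    val afterCount = evens.sum
    // This action reads the cached partitions, so the map is not re-run.
    newData.foreach(_ => ())
    val afterForeach = evens.sum
    (afterCount, afterForeach)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-accumulator-demo")
      .getOrCreate()
    try println(run(spark)) // (5,5) while the cached partitions remain available
    finally spark.stop()
  }
}
```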

Summary

When an Accumulator is updated inside a transformation, run only one action on the resulting RDD to keep the count accurate. If the RDD has to be used by several actions, cache or persist it first so that the transformation is not re-executed.
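
A further option, suggested by the official note quoted earlier, is to move the accumulator update out of the transformation and into an action, where Spark applies each task's update only once. The sketch below is one possible arrangement and not the only one; the object name and local[*] master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ActionOnlyCounting {
  // Counts even numbers in an action and keeps the map free of side effects.
  def run(spark: SparkSession): Long = {
    val sc = spark.sparkContext
    val evens = sc.longAccumulator("Even Counter")
    val data = sc.parallelize(1 to 10)
    // The side effect lives in an action (foreach), not in a transformation.
    data.foreach(x => if (x % 2 == 0) evens.add(1L))
    // The transformation is now pure, so evaluating it repeatedly is harmless.
    val newData = data.map(x => if (x % 2 == 0) 0 else 1)
    newData.count()
    newData.collect()
    evens.sum
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("action-only-counting")
      .getOrCreate()
    try println(run(spark)) // prints 5
    finally spark.stop()
  }
}
```

The cost of this arrangement is one extra pass over data, which is the trade-off for not depending on the cache staying intact.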
