Analysis of Commonly Used Spark Accumulators in Production (Part 4)
Symptom
val acc = sc.longAccumulator("Error Accumulator")
val data = sc.parallelize(1 to 10)
// Increment the accumulator for every even element
val newData = data.map { x =>
  if (x % 2 == 0) {
    acc.add(1)
  }
}
newData.count               // first action: the map runs, acc is incremented 5 times
acc.value                   // 5
newData.foreach(println)    // second action: the map runs again
acc.value                   // 10
After running the above, the final value of acc.value is 10, not the expected 5.
Cause Analysis
In Spark, a series of transform operations builds up a long task chain that is only executed when triggered by an action (lazy evaluation). Accumulators behave the same way: they are only updated when an action runs.
- Therefore, before any action has run, calling value shows no change
- After the first action (count), calling value shows 5
- After the second action (foreach), calling value shows 10
The reason is that the second action re-executes the map transform, so the accumulator is incremented by another 5 on top of the original 5 and becomes 10.
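The recomputation itself is easy to observe. Here is a minimal standalone sketch (not from the original post) that puts a println inside a transform; in local mode the message appears once per element per action, showing that the map body runs again for every action:

val data = sc.parallelize(1 to 10)
val mapped = data.map { x =>
  println(s"computing $x")    // side effect that makes recomputation visible
  x * 2
}
mapped.count()      // prints "computing ..." 10 times
mapped.collect()    // prints "computing ..." 10 more times: the map was re-executed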
Solution
Given the behavior described above, the fix is straightforward: either perform only a single action, or cut the dependency between jobs so the transform is not recomputed, i.e. use cache or persist, as sketched below. Once the RDD is cached, subsequent actions no longer re-run the earlier transforms, so the accumulator is not incremented again.
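For instance, a minimal sketch of the cache fix, reusing the acc and data definitions from the example above:

val newData = data.map { x =>
  if (x % 2 == 0) {
    acc.add(1)
  }
}
newData.cache()             // cache before the first action so later actions reuse the result
newData.count               // first action: the map runs once, acc.value becomes 5
newData.foreach(println)    // second action: served from cache, the map does not re-run
acc.value                   // still 5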
Related Case
Requirement
Use accumulators to count how many times NULL (an empty field) appears in the emp table and how many normal rows there are, and print the normal rows.
Data
(Fields in emp.txt are tab-separated; missing values such as mgr or comm are empty fields.)
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
Pitfall Encountered & Solution
Symptom & Analysis:
As noted earlier, a series of transform operations in Spark forms a long task chain that is only executed when an action is triggered, and the accumulator likewise is only updated when an action executes. Therefore, before any action has run, calling the accumulator's value method shows no change at all. After the foreach on normalData, i.e. after the first action, the accumulator's value becomes 11. Then, after the count on normalData, i.e. after a second action, the earlier filter transform is executed again, so the accumulator is incremented by another 11 and becomes 22.
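In short, without caching the behavior looks like this (a sketch reusing the nullNum and normalData definitions from the full program below):

normalData.foreach(println)    // first action: the filter runs, nullNum.value == 11
println(normalData.count())    // second action: the filter runs again, nullNum.value == 22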
Solution:
From the analysis above, when using an accumulator we can only guarantee a correct result by triggering a single action. Faced with this situation, the way out is to cut the dependency between the two jobs: call cache on normalData, so that the RDD is cached the first time it is computed and subsequent actions reuse the cached result instead of recomputing the filter (and re-incrementing the accumulator).
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Use Spark accumulators while processing a job's data:
 * count how many times NULL appears in the emp table and how many
 * normal rows there are & print the normal rows
 */
object AccumulatorsApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("AccumulatorsApp")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("E:/emp.txt")
    // Long-typed accumulator
    val nullNum = sc.longAccumulator("NullNumber")
    val normalData = lines.filter(line => {
      var flag = true
      val splitLines = line.split("\t")
      for (splitLine <- splitLines) {
        if ("".equals(splitLine)) {
          flag = false
          nullNum.add(1)
        }
      }
      flag
    })
    // Cache the RDD's first computation so later actions do not recompute it
    // and skew the accumulator's value
    normalData.cache()
    // Print every normal row
    normalData.foreach(println)
    // Print the number of normal rows
    println("NORMAL DATA NUMBER: " + normalData.count())
    // Print the number of NULLs in the emp table
    println("NULL: " + nullNum.value)

    sc.stop()
  }
}
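With the sample data above (assuming emp.txt is tab-separated, with empty fields for the missing mgr/comm values), the program should print the 4 fully populated rows (ALLEN, WARD, MARTIN, TURNER), then NORMAL DATA NUMBER: 4 and NULL: 11; thanks to the cache call, the accumulator stays at 11 rather than doubling to 22.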