Big Data Development: Spark Shared Variables, Accumulators and Broadcast Variables

1. Introduction

Spark provides two types of shared variables: accumulators and broadcast variables:

  • Accumulators: used to aggregate information, mainly in scenarios such as counting and summing;

  • Broadcast variables: used to efficiently distribute large objects to the worker nodes.

2. Accumulators

Let's look at a concrete scenario first. For an ordinary cumulative sum, if you run the following code in cluster mode, you will find that the result is not what you expect:

var counter = 0
val data = Array(1, 2, 3, 4, 5)
sc.parallelize(data).foreach(x => counter += x)
println(counter)

The final result of counter is 0. The main cause of this problem is closures.


2.1 Understanding closures

1. The concept of closures in Scala

First, a quick introduction to the concept of closures in Scala:

var more = 10
val addMore = (x: Int) => x + more

The function addMore involves two variables, x and more:

  • x is a bound variable, because it is an input parameter of the function and is fully defined within the function's own context;

  • more is a free variable, because the function literal itself does not give more any meaning.

By definition: when a function captures a free variable at creation time, the function together with its reference to the captured variable is called a closure.
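To make this concrete, here is a small illustration (not from the original post) of how a Scala closure keeps a reference to its free variable rather than a snapshot of its value:

var more = 10
val addMore = (x: Int) => x + more
addMore(5)   // 15: the closure reads the current value of more
more = 100
addMore(5)   // 105: the captured free variable is a shared reference, not a copy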

2. Closures in Spark

You can also refer to: https://blog.csdn.net/hu_lichao/article/details/112451982

In actual computation, Spark decomposes operations on RDDs into tasks, and each task runs on a Worker node. Before execution, Spark computes the task's closure. If free variables are involved in the closure, the program copies them and places the copies inside the closure; the closure is then serialized and sent to each Executor. Therefore, when counter is referenced inside foreach, it is no longer the counter on the Driver node but a copy inside the closure. By default, the updated value of the copy is not sent back to the Driver, so the final value of counter is still zero.

Note that in Local mode, the foreach may happen to run in the same JVM as the Driver and reference the same original counter, so the update may appear correct; in cluster mode, however, it certainly will not be. Therefore, an accumulator should be the first choice when you encounter this kind of problem.

The principle behind the accumulator is actually very simple: the final value of each task's copy is sent back to the Driver, the Driver aggregates them to obtain the final value, and the original variable is updated.
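To see where this copy-and-merge happens, here is a minimal sketch of a custom accumulator built on Spark's AccumulatorV2 API (available since Spark 2.0): each task works on its own copy produced by copy(), and the Driver combines the partial results via merge:

import org.apache.spark.util.AccumulatorV2

// A minimal Long-sum accumulator: each task updates its own copy through add,
// and the Driver combines the per-task copies through merge.
class SumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L
  override def isZero: Boolean = sum == 0L
  override def copy(): SumAccumulator = {
    val acc = new SumAccumulator
    acc.sum = this.sum
    acc
  }
  override def reset(): Unit = sum = 0L
  override def add(v: Long): Unit = sum += v
  override def merge(other: AccumulatorV2[Long, Long]): Unit = sum += other.value
  override def value: Long = sum
}

// Register it before use so Spark can track and merge it:
// val acc = new SumAccumulator
// sc.register(acc, "my sum accumulator")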


2.2 Using the accumulator

All methods for creating accumulators are defined on SparkContext. Note that the older accumulator methods (shown struck through in the API documentation) have been deprecated since Spark 2.0.0.
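For reference, the creation methods that remain current in Spark 2.x are sketched below (names as in the SparkContext API):

// Built-in accumulator factory methods on SparkContext since Spark 2.0:
val longAcc   = sc.longAccumulator("long sum")             // sums Long values
val doubleAcc = sc.doubleAccumulator("double sum")         // sums Double values
val listAcc   = sc.collectionAccumulator[String]("names")  // collects elements into a list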

The usage examples and execution results are as follows:

val data = Array(1, 2, 3, 4, 5)
// Define the accumulator
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(data).foreach(x => accum.add(x))
// Get the accumulator's value
accum.value
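One caveat worth keeping in mind (sketched below, following the Spark documentation): accumulator updates made inside transformations are only applied once an action runs, and they may be applied more than once if a task is re-executed; Spark guarantees exactly-once updates only inside actions such as foreach:

val errorCount = sc.longAccumulator("parse errors")
val parsed = sc.parallelize(Seq("1", "2", "oops")).map { s =>
  try s.toInt
  catch { case _: NumberFormatException => errorCount.add(1); 0 }
}
// map is lazy, so nothing has run yet and errorCount.value is still 0
parsed.count()  // the action triggers the job; errorCount.value is now 1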

3. Broadcast variables

In the discussion of closures above, we said that the closure of each task holds its own copy of the free variables. If the variable is large and there are many tasks, this inevitably puts pressure on network I/O. To solve this problem, Spark provides broadcast variables.

The idea behind broadcast variables is very simple: instead of distributing a copy of the variable to every task, distribute one copy to each Executor, and let all tasks in that Executor share the single copy.

// Define an array as a broadcast variable
val broadcastVar = sc.broadcast(Array(1, 2, 3, 4, 5))
// When the array is used afterwards, prefer the broadcast variable over the original value
sc.parallelize(broadcastVar.value).map(_ * 10).collect()
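A more typical use case, as a hypothetical sketch (the countryNames table is made up for illustration), is shipping a small lookup table to every Executor once and referencing it inside tasks, instead of capturing it in each task's closure:

// A small lookup table that every task needs:
val countryNames = Map("CN" -> "China", "US" -> "United States")
val countryBroadcast = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "CN"))
// Each Executor receives the map once; all of its tasks share that single copy.
val fullNames = codes.map(code => countryBroadcast.value.getOrElse(code, "Unknown"))
fullNames.collect()  // Array(China, United States, China)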

4. Observing accumulator values

The value of a created accumulator can be seen on the Spark Web UI (on the page of the stage that updates it), so you should give the accumulator a name when you create it to make it easy to find there.


5. References

RDD Programming Guide

https://www.cnblogs.com/cc11001100/p/9901606.html

https://www.cnblogs.com/zz-ksw/p/12448650.html

