Spark Series (6) - Accumulators and Broadcast Variables

I. Introduction

Spark provides two types of shared variables: accumulators (accumulator) and broadcast variables (broadcast variable):

  • Accumulator : used to aggregate information, mainly in counting and summing scenarios;
  • Broadcast variable : used to distribute large objects to the nodes efficiently.

II. Accumulators

Let's start with a specific scenario: an ordinary cumulative sum. If the following code is executed in cluster mode, you will find that the result is not what you expect:

var counter = 0
val data = Array(1, 2, 3, 4, 5)
sc.parallelize(data).foreach(x => counter += x)
println(counter)

The final value of counter is 0. The root cause of this problem is the closure.

2.1 Understanding Closures

1. Closures in Scala

First, a brief introduction to the concept of closures in Scala:

var more = 10
val addMore = (x: Int) => x + more

The function addMore involves two variables, x and more:

  • x : a bound variable (bound variable), because it is a parameter of the function and is clearly defined in the function's context;
  • more : a free variable (free variable), because the function literal itself does not give more any meaning.

By definition: when a function is created, if it needs to capture a free variable, then the function that holds a reference to the captured variable is called a closure.
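As a quick illustration (a REPL-style sketch), the closure holds a reference to the variable more itself rather than a snapshot of its value, so later changes to the free variable are visible through the closure:

var more = 10
val addMore = (x: Int) => x + more
addMore(1)   // 11
// change the free variable captured by the closure
more = 100
addMore(1)   // 101: the closure sees the updated value of more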

2. Closures in Spark

In an actual computation, Spark decomposes the operations on an RDD into Tasks, and Tasks run on Worker Nodes. Before execution, Spark computes the closure of each task. If the closure involves free variables, the program copies them, places the copies inside the closure, and then serializes the closure and sends it to each Executor. Therefore, when counter is referenced inside foreach, it is no longer the counter on the Driver node but a copy of counter inside the closure. By default, the updated value of the copy is not sent back to the Driver, so the final value of counter is still zero.

Note: in local mode, the Worker Node that executes foreach may be in the same JVM as the Driver and reference the same original counter, so the update may happen to be correct; in cluster mode, however, it certainly will not be. Therefore, when you run into this kind of problem, you should prefer an accumulator.

The principle behind the accumulator is actually very simple: the final value of each copy of the variable is sent back to the Driver, the Driver aggregates these copies to obtain the final value, and then updates the original variable.

2.2 Using Accumulators

SparkContext defines all the methods for creating accumulators. Note that the old Accumulator methods have been marked as deprecated since Spark 2.0.0.
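For reference, the replacement factory methods on SparkContext (a quick sketch, assuming Spark 2.x or later) cover the common cases:

// built-in, non-deprecated accumulator factories (Spark 2.x+)
val longAcc   = sc.longAccumulator("long sum")            // accumulates Long values
val doubleAcc = sc.doubleAccumulator("double sum")        // accumulates Double values
val listAcc   = sc.collectionAccumulator[String]("items") // accumulates elements into a list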

An example of usage and its execution result is as follows:

val data = Array(1, 2, 3, 4, 5)
// define the accumulator
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(data).foreach(x => accum.add(x))
// get the accumulator's value
accum.value
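If the built-in accumulators are not enough, the post-2.0.0 API also allows custom accumulators by extending AccumulatorV2. Below is a minimal sketch (the class name and data are made up for illustration) of an accumulator that collects distinct strings:

import org.apache.spark.util.AccumulatorV2

// a custom accumulator that merges string values into a Set
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var _set: Set[String] = Set.empty

  override def isZero: Boolean = _set.isEmpty
  override def copy(): StringSetAccumulator = {
    val newAcc = new StringSetAccumulator
    newAcc._set = _set
    newAcc
  }
  override def reset(): Unit = { _set = Set.empty }
  override def add(v: String): Unit = { _set += v }
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { _set ++= other.value }
  override def value: Set[String] = _set
}

// a custom accumulator must be registered with the SparkContext before use
val wordsAcc = new StringSetAccumulator
sc.register(wordsAcc, "distinct words")
sc.parallelize(Seq("spark", "scala", "spark")).foreach(w => wordsAcc.add(w))
wordsAcc.value   // Set(spark, scala)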

III. Broadcast Variables

In the discussion of closures above, we said that each Task holds its own copy of the free variables in its closure. When the variables are large and the Tasks are many, this inevitably puts pressure on network IO. To address this, Spark provides broadcast variables.

The idea behind broadcast variables is simple: instead of distributing a copy of the variable to every Task, distribute a single copy to each Executor, and let all Tasks on that Executor share it.

// define an array as a broadcast variable
val broadcastVar = sc.broadcast(Array(1, 2, 3, 4, 5))
// afterwards, when the array is needed, the broadcast variable should be used instead of the original value
sc.parallelize(broadcastVar.value).map(_ * 10).collect()
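A more typical use case (a sketch with made-up data) is broadcasting a small lookup table, so that every Task running on an Executor reads the same shared copy instead of having the table serialized into each Task's closure:

// a small lookup table shipped once per Executor instead of once per Task
val countryNames = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))
val codes = sc.parallelize(Seq("CN", "US", "CN", "FR"))
// each task reads the shared copy through .value
codes.map(code => countryNames.value.getOrElse(code, "Unknown")).collect()
// Array(China, United States, China, Unknown)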

Reference material

RDD Programming Guide

More articles in this big data series can be found in the GitHub open-source project: Big Data Getting Started

Origin www.cnblogs.com/heibaiying/p/11330382.html