1. Understanding closures in Scala
A closure is a function whose return value depends on one or more variables declared outside the function. Put simply, a closure is a function that can access variables defined in its enclosing scope.
Consider the following anonymous function:
val multiplier = (i: Int) => i * 10
The variable i in the function body is a formal parameter of the function. Now look at another piece of code:
val multiplier = (i: Int) => i * factor
There are two variables in multiplier: i and factor. i is a formal parameter of the function: each time multiplier is called, i is bound to a new value. factor, however, is not a formal parameter but a free variable. Consider the following code:
var factor = 3
val multiplier = (i: Int) => i * factor
Here we introduce a free variable, factor, which is defined outside the function. The function value multiplier defined this way is called a "closure", because it refers to a variable defined outside the function body: defining the function captures the free variable, "closing over" it to form a self-contained function.
Complete example:
object Test {
  def main(args: Array[String]): Unit = {
    println("multiplier(1) value = " + multiplier(1))
    println("multiplier(2) value = " + multiplier(2))
  }

  var factor = 3
  val multiplier = (i: Int) => i * factor
}
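Note that the closure captures the variable factor itself, not its value at the moment the function is defined, so reassigning factor later changes what the closure computes. A minimal self-contained sketch (the object name ClosureCapture is ours):

```scala
object ClosureCapture {
  def main(args: Array[String]): Unit = {
    var factor = 3
    val multiplier = (i: Int) => i * factor
    println(multiplier(2)) // prints 6

    // The closure captured the variable `factor`, not the value 3,
    // so reassigning it changes the result of later calls:
    factor = 10
    println(multiplier(2)) // prints 20
  }
}
```

This capture-by-reference behavior is exactly why sending closures to remote machines (as Spark does, below) requires care: the remote copy of a captured variable is no longer the same variable.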
2. Understanding of closures in Spark
First look at the following piece of code:
val data = Array(1, 2, 3, 4, 5)
var counter = 0
val rdd = sc.parallelize(data)

// What happens if we do this?
rdd.foreach(x => counter += x)

println("Counter value: " + counter)
The first thing to note is that the code above prints 0 (when running on a cluster). Spark decomposes the processing of an RDD operation into tasks, each of which is executed by an Executor. Before executing a task, Spark computes the task's closure: the variables and methods that must be visible to the Executor in order to perform its computation on the RDD (in this case, foreach()). This closure is serialized and sent to each Executor, but what each Executor receives is only a copy, so the counter printed on the Driver is still the Driver's own, unmodified variable. If you want to update a global value, use an accumulator; in Spark Streaming, use updateStateByKey to update shared state.
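To aggregate values back to the Driver correctly, Spark provides accumulators. A minimal sketch, assuming a SparkContext sc is already in scope (as in spark-shell); the accumulator name "counter" is arbitrary:

```scala
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// A LongAccumulator is updated inside tasks on the Executors and
// merged back to the Driver, unlike a plain captured variable:
val counter = sc.longAccumulator("counter")
rdd.foreach(x => counter.add(x))

println("Counter value: " + counter.value) // prints 15
```

Unlike the broken version above, the accumulator's updates are sent back from the Executors and merged, so reading counter.value on the Driver yields the full sum.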
In addition, closure computation in Spark serves other purposes:
1. It strips out unused global variables, so the Driver copies only the variables a task actually uses to the Executors.
2. It ensures that the data sent to the Executors is serializable.
For example, when using a Dataset, case class definitions must be placed at the class level, not inside a method, even though the latter is syntactically legal. Likewise, if you use json4s for serialization, the implicit val formats = DefaultFormats declaration should be placed at the class level; otherwise the formats object must be serialized separately into each closure, even if the method does not use it for anything else.
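As a sketch of the recommended layout (the names Record, JsonExample, and toJson are ours; assumes the json4s library is on the classpath):

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.Serialization

// Defined at the top level, not inside a method, so serialization
// machinery (e.g. Dataset encoders) can resolve it cleanly:
case class Record(id: Int, name: String)

object JsonExample {
  // Declared once at object level and shared by all methods,
  // rather than redeclared (and re-serialized) inside each one:
  implicit val formats: Formats = DefaultFormats

  def toJson(r: Record): String = Serialization.write(r)
}
```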
3. Summary
Closures appear throughout the entire Spark lifecycle: for example, any data copied from the Driver to the Executors must be serialized together with its closure.
Wu Xie (Xiao San Ye): a rookie in backend development, big data, and artificial intelligence. Follow for more.