Big Data Development - Spark - Understanding Closures

1. Understanding closures in Scala

A closure is a function whose return value depends on one or more variables declared outside of it. A closure can be loosely thought of as a function that can access local variables defined in an enclosing scope.

Consider the following anonymous function:

val multiplier = (i:Int) => i * 10  

The variable i in the function body is a parameter of the function. Now look at another piece of code:

val multiplier = (i:Int) => i * factor

There are two variables in multiplier: i and factor. i is a formal parameter of the function; each time multiplier is called, i is bound to a new value. factor, however, is not a formal parameter but a free variable. Consider the following code:

var factor = 3
val multiplier = (i:Int) => i * factor

Here we introduce a free variable factor, which is defined outside the function.

The function variable multiplier defined this way is called a "closure", because it refers to a variable defined outside the function body: defining the function captures the free variable, closing over it to form a self-contained function.

Complete example:

object Test {  
   def main(args: Array[String]): Unit = {  
      println( "multiplier(1) value = " + multiplier(1) )  // multiplier(1) value = 3
      println( "multiplier(2) value = " + multiplier(2) )  // multiplier(2) value = 6
   }  
   var factor = 3  
   val multiplier = (i:Int) => i * factor  
}
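Because factor is a var and the closure captures the variable itself rather than a snapshot of its value, reassigning factor changes what multiplier returns. The following small sketch (the object name ClosureCapture is illustrative) shows this:

object ClosureCapture {
   var factor = 3
   val multiplier = (i: Int) => i * factor

   def main(args: Array[String]): Unit = {
      println(multiplier(2))   // prints 6: 2 * 3
      factor = 10
      println(multiplier(2))   // prints 20: the closure sees the updated factor
   }
}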

2. Understanding of closures in Spark

First look at the following piece of code:

val data=Array(1, 2, 3, 4, 5)
var counter = 0
val rdd = sc.parallelize(data)

// What happens if we do this?
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

The first thing to note is that the output above is 0 (at least when the job runs on a cluster). Spark breaks the processing of an RDD operation down into tasks, and each task is executed by an Executor. Before execution, Spark computes the task's closure: the variables and methods that must be visible to the Executor to perform its computation on the RDD (in this case, foreach()). The closure is serialized and sent to each Executor, so each Executor only receives a copy of counter; the counter printed on the Driver is still the original, unmodified variable. If you need to update a global value, use an accumulator, and in spark-streaming use updateStateByKey to maintain shared state.
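As a minimal sketch of the accumulator approach mentioned above (it assumes an existing SparkContext named sc, as in the snippet before it), the counter can be rewritten with a LongAccumulator, which is updated on the Executors and merged back so the Driver can read its value:

val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// The accumulator is registered on the Driver; Executors only add to it
val counter = sc.longAccumulator("counter")
rdd.foreach(x => counter.add(x))

// Reading the value on the Driver gives the merged result: 15
println("Counter value: " + counter.value)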

In addition, closures in Spark serve other purposes:

1. Strip out global variables that the Executor does not need, so that only the variables actually referenced are copied from the Driver to the Executor.

2. Ensure that everything sent to the Executor is serializable.

For example, when working with a Dataset, the case class must be defined at the top level (under the class or object), not inside a method, even though the latter is syntactically legal. Similarly, if you use json4s for serialization, the declaration implicit val formats = DefaultFormats should be placed at the class level rather than inside a method; otherwise formats has to be serialized separately as part of the closure, even if nothing else uses it. See the sketch below.
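A minimal sketch of the Dataset point (the class and field names here are illustrative): the case class is defined at the top level so that Spark can resolve its encoder and the task closure stays cleanly serializable, instead of dragging in the enclosing method's scope:

// Defined at the top level, not inside a method
case class Person(name: String, age: Int)

object DatasetExample {
   def main(args: Array[String]): Unit = {
      val spark = org.apache.spark.sql.SparkSession.builder()
         .appName("DatasetExample")
         .master("local[*]")
         .getOrCreate()
      import spark.implicits._

      // The encoder for Person is found because the case class is top level
      val ds = Seq(Person("a", 1), Person("b", 2)).toDS()
      ds.show()

      spark.stop()
   }
}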

3. Summary

Closures appear throughout Spark's entire life cycle: for example, any data copied from the Driver to the Executors must be serialized together with the closure.

Wu Xie, Xiao San Ye, a rookie in backend development, big data, and artificial intelligence. Follow me for more.


Origin blog.csdn.net/hu_lichao/article/details/112451982