Spark -- passing functions and closures

        In Scala, a function defined inside another function can access any variable that is in scope at the point of definition. Even after the outer function has returned, the inner function can still be called and still uses those variables. That is the most basic understanding of a closure.
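
A minimal sketch (plain Scala, no Spark needed) of what this means:

def makeGreeting(): () => String = {
  val name = "Spark"                 // local variable of makeGreeting
  () => s"hello, $name"              // the inner function captures name
}
val greet = makeGreeting()           // makeGreeting has already returned here...
println(greet())                     // ...yet the closure still reads name: "hello, Spark"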

1. Function parameters of the transform and action operators

        In a Spark cluster, a Spark application consists of a driver program (Driver), which runs the user's main function, and executors (Executor) that run in parallel in worker processes on the cluster's nodes. An action operator triggers the submission of a Spark job; during submission, the func passed to each transform and action operator is wrapped into a closure and shipped to the worker nodes for execution (following the principle of data locality).
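
A rough sketch of this division of labor (assuming an existing SparkContext sc):

val rdd = sc.parallelize(1 to 100)   // built on the driver as part of the RDD lineage
val doubled = rdd.map(_ * 2)         // transform: lazy, the func is only recorded
val total = doubled.reduce(_ + _)    // action: triggers job submission; the closures are
                                     // serialized and executed on the executors
println(total)                       // 10100, computed on the cluster, returned to the driver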

        Clearly, a closure carries state: mainly its free variables, and whatever those free variables in turn depend on. So when a simple function or code fragment is passed to an operator as an argument, Spark detects all the variables the closure involves, serializes them, ships them to the worker nodes, and deserializes them there (detect -> serialize -> transmit -> deserialize).
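
One practical consequence: everything the closure captures must be serializable. The Multiplier class below is made up for illustration, but the failure pattern and its usual fix are common:

import org.apache.spark.rdd.RDD

class Multiplier(val factor: Int) {          // note: does NOT extend Serializable
  def scale(rdd: RDD[Int]): RDD[Int] =
    rdd.map(x => x * factor)                 // factor is really this.factor, so the closure
                                             // captures `this`; Spark rejects it with
                                             // "Task not serializable"
}

// A common fix: copy the needed field into a local val so only an Int is captured.
class SafeMultiplier(val factor: Int) {
  def scale(rdd: RDD[Int]): RDD[Int] = {
    val f = factor                           // plain Int, serializable on its own
    rdd.map(x => x * f)
  }
}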

A function used as a parameter:

val f: (Double) => Double = 2 * _

f has the type (Double) => Double: it takes one Double parameter and returns a Double. Spark's transform and action operators all take function parameters, and closures appear most frequently there.

val f = (x: Int) => 2 * x
val rdd = sc.parallelize(1 to 10)
val rdd1 = rdd.map(x => f(x))

Collecting rdd1 yields Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20). This does not yet seem to have anything to do with closures; do not worry, it is only here to show how the transform and action operators invoke their function parameters.
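
As a sketch, the same function parameter can be passed in several equivalent ways:

def double(x: Int): Int = 2 * x              // a method
val g = (x: Int) => 2 * x                    // a function value

val nums = sc.parallelize(1 to 10)
nums.map(g).collect()                        // pass a function value
nums.map(double).collect()                   // a method is lifted to a function (eta expansion)
nums.map(x => 2 * x).collect()               // inline anonymous function
// each call returns Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)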

2. Understanding closures

def mulBy(factor: Double) = (x: Double) => factor * x
val triple = mulBy(3)
val half = mulBy(0.5)
println(s"${triple(14)}, ${half(14)}")

        This defines a function mulBy that takes a Double parameter factor and returns the function (x: Double) => factor * x.

        First, mulBy is called with the argument 3. The returned function (x: Double) => factor * x captures factor = 3, the variable referenced inside mulBy, and is stored in triple. The factor parameter is then popped off the runtime stack.

        Then mulBy is called again with factor set to 0.5; in the same way, the new function is stored in half, and the factor parameter is popped off the runtime stack again.

        Because each call to mulBy stores its result in a variable (triple and half above), those functions keep using factor even outside the scope in which factor existed. That is the "closure": the code together with the definitions of any non-local variables that code uses. The output is therefore: 42.0, 7.0

        Although the calls to triple and half still appear to use the factor variable, factor is no longer really a variable inside triple and half; it is effectively a fixed constant, 3 and 0.5 respectively.
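
By contrast, when the captured variable is a var, a plain Scala closure sees later updates to it; a minimal sketch:

var more = 1
val addMore = (x: Int) => x + more           // the closure refers to the variable more itself
println(addMore(10))                         // 11
more = 100                                   // a later change to more is visible through the closure
println(addMore(10))                         // 110

Once Spark serializes such a closure and ships it to the executors, however, each executor works on its own deserialized copy of the variable, which is exactly what the next section shows.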

3. Closures in Spark: local mode vs cluster mode

        From the above you can understand how Spark operators invoke functions, how the variables those functions involve reach the worker nodes, and what a closure is. When code executes on a cluster, the scope and life cycle of variables and methods become harder to reason about, and this is where Spark is most often misunderstood.

val data = Array(1, 2, 3, 4)
var counter = 0
val rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)
println(s"Counter value : $counter")

        This simple sum over the RDD's elements behaves completely differently depending on whether or not it runs inside the same JVM as the driver.

        In local mode, in some configurations, the code runs in the same JVM as the driver, so every foreach operation updates the same counter, and the result is the "expected" sum of the RDD's elements.

        In cluster mode, to execute the job, Spark splits the processing of the RDD into multiple tasks, each executed by one executor (Executor; one task is handled by exactly one Executor, while one Executor can handle many tasks). Before execution, Spark computes the closure (detecting the variables and methods it references; in the code above, counter and foreach), serializes it, and distributes it to each executor. In other words, every executor gets its own copy of counter; when it increments counter, it only modifies its own copy, and the counter on the driver (Driver) is never touched, so the final output does not meet expectations: it prints 0. You can think of the driver's counter as a global variable and each executor's counter as a local copy.
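
If all you want is the sum, the simplest fix is to let an action do the aggregation instead of mutating a driver-side variable; a sketch (rdd2 is just a stand-in for the rdd above):

val rdd2 = sc.parallelize(Array(1, 2, 3, 4))
val sum = rdd2.reduce(_ + _)                 // partial sums are computed on the executors,
println(s"Sum: $sum")                        // the final result, 10, is returned to the driver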

        To cope with this effect of closures, Spark supports globally shared variables: broadcast (Broadcast) variables, which cache a value in memory on every node, and, for accumulation operations, accumulators (accumulator).
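
A minimal broadcast-variable sketch (the lookup table here is made up for illustration):

val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")  // hypothetical lookup table
val bc = sc.broadcast(lookup)                           // cached once per node, read-only
val named = sc.parallelize(Array(1, 2, 3, 4))
  .map(x => bc.value.getOrElse(x, "?"))                 // executors read bc.value locally
println(named.collect().mkString(", "))                 // one, two, three, ?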

val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
println(s"accum = ${accum.value}")
// accum prints 10
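
Note that sc.accumulator is the older API; since Spark 2.x it is deprecated in favor of typed, named accumulators, roughly like this:

val accum2 = sc.longAccumulator("sum")                  // Spark 2.x style, named accumulator
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum2.add(x))
println(s"accum2 = ${accum2.value}")                    // 10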

 


Origin www.cnblogs.com/SysoCjs/p/11345121.html