Analysis of Spark Accumulators Commonly Used in Production

Driver-side

  1. Driver-side initialization: the Accumulator is constructed and initialized, and registered via Accumulators.register(this); the Accumulator is then serialized and sent to the Executor side
  2. After the ResultTask completes, the Driver receives status updates and updates the accumulator's value; the value can then be read from the Accumulator once the Action has finished

Executor side

  1. After the Executor receives the Task, it deserializes it to obtain the RDD and the user-defined function. During deserialization the Accumulator is also deserialized (its readObject method is called) and registered with the TaskContext
  2. After the task's computation finishes, the accumulator values are returned to the Driver together with the Task results (a minimal end-to-end usage sketch follows below)
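
Before going into the source, here is a minimal end-to-end sketch of the lifecycle described above. The local master, data, and object name are illustrative assumptions, not from the original post:

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative setup; local[2] is an assumption for running the sketch locally
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorTest").setMaster("local[2]"))

    // Driver side: construct, initialize and register the accumulator
    val accum = sc.accumulator(0, "AccumulatorTest")

    // Executor side: each task adds to its own deserialized copy
    sc.parallelize(1 to 100).foreach(_ => accum += 1)

    // Driver side: after the action, the merged value can be read
    println(accum.value) // 100

    sc.stop()
  }
}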

Analysis combined with the source code

Driver-side initialization

  On the Driver side, the following steps complete the initialization:

val accum = sparkContext.accumulator(0, "AccumulatorTest")
val acc = new Accumulator(initialValue, param, Some(name))
Accumulators.register(this)
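
Accumulators.register essentially records the new accumulator in a driver-side registry keyed by its id. A minimal sketch of that idea (an illustration, not Spark's actual Accumulators object) could look like the following, using weak references so that unreferenced accumulators can be garbage collected; this is also why the add method shown later has to check whether the reference is still valid:

import java.lang.ref.WeakReference
import scala.collection.mutable
import org.apache.spark.Accumulable

// Illustration only: a simplified driver-side accumulator registry
object SimpleAccumulatorRegistry {
  private val originals = mutable.Map[Long, WeakReference[Accumulable[_, _]]]()

  def register(a: Accumulable[_, _]): Unit = synchronized {
    // Keep only a weak reference so the registry does not pin dead accumulators in memory
    originals(a.id) = new WeakReference[Accumulable[_, _]](a)
  }
}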

Executor-side deserialization of the Accumulator

  Deserialization takes place when ResultTask's runTask is called, which does the following:

// Deserialize the RDD and the user-defined function
val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
   ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  During deserialization, the readObject method of Accumulable is called:

private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    in.defaultReadObject()
    // The initial value of value_ is zero; zero is the field that gets serialized
    value_ = zero
    deserialized = true
    // Automatically register the accumulator when it is deserialized with the task closure.
    //
    // Note internal accumulators sent with task are deserialized before the TaskContext is created
    // and are registered in the TaskContext constructor. Other internal accumulators, such SQL
    // metrics, still need to register here.
    val taskContext = TaskContext.get()
    if (taskContext != null) {
      // The deserialized accumulator is registered with the TaskContext,
      // so the TaskContext can get hold of it.
      // After the task finishes, context.collectAccumulators() returns the values to the executor
      taskContext.registerAccumulator(this)
    }
  }

Note

In Accumulable.scala, value_ is not serialized because it is marked with the @transient keyword:

@volatile @transient private var value_ : R = initialValue // Current value on master
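
The effect of @transient combined with readObject can be seen in a small, self-contained Java-serialization example (not Spark code, just an illustration of the mechanism): the transient field is skipped during serialization, and readObject reinitializes it to zero after defaultReadObject(), which is exactly the pattern used for value_ above.

import java.io._

class Counter(initial: Int) extends Serializable {
  // Not serialized; reset in readObject, mirroring value_ in Accumulable
  @transient private var current: Int = initial
  val zero: Int = 0

  def value: Int = current
  def add(n: Int): Unit = { current += n }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    current = zero // start from zero on the receiving side
  }
}

object TransientDemo extends App {
  val c = new Counter(42)
  val bytes = new ByteArrayOutputStream()
  new ObjectOutputStream(bytes).writeObject(c)
  val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    .readObject().asInstanceOf[Counter]
  println(c.value)    // 42 on the "driver" side
  println(copy.value) // 0 after deserialization, like an executor-side accumulator
}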

Accumulation on each node

Different operations on the accumulator call different methods; some of them are listed below (in Accumulator.scala):

def += (term: T) { value_ = param.addAccumulator(value_, term) }
def add(term: T) { value_ = param.addAccumulator(value_, term) }
def ++= (term: R) { value_ = param.addInPlace(value_, term)}

Depending on the accumulator's type parameters, there are different implementations of AccumulableParam (in Accumulator.scala):

trait AccumulableParam[R, T] extends Serializable {
  def addAccumulator(r: R, t: T): R
  def addInPlace(r1: R, r2: R): R
  def zero(initialValue: R): R
}
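
As an illustration of the three methods' roles, here is a hedged sketch of a custom AccumulableParam that collects strings into a set (the names SetParam and setAccum are made up): addAccumulator adds a single element of type T to a partial result of type R, addInPlace merges two partial results, and zero supplies the starting value.

import scala.collection.mutable
import org.apache.spark.AccumulableParam

// Illustration only: accumulate individual strings into a mutable set
object SetParam extends AccumulableParam[mutable.Set[String], String] {
  def addAccumulator(r: mutable.Set[String], t: String): mutable.Set[String] = r += t
  def addInPlace(r1: mutable.Set[String], r2: mutable.Set[String]): mutable.Set[String] = r1 ++= r2
  def zero(initialValue: mutable.Set[String]): mutable.Set[String] = mutable.Set.empty[String]
}

// Usage sketch on the driver:
// val setAccum = sc.accumulable(mutable.Set.empty[String])(SetParam)
// rdd.foreach(word => setAccum += word)   // calls addAccumulator on each executor
// setAccum.value                          // partial sets merged on the driver via addInPlace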

There are several built-in implementations; taking IntAccumulatorParam as an example:

implicit object IntAccumulatorParam extends AccumulatorParam[Int] {
  def addInPlace(t1: Int, t2: Int): Int = t1 + t2
  def zero(initialValue: Int): Int = 0
}

Note that IntAccumulatorParam implements the trait AccumulatorParam[T]:

trait AccumulatorParam[T] extends AccumulableParam[T, T] {
  def addAccumulator(t1: T, t2: T): T = {
    addInPlace(t1, t2)
  }
}

After the accumulation on each node is finished, the updated value_ is then returned and merged by Accumulators.

Aggregation

In the run method of Task.scala, the following happens:

// Run the task and return the accumulators
// Calls collectAccumulators on TaskContextImpl; the return value is a Map
(runTask(context), context.collectAccumulators())
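
A simplified model of the executor-side bookkeeping behind these calls (registerAccumulator during deserialization, collectAccumulators at the end of the task) might look like the sketch below; this is an illustration of the idea, not Spark's actual TaskContextImpl:

import scala.collection.mutable
import org.apache.spark.Accumulable

// Illustration only: a minimal per-task accumulator registry
class SimpleTaskContext {
  private val accumulators = mutable.HashMap[Long, Accumulable[_, _]]()

  // Called from Accumulable.readObject while the task is being deserialized
  def registerAccumulator(a: Accumulable[_, _]): Unit = synchronized {
    accumulators(a.id) = a
  }

  // Called at the end of Task.run; the resulting Map[Long, Any] travels back to the driver
  def collectAccumulators(): Map[Long, Any] = synchronized {
    accumulators.map { case (id, acc) => (id, acc.localValue) }.toMap
  }
}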

Once the Executor has finished this series of operations, the resulting values need to be returned to the Driver and aggregated there. The overall execution flow of the accumulator is shown below:
[Figure: accumulator execution flow]
From the execution flow we can see that after collectAccumulators has been called, DAGScheduler eventually invokes updateAccumulators(event), which in turn calls the add method of Accumulators, thereby completing the aggregation:

def add(values: Map[Long, Any]): Unit = synchronized {
  // Iterate over the incoming values
  for ((id, value) <- values) {
    if (originals.contains(id)) {
      // Since we are now storing weak references, we must check whether the underlying data
      // is valid.
      // Look up the registered accumulator for this id
      originals(id).get match {
        // Accumulate the value; the result ends up in value
        // ++= is an overloaded operator
        case Some(accum) => accum.asInstanceOf[Accumulable[Any, Any]] ++= value
        case None =>
          throw new IllegalAccessError("Attempted to access garbage collected Accumulator.")
      }
    } else {
      logWarning(s"Ignoring accumulator update for unknown accumulator id $id")
    }
  }
}
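
As a concrete trace of that merge (the ids and values here are made up for illustration): for an Int accumulator, ++= ultimately calls IntAccumulatorParam.addInPlace, so the driver-side value is just the running sum of the per-task values.

// Illustration only: simulate the driver-side merge of two task updates
// for an Int accumulator whose driver-side value starts at 0
def addInPlace(r1: Int, r2: Int): Int = r1 + r2        // what IntAccumulatorParam does

val taskUpdates: Seq[Map[Long, Any]] = Seq(Map(1L -> 40), Map(1L -> 60)) // hypothetical task results
var driverValue = 0
for (update <- taskUpdates; (_, v) <- update) {
  driverValue = addInPlace(driverValue, v.asInstanceOf[Int]) // what ++= does for this accumulator
}
println(driverValue) // 100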

Getting the accumulator's value

The accumulator's value can be read on the Driver via accum.value.

At this point, the walkthrough of the accumulator is complete.

Source: blog.51cto.com/14309075/2413995