Spark accumulators and broadcast variables

Accumulators

Accumulators are used to aggregate information. When you pass a function to Spark, for example with map() or when passing a condition to filter(), the function can use variables defined in the driver program, but every task running on the cluster gets its own new copy of those variables, and updating these copies does not affect the corresponding variables in the driver. If we want all partitions to update a shared variable while processing data, an accumulator achieves the desired effect.
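A minimal sketch of the closure-copy behaviour described above, assuming a running SparkContext named sc (the sample data is made up for illustration):

// On a cluster, each task gets its own copy of blankLines, so the driver's value
// is not updated reliably (local mode can behave differently, which is exactly why
// plain variables should not be used this way).
var blankLines = 0
sc.parallelize(Seq("hello", "", "world", "")).foreach { line =>
  if (line == "") blankLines += 1   // updates a task-local copy
}
println(blankLines)                 // typically still 0 on the driver

// An accumulator is the supported way to update a shared counter from tasks:
val acc = sc.longAccumulator("blankLines")
sc.parallelize(Seq("hello", "", "world", "")).foreach { line =>
  if (line == "") acc.add(1)
}
println(acc.value)                  // 2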

System accumulators
For an input file, if we want to count the number of blank lines in it, we can write the following program:

scala> val notice = sc.textFile("/hyk/spark/words.txt")
notice: org.apache.spark.rdd.RDD[String] = /hyk/spark/words.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val blanklines = sc.longAccumulator("MyAccumulator")
blanklines: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(MyAccumulator), value: 0)

scala> :paste
// Entering paste mode (ctrl-D to finish)

 val tmp = notice.flatMap(line => {
         if (line == "") {
            blanklines.add(1)
         }
         line.split(" ")
      })

// Exiting paste mode, now interpreting.

tmp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <pastie>:27

scala> tmp.count()
res0: Long = 17

scala> blanklines.value
res1: Long = 5

The accumulator is used as follows:
Create an accumulator with an initial value by calling SparkContext.accumulator(initialValue) in the driver program. The return value is an org.apache.spark.Accumulator[T] object, where T is the type of the initial value initialValue. Executor code running inside Spark closures can increase the accumulator's value with the += method (add in Java). The driver program can access the accumulator's value through its value property (value() or setValue() in Java).
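The description above refers to the legacy accumulator API (deprecated since Spark 2.0), while the example earlier in this section used its Spark 2.x replacement, longAccumulator. A minimal sketch of both, assuming a running SparkContext sc:

// Legacy (pre-2.0) API: created from an initial value, incremented with +=
val oldAcc = sc.accumulator(0)
sc.parallelize(1 to 10).foreach(_ => oldAcc += 1)
println(oldAcc.value)                    // 10

// Spark 2.x built-in replacement: named long accumulator, incremented with add()
val newAcc = sc.longAccumulator("counter")
sc.parallelize(1 to 10).foreach(_ => newAcc.add(1))
println(newAcc.value)                    // 10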

Note: tasks running on worker nodes cannot access the accumulator's value. From the point of view of these tasks, the accumulator is a write-only variable.

For an accumulator used in an action operation, Spark applies each task's update to the accumulator only once. Therefore, if we want an accumulator that is absolutely reliable even in the presence of failures or repeated computation, we must put it inside an action such as foreach(). For an accumulator used in a transformation operation, the updates may be applied more than once.
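A minimal sketch of this difference, assuming a running SparkContext sc; the exact counts assume the RDD is not cached:

val acc = sc.longAccumulator("seen")
val data = sc.parallelize(Seq("a", "", "b", ""))

// Update inside a transformation: the update runs every time the RDD is (re)computed.
val mapped = data.map { line => acc.add(1); line }
mapped.count()            // acc.value is now 4
mapped.collect()          // the uncached RDD is recomputed; acc.value is now 8

// Update inside an action: Spark applies each task's update exactly once,
// even if a failed task is retried.
val acc2 = sc.longAccumulator("blank")
data.foreach(line => if (line == "") acc2.add(1))
println(acc2.value)       // 2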

Custom accumulators
The ability to define custom accumulator types was already available in version 1.x, but it was rather cumbersome to use. Since version 2.0 the usability of accumulators has been greatly improved, and an official abstract class, AccumulatorV2, now provides a friendlier way to implement accumulators of custom types. Implementing a custom accumulator requires extending AccumulatorV2 and overriding at least the methods shown in the example below. The accumulator below can be used to collect some text information while the program is running, finally returning it as a java.util.ArrayList[String].

package cn.zut.bigdata
import java.util

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

class WordAccumulator extends AccumulatorV2[String, util.ArrayList[String]] {

  val list = new util.ArrayList[String]()

  // Whether the accumulator is currently in its initial (zero) state
  override def isZero: Boolean = {
    list.isEmpty
  }

  // Create a copy of this accumulator
  override def copy(): AccumulatorV2[String, util.ArrayList[String]] = {
    new WordAccumulator()
  }

  // Reset the accumulator
  override def reset(): Unit = {
    list.clear()
  }

  // Add data to the accumulator
  override def add(v: String): Unit = {
    if (v.contains("h")){
      list.add(v)
    }
  }

  // Merge another accumulator of the same type into this one
  override def merge(other: AccumulatorV2[String, util.ArrayList[String]]): Unit = {
    list.addAll(other.value)
  }

  // Return the accumulator's result
  override def value: util.ArrayList[String] = {
    list
  }
}

// Use the custom accumulator
object WordAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordAccumulator").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val dataRDD: RDD[String] = sc.makeRDD(List("hadoop","hive","hbase","scala","spark"))

    // Create the accumulator
    val accumulator = new WordAccumulator
    // Register the accumulator with the SparkContext
    sc.register(accumulator)

    dataRDD.foreach { word =>
      // Perform the accumulation on the executors
      accumulator.add(word)
    }
    // Read the accumulator's value
    println("sum=" + accumulator.value)
    sc.stop()
  }
}
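When run locally, this program should print something like sum=[hadoop, hive, hbase]: only the words containing the letter h are added, and the order of the elements can vary because the per-task lists are merged in whatever order the tasks finish.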

Broadcast variables (Tuning Strategy)

Broadcast variables are used to distribute large objects efficiently. A broadcast variable sends a larger read-only value to all worker nodes for use in one or more Spark operations. For example, if your application needs to send a large read-only lookup table to all nodes, or even a large feature vector in a machine learning algorithm, broadcast variables are very convenient. Without them, when the same variable is used in multiple parallel operations, Spark sends it separately with every task.

scala>  val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(2)

scala> broadcastVar.value
res2: Array[Int] = Array(1, 2, 3)

The process of using a broadcast variable is as follows:
(1) Create a Broadcast[T] object by calling SparkContext.broadcast on an object of type T. Any serializable type can be used this way.
(2) Access the object's value through its value property (the value() method in Java).
(3) The variable is sent to each node only once and should be treated as a read-only value (modifying it does not propagate to other nodes).
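A minimal sketch of the lookup-table use case mentioned above, assuming a running SparkContext sc; the table and its contents are made up for illustration:

// Broadcast a read-only lookup table once per executor instead of shipping it with every task.
val countryNames = Map("CN" -> "China", "US" -> "United States", "DE" -> "Germany")
val bcNames = sc.broadcast(countryNames)               // step (1): Broadcast[Map[String, String]]

val codes = sc.parallelize(Seq("CN", "DE", "FR"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))  // step (2): read with .value
resolved.collect().foreach(println)                    // China, Germany, unknown

// Step (3): the broadcast value is read-only; release it when it is no longer needed.
bcNames.unpersist()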

RDD-related concepts and their relationships

The input may be stored on HDFS as multiple files, and each file consists of many blocks, called Blocks. When Spark reads these files as input, it parses them according to the InputFormat corresponding to the specific data format, and usually combines several Blocks into one input split, called an InputSplit; note that an InputSplit cannot span files. Spark then generates a concrete Task for each of these input splits, so InputSplits and Tasks have a one-to-one relationship. Each of these Tasks is then assigned to an Executor on some node of the cluster for execution.

  1. Each node can run one or more Executors.
  2. Each Executor consists of several cores, and each core of an Executor can execute only one Task at a time.
  3. The result of each Task's execution is one partition of the target RDD.
    Note: the cores here are virtual cores, not the machine's physical CPU cores; a core can be understood as a worker thread of the Executor. The number of Tasks executed concurrently = number of Executors * number of cores per Executor. As for the number of partitions:
  4. In the data-reading stage, for example sc.textFile, as many initial Tasks are needed as the number of InputSplits the input file is divided into.
  5. In the Map stage, the number of partitions remains unchanged.
  6. In the Reduce stage, aggregating the RDD triggers a shuffle operation, and the number of partitions of the aggregated RDD depends on the specific operation; for example, repartition produces the specified number of partitions, and some operators are configurable.
    When an RDD is computed, each partition launches one task, so the number of partitions of the RDD determines the total number of tasks. The number of compute nodes (Executors) requested and the number of cores per compute node determine how many tasks can run in parallel at the same time.
    For example, if the RDD has 100 partitions, 100 tasks will be generated when it is computed. If your resources are 10 compute nodes with 2 cores each, 20 tasks can run in parallel at a time, so computing this RDD takes 5 rounds. With the same resources, 101 tasks would need 6 rounds, and in the last round only one task runs while the remaining cores sit idle. With the same resources, if the RDD has only 2 partitions, only 2 tasks run at a time and the remaining 18 cores sit idle, wasting resources. This is why, in Spark tuning, increasing the number of RDD partitions is the way to increase task parallelism.
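A minimal sketch of inspecting and adjusting the partition count described above, assuming a running SparkContext sc; the numbers follow the 10-node, 2-core example:

val rdd = sc.parallelize(1 to 1000)       // for file input, sc.textFile(path, minPartitions) also sets the split count
println(rdd.getNumPartitions)             // number of partitions = number of tasks in this stage

// With 10 executors x 2 cores, at most 20 tasks run concurrently; repartitioning to
// a multiple of 20 keeps every core busy in each round.
val tuned = rdd.repartition(40)
println(tuned.getNumPartitions)           // 40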