Advanced Spark RDD Programming (Extended)

One, Accumulators

Accumulators are used to aggregate information. Normally, when you pass a function to Spark, for example with map() or a filter() predicate, it can use variables defined in the driver program, but each task running in the cluster gets its own new copy of those variables, and updating the copies does not affect the corresponding variables back in the driver. If we want to update a shared variable while processing all partitions, an accumulator gives us exactly that.
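To illustrate the copy-per-task behavior, here is a minimal sketch (assuming an existing SparkContext named sc and execution on a cluster; the variable names are made up for the example): the counter incremented inside the tasks is a per-task copy, so the driver's variable stays unchanged.

var counter = 0
val lines = sc.parallelize(Seq("spark", "", "rdd", ""))

// Each task increments its own deserialized copy of `counter`.
lines.foreach(line => if (line == "") counter += 1)

// On a cluster the driver-side variable is unchanged.
println(counter) // 0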

1.1 System accumulator

For an input log file, if we want to count the number of blank lines in the file, we can write the following program:

scala> val notice = sc.textFile("./NOTICE")
notice: org.apache.spark.rdd.RDD[String] = ./NOTICE MapPartitionsRDD[40] at textFile at <console>:32

scala> val blanklines = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
blanklines: org.apache.spark.Accumulator[Int] = 0

scala> val tmp = notice.flatMap(line => {
     |   if (line == "") {
     |     blanklines += 1
     |   }
     |   line.split(" ")
     | })
tmp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[41] at flatMap at <console>:36

scala> tmp.count()
res31: Long = 3213

scala> blanklines.value
res32: Int = 171

The usage of the accumulator is as follows:

Create an accumulator with an initial value by calling SparkContext.accumulator(initialValue) in the driver program. The return value is an org.apache.spark.Accumulator[T] object, where T is the type of the initial value. Code running inside Spark closures can increase the accumulator with the += method (add in Java). The driver program can then read the accumulator through its value property (value() or setValue() in Java).

Note: tasks running on worker nodes cannot access the accumulator's value. From the perspective of these tasks, an accumulator is a write-only variable.

For accumulators used in action operations, Spark applies each task's update to each accumulator only once. Therefore, if we want an accumulator that is absolutely reliable in the face of failures or repeated computation, we must put it in an action operation such as foreach(). In transformation operations, an accumulator may be updated more than once.
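The sc.accumulator call used earlier is what triggers the deprecation warning: since Spark 2.0, the built-in LongAccumulator is preferred. Here is a minimal sketch of the same blank-line count with that API, updated inside the foreach() action so each task's contribution is applied exactly once (assuming an existing SparkContext sc and the ./NOTICE file from the earlier example):

// Built-in long accumulator, available since Spark 2.0
// (replaces the deprecated sc.accumulator used above).
val blankLines = sc.longAccumulator("blankLines")

// foreach() is an action, so each task's update is applied exactly once,
// even if a failed task is re-run.
sc.textFile("./NOTICE").foreach { line =>
  if (line == "") blankLines.add(1)
}

// Only the driver reads the result; tasks treat the accumulator as write-only.
println(blankLines.value)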

1.2 Custom accumulator

Custom accumulator types were already supported in the 1.x versions, but they were cumbersome to use. Since version 2.0 the usability of accumulators has improved greatly, and the official API provides a new abstract class, AccumulatorV2, which offers a friendlier way to implement accumulators of custom types. To implement a custom accumulator you need to extend AccumulatorV2 and override at least the methods shown in the example below. The following accumulator collects pieces of text while the program runs and finally returns them as a Set[String].

import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._

// An accumulator that collects strings into a Set[String].
class LogAccumulator extends org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]] {

  private val _logArray: java.util.Set[String] = new java.util.HashSet[String]()

  // The accumulator is "zero" when no strings have been collected yet.
  override def isZero: Boolean = {
    _logArray.isEmpty
  }

  override def reset(): Unit = {
    _logArray.clear()
  }

  // Called on executors to add one string to this accumulator.
  override def add(v: String): Unit = {
    _logArray.add(v)
  }

  // Called on the driver to merge the partial results from the tasks.
  override def merge(other: org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]]): Unit = {
    other match {
      case o: LogAccumulator => _logArray.addAll(o.value)
    }
  }

  override def value: java.util.Set[String] = {
    java.util.Collections.unmodifiableSet(_logArray)
  }

  override def copy(): org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]] = {
    val newAcc = new LogAccumulator()
    _logArray.synchronized {
      newAcc._logArray.addAll(_logArray)
    }
    newAcc
  }
}

// Filter out the elements that contain letters and collect them in the accumulator.
object LogAccumulator {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogAccumulator")
    val sc = new SparkContext(conf)

    val accum = new LogAccumulator
    // Register the custom accumulator with the SparkContext under a name.
    sc.register(accum, "logAccum")
    val sum = sc.parallelize(Array("1", "2a", "3", "4b", "5", "6", "7cd", "8",
      "9"), 2).filter(line => {
      val pattern = """^-?(\d+)"""
      val flag = line.matches(pattern)
      if (!flag) {
        // Record every element that is not a pure number.
        accum.add(line)
      }
      flag
    }).map(_.toInt).reduce(_ + _)

    println("sum: " + sum)
    for (v <- accum.value) print(v + " ")
    println()
    sc.stop()
  }
}
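With this input the filter rejects "2a", "4b", and "7cd", so the job should print sum: 32 (1 + 3 + 5 + 6 + 8 + 9) followed by the three rejected strings collected by the accumulator, in no particular order.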

Two, Broadcast variables (a tuning strategy)

Broadcast variables are used to distribute larger objects efficiently: a large read-only value is sent to all worker nodes for use in one or more Spark operations. For example, if your application needs to send a large read-only lookup table to all nodes, or a large feature vector in a machine learning algorithm, broadcast variables come in handy. Without broadcasting, when the same variable is used in multiple parallel operations, Spark sends it separately for every task.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(35)

scala> broadcastVar.value
res33: Array[Int] = Array(1, 2, 3)

The process of using broadcast variables is as follows (a short sketch follows the steps):

(1) Create a Broadcast[T] object by calling SparkContext.broadcast on an object of type T. This works for any serializable type.

(2) Access the object's value through the value property (the value() method in Java).

(3) The variable is sent to each node only once and should be treated as a read-only value (updating it will not propagate the change to other nodes).
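As a concrete illustration, here is a minimal sketch (assuming an existing SparkContext sc; the lookup table and its contents are invented for the example) that broadcasts a small lookup table and reads it inside a transformation:

// A small read-only lookup table, shipped to each executor once.
val countryNames = Map("cn" -> "China", "fr" -> "France", "us" -> "United States")
val bcNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("cn", "fr", "de"))

// Tasks read the shared copy via .value; unknown codes fall back to the code itself.
val resolved = codes.map(code => bcNames.value.getOrElse(code, code))

println(resolved.collect().mkString(", ")) // China, France, de

// Optionally release the cached copies on the executors when no longer needed.
bcNames.unpersist()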
