Spark from 0 to 1 (4): Apache Spark Broadcast Variables and Accumulators

1. Broadcast variables

1.1 Broadcast variable understanding diagram

(Figure: diagram of how a broadcast variable is distributed from the Driver to the Executors; image not included.)

1.2 Use of broadcast variables

import org.apache.spark.{SparkConf, SparkContext}

object SparkBroadCast {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("SparkBroadCast")
    val sparkContext = new SparkContext(conf)
    val list = List("spark", "hadoop", "flink")
    // Broadcast the list from the Driver
    val broadcast = sparkContext.broadcast(list)
    val wordList = List("spark", "hadoop", "java", "python")
    sparkContext.parallelize(wordList).filter(word => {
      // Read the broadcast value on the Executor side
      val value = broadcast.value
      value.contains(word)
    }).foreach(println)
    sparkContext.stop()
  }
}
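
Run locally, this prints only the words from wordList that also appear in the broadcast list, i.e. spark and hadoop.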

1.3 Notes

  • Can an RDD be broadcast using broadcast variables?

    No. An RDD does not store the data itself, so it cannot be broadcast directly; you can instead collect the RDD's result to the Driver and broadcast that (see the sketch after this list).

  • Broadcast variables can only be defined on the Driver side, not on the Executor side.

  • The value of a broadcast variable can only be modified on the Driver side; on the Executor side it is read-only.
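
A minimal sketch of the workaround from the first point above: collect the RDD's result on the Driver and broadcast that, rather than the RDD itself (the variable names are illustrative and reuse the sparkContext from section 1.2):

val keywordsRdd = sparkContext.parallelize(List("spark", "hadoop", "flink"))
// collect() pulls the RDD's data back to the Driver as a local collection
val keywords = keywordsRdd.collect().toSet
// the collected result is an ordinary local value, so it can be broadcast
val keywordsBroadcast = sparkContext.broadcast(keywords)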

2. Accumulator

2.1 The accumulator understanding diagram

(Figure: diagram of how accumulator updates made on the Executors are aggregated back on the Driver; image not included.)

2.2 The use of accumulator

import org.apache.spark.{SparkConf, SparkContext}

object SparkAccumulator {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("SparkAccumulator")
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // Define a long accumulator on the Driver side
    val acc = sparkContext.longAccumulator
    val list = List("kafka", "java", "spark", "hadoop", "flink")
    // Iterate over the RDD, adding 1 for every word, to count them
    sparkContext.parallelize(list).foreach(word => acc.add(1))
    val value = acc.value
    println(value)
    sparkContext.stop()
  }
}
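
Because the list has five elements and each one adds 1 to the accumulator, this prints 5.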

2.3 Custom accumulator

  1. Extend AccumulatorV2, with both the input and output types defined as Int

    package com.abcft.spark.acc

    import org.apache.spark.util.AccumulatorV2

    // A custom accumulator whose input and output types are both Int
    class CustomAcc extends AccumulatorV2[Int, Int] {

      // the accumulated value; zero means nothing has been accumulated yet
      var acc = 0

      override def isZero: Boolean = acc == 0

      override def copy(): AccumulatorV2[Int, Int] = {
        val v = new CustomAcc()
        v.acc = this.acc
        v
      }

      override def reset(): Unit = {
        // reset this accumulator's own value instead of creating a new instance
        this.acc = 0
      }

      override def add(v: Int): Unit = {
        this.acc = this.acc + v
      }

      override def merge(other: AccumulatorV2[Int, Int]): Unit = {
        this.acc = other.value + this.acc
      }

      override def value: Int = this.acc
    }
    
  2. Use a custom accumulator

    import com.abcft.spark.acc.CustomAcc
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkCustomAcc {

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setMaster("local")
        conf.setAppName("SparkCustomAcc")
        val sparkContext = new SparkContext(conf)
        // Create the custom accumulator and register it with the SparkContext
        val acc = new CustomAcc()
        sparkContext.register(acc, "CustomAcc")
        val list = List("kafka", "java", "spark", "hadoop", "flink")
        // Iterate over the RDD, adding 1 for every word, to count them
        sparkContext.parallelize(list).foreach(word => acc.add(1))
        val value = acc.value
        println(value)
        sparkContext.stop()
      }
    }
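
    As with the built-in long accumulator above, each of the five words adds 1, so this prints 5.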
    

2.4 Key points

An accumulator is defined, and given its initial value, on the Driver side. Its value can only be read on the Driver side, and it can only be updated on the Executor side, as the sketch below illustrates.
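
A minimal sketch of this rule, using a named long accumulator (the accumulator name and sample data here are illustrative):

val errorCount = sparkContext.longAccumulator("errorCount") // defined on the Driver with initial value 0
sparkContext.parallelize(List("ok", "error", "ok")).foreach { line =>
  // updated on the Executors; errorCount.value is not meaningful here
  if (line == "error") errorCount.add(1)
}
println(errorCount.value) // read back on the Driver: prints 1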

Original post: blog.csdn.net/dwjf321/article/details/109048090