Spark - Broadcast Variables & Accumulators

Broadcast variables

1. What broadcast variables are

When we need to distribute a large object across the cluster, for example a dictionary, a set, or a black/white list, it is shipped from the Driver side. If this variable is not declared as a broadcast variable, every task receives its own copy, so when the number of tasks is large the Driver's bandwidth becomes a bottleneck and a lot of memory is wasted on the worker nodes. If the variable is declared as a broadcast variable, only one copy is sent to each executor, and all tasks launched by that executor share it, which saves both memory and network traffic.
For example, if a Spark application has 50 executors and 1000 tasks, and the data to distribute is 10 MB, then without a broadcast variable it needs 10 MB * 1000 = 10 GB of memory in total, while with a broadcast variable it only needs 10 MB * 50 = 500 MB.

2. Code samples
package com.test.bigdata

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastApp {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("local[2]").setAppName("SparkContextApp")

    val sc = new SparkContext(sparkConf)

    //    commonJoin(sc)

    broadcastJoin(sc)
    sc.stop()
  }

  def broadcastJoin(sc: SparkContext): Unit = {
    // Assume a1 is a small table; collect it to the Driver as a local Map
    val a1 = sc.parallelize(Array(("1", "大米"), ("2", "土豆"), ("29", "小花"))).collectAsMap()
    // broadcast it to every executor
    val a1Broadcast = sc.broadcast(a1)

    // register a long accumulator and add 1 on the Driver side (accumulators are covered below)
    sc.longAccumulator("").add(1)

    val f11 = sc.parallelize(Array(("29", "深圳", 18), ("10", "北京", 2)))
      .map(x => (x._1, x))

    f11.mapPartitions(partition => {
      // read the broadcast value inside the task
      val a1Stus = a1Broadcast.value
      for ((key, value) <- partition if a1Stus.contains(key))
        yield (key, a1Stus.getOrElse(key, ""), value._2, value._3)
    }).collect().foreach(println) // trigger the job and print the joined rows
  }

  def commonJoin(sc: SparkContext): Unit = {

    // a1 join f11 on a1.id = f11.id   ==> 29,"小花","深圳",18
    val a1 = sc.parallelize(Array(("1", "大米"), ("2", "土豆"), ("29", "小花"))).map(x => (x._1, x))

    val f11 = sc.parallelize(Array(("29", "深圳", 18), ("10", "北京", 2))).map(x => (x._1, x))

    a1.join(f11).map(x => {
      x._1 + " , " + x._2._1._2 + " , " + x._2._2._2 + " , " + x._2._2._3
    }).collect()


  }
}
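
For reference (not output shown in the original post): with the collect().foreach(println) call added at the end of broadcastJoin, running it in local mode should print the single matching row, roughly (29,小花,深圳,18), which matches the result the comment in commonJoin describes.
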
3. Notes
  • Broadcast variables should not be too large.
  • Broadcast variables are read-only: to change the value, modify the variable on the Driver side and broadcast it again (see the sketch after these notes); the value cannot be modified on the Executor side.
  • The data must be materialized by an RDD action before it can be broadcast; you cannot broadcast an RDD directly:
    val a1 = sc.parallelize(Array(("1", "大米"), ("2", "土豆"), ("29", "小花"))).collectAsMap()
    val a1Broadcast = sc.broadcast(a1) // broadcast
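
The re-broadcast note above can be illustrated with a short sketch. This is not code from the original post; it assumes the sc, a1 and a1Broadcast values defined in broadcastJoin, and the extra entry "30" -> "白菜" is made up for illustration:

    // build the updated value on the Driver side (hypothetical new entry)
    val a1Updated = a1.toMap + ("30" -> "白菜")
    // drop the stale copies cached on the executors
    a1Broadcast.unpersist()
    // broadcast the new value and use a1BroadcastV2 from now on
    val a1BroadcastV2 = sc.broadcast(a1Updated)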

Accumulators

1. What an accumulator is

In Spark applications we often need to count things globally, for example to monitor exceptions, to help debugging, or to count records that match some condition; this calls for a counter. If an ordinary variable is used, it is not aggregated back to the Driver: at runtime each task only works on its own copy of the variable, and the original value on the Driver never changes. Once the variable is declared as an accumulator, the updates made by all tasks are merged into a distributed count.

2. Using an accumulator
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setMaster("local").setAppName("accumulator")
val sc = new SparkContext(conf)
// define the accumulator
val accumulator = sc.accumulator(0)
// accumulate in a distributed way: each record adds 1 on the executors
sc.textFile("./words.txt").foreach { x => accumulator.add(1) }
// read the accumulated result on the Driver
println(accumulator.value)
sc.stop()
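
sc.accumulator is the older API; newer Spark versions expose named accumulators such as longAccumulator, which the broadcast example above already calls. Below is a minimal sketch of the same counter written with that API (an assumption, not from the original post; the ./words.txt path is reused from the snippet above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("long-accumulator")
val sc = new SparkContext(conf)
// register a named long accumulator with the SparkContext
val lineCount = sc.longAccumulator("lineCount")
// each task adds to its local copy; Spark merges the copies back on the Driver
sc.textFile("./words.txt").foreach(_ => lineCount.add(1))
// read the merged value on the Driver after the action has finished
println(lineCount.value)
sc.stop()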

Origin blog.csdn.net/aubekpan/article/details/88959688