Spark broadcast variables

Broadcast variables :
       Broadcast variables distribute large objects efficiently: they send a large read-only value to all worker nodes for use in one or more Spark operations.
       For example, if your application needs to send a large read-only lookup table to all nodes, or a large feature vector in a machine learning algorithm,
broadcast variables are a convenient way to do it.

Broadcasting variables :
       When a variable is broadcast, its value on the Driver side is sent to each Worker side, and each Worker side receives only one copy of the value, which the tasks running there share.
Application scenario :
      After a job is submitted, tasks may need one or more values from the Driver side during the calculation. A Driver-side variable that is referenced directly inside an operator is shipped along with every task, so a large value causes a large amount of network IO.
       In that case it is better to broadcast the value from the Driver side to each Worker side first. Tasks then read the value locally during the calculation, which improves efficiency and reduces network IO.
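
The scenario above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `countryNames` lookup table, the `BroadcastLookupSketch` object, the `resolveCodes` helper and the sample codes are all hypothetical, and `local[2]` is used only so that the sketch can run on a single machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {

  // Hypothetical read-only lookup table, created on the Driver side.
  val countryNames: Map[String, String] =
    Map("cn" -> "China", "fr" -> "France", "jp" -> "Japan")

  // Hypothetical helper: resolves country codes using a broadcast copy of the table.
  def resolveCodes(sc: SparkContext, codes: Seq[String]): Array[String] = {
    // Broadcast the table so it is not serialized into every task.
    val namesBc = sc.broadcast(countryNames)
    sc.parallelize(codes)
      // Each task reads the local copy through .value; unknown codes map to "unknown".
      .map(code => namesBc.value.getOrElse(code, "unknown"))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastLookupSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    resolveCodes(sc, Seq("cn", "jp", "xx", "fr")).foreach(println)
    sc.stop()
  }
}
```

The table is wrapped once with `sc.broadcast` and every task reads it through `.value`, instead of referencing `countryNames` directly inside the `map` function.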
Note : The broadcast value must be a concrete value. An RDD cannot be broadcast, because an RDD is only a description of data and holds no concrete value. To broadcast the contents of an RDD, first bring the data behind the RDD to the Driver side (for example with collect) and then broadcast that data. Broadcast data is immutable.

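import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName(this.getClass.getName).setMaster("local[2]")
    val sc = new SparkContext(conf)
    // list is created on the Driver side, so it is an ordinary local variable
    val list = List("hello java")
    // Wrap the list in a broadcast variable. sc.broadcast is a SparkContext method,
    // not an RDD transformation: it returns the Broadcast handle immediately
    val broadcast = sc.broadcast(list)
    // The functions passed to operators run on the Executor side
    val lines = sc.parallelize(List("dd ff","ff ffd"))
    // Use the broadcast variable in the computation; value returns the broadcast value.
    // List.contains tests for an equal element, so only a line identical to an entry
    // of the list passes the filter; with this sample data no line matches
    val filterStr = lines.filter(broadcast.value.contains(_))
    filterStr.foreach(println)

    sc.stop()
  }
}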
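
The note about RDDs can also be sketched in code. The following is a minimal example, assuming the data behind the RDD is small enough to fit in Driver memory. The `BroadcastRddSketch` object, the `matchLines` helper, the `keywordsRdd` name and the sample data are hypothetical and were chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastRddSketch {

  // Hypothetical helper: keeps the lines that contain a broadcast keyword as a whole word.
  def matchLines(sc: SparkContext, keywordList: Seq[String], lineList: Seq[String]): Array[String] = {
    // A small RDD whose contents every task should be able to see.
    val keywordsRdd = sc.parallelize(keywordList)

    // An RDD has no concrete value, so bring its data to the Driver side first.
    // Assumption: the collected data fits in Driver memory.
    val keywords: Set[String] = keywordsRdd.collect().toSet

    // Broadcast the concrete, immutable Set rather than the RDD itself.
    val keywordsBc = sc.broadcast(keywords)

    sc.parallelize(lineList)
      .filter(line => line.split(" ").exists(keywordsBc.value.contains))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastRddSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    matchLines(sc, Seq("ff", "java"), Seq("dd ff", "ff ffd", "hello scala")).foreach(println)
    sc.stop()
  }
}
```

Converting the collected array to a `Set` is a design choice for this sketch: it makes the membership test inside `filter` cheap, and the broadcast value stays immutable.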

Origin blog.csdn.net/qq_42706464/article/details/108440617