Spark: Broadcast Variables

I.

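When writing a Spark program, the Driver may read a data set from its local file system, so the data exists only on the machine where the Driver runs. If the Executors also need that data, the Driver has to broadcast it to them.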

  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   *
   * @param value value to broadcast to the Spark nodes
   * @return `Broadcast` object, a read-only variable cached on each machine
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }

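In short, broadcast sends a read-only variable to the cluster and returns an org.apache.spark.broadcast.Broadcast object for reading it inside distributed functions. The variable is sent only once, the value parameter is the data to broadcast to the Spark nodes, and the returned Broadcast object is a read-only variable cached on each machine.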

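Calling the broadcast method ships the data to the Executors and returns a reference to the broadcast data.

The Executors can then use the data directly through this reference: Tasks are generated on the Driver, so the broadcast reference is sent to each Executor along with the Task. To read the data, call .value on the returned reference.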
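The flow described above can be sketched as follows. This is a minimal illustration, not code from the original post: the object name BroadcastLookupExample, the countryNames map, and the local[*] master are assumptions chosen so the example runs on a single machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastLookupExample {
  // Hypothetical lookup table that exists only on the Driver
  // (in a real job it might be loaded from a local file).
  val countryNames: Map[String, String] =
    Map("CN" -> "China", "US" -> "United States")

  def resolve(sc: SparkContext, codes: Seq[String]): Array[String] = {
    // Broadcast the Driver-side map; bc is only a lightweight reference.
    val bc: Broadcast[Map[String, String]] = sc.broadcast(countryNames)

    sc.parallelize(codes)
      // The closure captures bc, which travels with the Task;
      // bc.value is evaluated on the Executor.
      .map(code => bc.value.getOrElse(code, "unknown"))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch without a cluster.
    val conf = new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(resolve(sc, Seq("CN", "US", "FR")).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Only the reference bc is captured by the closure; the map itself is fetched on each Executor the first time bc.value is read there.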

II.

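Alternatively, the data may be read from HDFS. In that case each Executor holds only part of the data (its own partitions), yet it needs the complete data set. The solution is:
1. First collect the data held by each Executor to the Driver.
2. Then broadcast the complete data from the Driver to the Executors.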
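The two steps above might look like the sketch below. The names CollectThenBroadcastExample, label, rules, and keys are hypothetical, and sc.parallelize stands in for the HDFS read so the example runs locally; in the scenario described here the rules RDD would instead come from something like sc.textFile on an HDFS path, followed by parsing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectThenBroadcastExample {
  // rules is distributed: each Executor sees only its own partitions.
  def label(sc: SparkContext,
            rules: RDD[(String, String)],
            keys: RDD[String]): Array[String] = {
    // Step 1: collect every partition to the Driver to get the full data set.
    val allRules: Map[String, String] = rules.collect().toMap

    // Step 2: broadcast the complete map from the Driver to the Executors.
    val bc = sc.broadcast(allRules)

    keys.map(k => bc.value.getOrElse(k, "unknown")).collect()
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch without a cluster.
    val conf = new SparkConf().setAppName("collect-then-broadcast").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for data read from HDFS, split across two partitions.
      val rules = sc.parallelize(Seq("a" -> "alpha", "b" -> "beta"), numSlices = 2)
      val keys = sc.parallelize(Seq("a", "b", "c"))
      println(label(sc, rules, keys).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

This is also why broadcast rejects an RDD argument with "call collect() and broadcast the result". The approach only works when the collected data fits in the Driver's memory.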
