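1) Suppose a job has 10,000 tasks and each task carries its own copy of a 100 MB variable: that is roughly 10,000 × 100 MB ≈ 1 TB of data to ship, which is prohibitive.
So: 10,000 tasks ==> 100 executors. A broadcast variable is sent to executors rather than to tasks, and all tasks on the same executor share that one copy, so only about 100 × 100 MB ≈ 10 GB is transferred.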
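2) Use case
Map join: broadcast the small table's data so the large table can be joined against it on the map side, without shuffling the large table.
BroadcastJoin = MapJoin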
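A map join with a broadcast variable could look roughly like the sketch below. It is only an illustration under assumptions: the table contents, the object name `BroadcastMapJoinSketch`, the helper `joinWithSmall`, and the `local[*]` master are all made up for this example, and it expects Spark to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastMapJoinSketch {
  // Hypothetical helper: per-record lookup against the small table.
  // Returns None when the key has no match (inner-join semantics).
  def joinWithSmall(rec: (Int, String),
                    small: collection.Map[Int, String]): Option[(Int, String, String)] =
    small.get(rec._1).map(name => (rec._1, rec._2, name))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the idea out.
    val spark = SparkSession.builder()
      .appName("broadcast-map-join")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up data: a small lookup table and a larger fact table.
    val small = sc.parallelize(Seq(1 -> "Alice", 2 -> "Bob"))
    val large = sc.parallelize(Seq(1 -> "order-a", 2 -> "order-b", 3 -> "order-c", 1 -> "order-d"))

    // Collect the small table to the driver and broadcast it once per executor.
    val smallBc = sc.broadcast(small.collectAsMap())

    // Each task reads the shared copy through .value; the large table is never shuffled.
    val joined = large.flatMap(rec => joinWithSmall(rec, smallBc.value))

    joined.collect().foreach(println)
    spark.stop()
  }
}
```

The record with key 3 is dropped because the small table has no matching key. This only works when the small table fits comfortably in driver and executor memory. With DataFrames, the `broadcast` function in `org.apache.spark.sql.functions` hints the same join strategy.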
3) Description
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
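In other words, a broadcast variable keeps one read-only copy cached on each machine instead of shipping a copy with every task. In effect, each executor holds a single copy that its tasks can read directly. Spark also tries to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.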
4) Usage
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
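The transcript above reads the value on the driver only. A minimal sketch of reading it inside tasks and releasing it afterwards might look like this; the object name `BroadcastUsageSketch`, the helper `lookupTimesTen`, the sample indices, and the `local[*]` master are assumptions for illustration, and Spark must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastUsageSketch {
  // Hypothetical helper: the work each task does with the broadcast array.
  def lookupTimesTen(i: Int, table: Array[Int]): Int = table(i) * 10

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the idea out.
    val spark = SparkSession.builder()
      .appName("broadcast-usage")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val broadcastVar = sc.broadcast(Array(1, 2, 3))

    // Inside the closure, refer to broadcastVar.value rather than the original array,
    // so tasks use the executor-local copy instead of a copy serialized with each task.
    val scaled = sc.parallelize(Seq(0, 1, 2))
      .map(i => lookupTimesTen(i, broadcastVar.value))
      .collect()
    println(scaled.mkString(", "))

    // unpersist() drops the cached copies on executors; they are re-sent if used again.
    broadcastVar.unpersist()
    // destroy() releases all resources; the variable cannot be used afterwards.
    broadcastVar.destroy()

    spark.stop()
  }
}
```

Because the value is read-only, it should not be modified after it has been broadcast, so that every node sees the same data.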