Spark Performance Tuning: Using Broadcast Variables

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

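The text above is from the official Spark documentation. Roughly, it says:

         Broadcast variables let a program cache a read-only variable on each machine, instead of shipping a copy of that variable along with every task.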

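What are the benefits of using broadcast variables?

Take 50 executors, 1000 tasks, and a 10 MB map as an example of the difference between using a broadcast variable and not using one.

By default, 1000 tasks mean 1000 copies of the map: roughly 10 GB of data sent over the network, and roughly 10 GB of memory consumed across the cluster. With a broadcast variable, 50 executors mean 50 copies: about 500 MB sent over the network. Not every copy has to come from the driver, either; an executor may pull its copy from the BlockManager of a nearby executor, which makes the transfer much faster. Memory consumption likewise drops to about 500 MB. That is 10,000 MB versus 500 MB: a 20x (or greater) reduction in network transfer cost, and a 20x reduction in memory consumption.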
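The arithmetic behind these figures can be checked with a few lines of Java. This is only a back-of-the-envelope sketch under the same simplified model as above (one full copy per task without broadcast, one copy per executor with it). The class name BroadcastCostEstimate and the helper totalMb are made up for illustration and are not part of Spark.

```java
public class BroadcastCostEstimate {

    // Total megabytes involved when `copies` copies of a `sizeMb` MB object exist.
    public static long totalMb(long copies, long sizeMb) {
        return copies * sizeMb;
    }

    public static void main(String[] args) {
        long tasks = 1000;     // number of tasks in the example
        long executors = 50;   // number of executors in the example
        long mapSizeMb = 10;   // size of the shared map, in MB

        long withoutBroadcast = totalMb(tasks, mapSizeMb);   // one copy per task
        long withBroadcast = totalMb(executors, mapSizeMb);  // one copy per executor

        System.out.println("Without broadcast: " + withoutBroadcast + " MB");
        System.out.println("With broadcast:    " + withBroadcast + " MB");
        System.out.println("Reduction factor:  " + (withoutBroadcast / withBroadcast) + "x");
    }
}
```

Under this model the saving is simply tasks / executors, so it grows as more tasks run per executor.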

How do you use broadcast variables (excerpted from the official documentation)?

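// sc is a JavaSparkContext; Broadcast is org.apache.spark.broadcast.Broadcast
// Broadcast the data you want to share
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

// Read the broadcast data back through the value() method
broadcastVar.value();
// returns [1, 2, 3]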
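The snippet above only creates the variable and reads it back on the driver. The benefit shows up when tasks read the value on the executors. A minimal sketch of that, assuming spark-core is on the classpath and Spark runs in local mode, might look like the following. The class name BroadcastLookupExample, the method resolveNames, and the sample lookup table are hypothetical and chosen only for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastLookupExample {

    // Replaces each code with its name, reading the lookup table from a broadcast variable.
    public static List<String> resolveNames(JavaSparkContext sc,
                                            Map<String, String> table,
                                            List<String> codes) {
        // Broadcast the table so each executor caches one copy,
        // rather than capturing the map directly in the task closure.
        Broadcast<Map<String, String>> lookup = sc.broadcast(table);
        return sc.parallelize(codes)
                 // Tasks read the cached copy through value().
                 .map(code -> lookup.value().getOrDefault(code, "unknown"))
                 .collect();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("BroadcastLookupExample")
                .setMaster("local[2]"); // assumption: local mode, for demonstration only
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Map<String, String> table = new HashMap<>(); // hypothetical sample data
            table.put("cn", "China");
            table.put("us", "United States");
            List<String> names = resolveNames(sc, table, Arrays.asList("cn", "us", "xx"));
            System.out.println(names);
        }
    }
}
```

The lambda captures only the small Broadcast handle, not the map itself, so the table is not serialized into the closure of every task.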


Reposted from blog.csdn.net/u013164612/article/details/84651291