Spark tuning: broadcast variables

A broadcast variable is a read-only variable that the driver program (the one running the SparkContext) creates and sends to the nodes taking part in the computation. It suits applications in which many worker nodes need efficient access to the same data, such as machine learning. A broadcast variable is created by calling the broadcast method on the SparkContext.

A broadcast variable can also be read on nodes other than the driver (that is, on the workers) through the variable's value property.

Using broadcast variables makes better use of resources and improves performance.

 

Advantage of broadcast variables: instead of every task holding its own copy of the variable, the executor on each node holds a single copy. This greatly reduces the number of copies that are created.
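To put rough numbers on the saving, here is a small back-of-the-envelope sketch in plain Python. The job size (1000 tasks, 50 executors, a 100 MB table) is a made-up assumption for illustration, and the count covers only the copies held on the executor side.

```python
def copies_without_broadcast(num_tasks: int) -> int:
    """Without broadcasting, one copy of the variable travels with every task."""
    return num_tasks


def copies_with_broadcast(num_executors: int) -> int:
    """With broadcasting, one copy of the variable is kept per executor."""
    return num_executors


# Hypothetical job: 1000 tasks on 50 executors sharing a 100 MB lookup table.
num_tasks, num_executors, size_mb = 1000, 50, 100

per_task_mb = copies_without_broadcast(num_tasks) * size_mb
per_executor_mb = copies_with_broadcast(num_executors) * size_mb

print(per_task_mb)      # 100000
print(per_executor_mb)  # 5000
```

Under these assumed numbers the data shipped drops by a factor equal to the number of tasks per executor, here 20.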

 

When a broadcast variable is first created, the only copy is on the driver. When a task needs the broadcast data at run time, it first tries to get a copy from the BlockManager of its own executor. If there is no local copy, the BlockManager fetches one remotely, either from the driver or from the BlockManager of an executor on a closer node, which can speed up the fetch, and then stores the copy locally. The BlockManager is responsible for the data held in the memory and on the disk of its executor. Tasks that later run on that executor use the local copy in the BlockManager directly.
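The lookup order described above can be sketched with a toy model in plain Python. `ToyBlockManager` is a hypothetical class written for this illustration, not Spark's real BlockManager, and the rule "prefer a peer executor that already holds the block, otherwise ask the driver" is a simplifying assumption.

```python
from typing import Dict, List


class ToyBlockManager:
    """Toy stand-in for one node's block store; not Spark's real BlockManager."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, object] = {}
        self.fetched_from: List[str] = []  # remote sources used, in order

    def get(self, block_id: str, driver: "ToyBlockManager",
            peers: List["ToyBlockManager"]) -> object:
        # 1. Use the local copy if this node already has one.
        if block_id in self.blocks:
            return self.blocks[block_id]
        # 2. Otherwise fetch remotely: a peer that holds the block, else the driver.
        source = next((p for p in peers if block_id in p.blocks), driver)
        self.fetched_from.append(source.name)
        # 3. Keep the fetched copy locally for later tasks.
        self.blocks[block_id] = source.blocks[block_id]
        return self.blocks[block_id]


driver = ToyBlockManager("driver")
driver.blocks["broadcast_0"] = {"CN": "China"}
executor_1 = ToyBlockManager("executor-1")
executor_2 = ToyBlockManager("executor-2")

executor_1.get("broadcast_0", driver, peers=[executor_2])  # fetched from the driver
executor_2.get("broadcast_0", driver, peers=[executor_1])  # fetched from executor-1
executor_1.get("broadcast_0", driver, peers=[executor_2])  # served from the local copy

print(executor_1.fetched_from)  # ['driver']
print(executor_2.fetched_from)  # ['executor-1']
```

Each executor goes remote only once; every later request on the same executor is answered from its own store.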

 

In distributed execution, Spark has to ship the code to the tasks running on each executor. For fixed, read-only data (such as data read from a database), the driver would otherwise have to send the data to every task each time, which is inefficient. A broadcast variable lets the data be sent ahead of time, once, to each executor. Each task then gets the variable from the BlockManager of the executor on its own node rather than from the driver, which improves efficiency.

 

An executor fetches the broadcast data only when its first task starts. After that, every task on that executor gets the data from the BlockManager on its own node.

 

 


Instructions

 

  1. Call the SparkContext.broadcast method to create a Broadcast[T] object. Any serializable type can be broadcast this way.

  2. Access the object's value through its value property.

  3. The variable is sent to each node only once and should be treated as a read-only value (modifying the value on one node does not affect the other nodes).

 


Examples
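
A minimal sketch of the three usage steps in PySpark, assuming PySpark is installed and a local master is acceptable. The lookup table, the `resolve` helper, and the `run_with_spark` function are hypothetical names made up for this illustration; `SparkContext.broadcast` and the `value` property are the calls named in the steps.

```python
from typing import Dict, List

# Hypothetical lookup table, standing in for fixed data read from a database.
COUNTRY_NAMES: Dict[str, str] = {"CN": "China", "US": "United States", "DE": "Germany"}


def resolve(code: str, table: Dict[str, str]) -> str:
    """Map a country code to its name, falling back to 'unknown'."""
    return table.get(code, "unknown")


def run_with_spark(codes: List[str]) -> List[str]:
    """Run the lookup on a local Spark master; requires PySpark."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-example")
    try:
        # Step 1: create the Broadcast object on the driver.
        names = sc.broadcast(COUNTRY_NAMES)
        # Step 2: tasks read the data through the value property.
        # Step 3: the value is only read here, never modified.
        return sc.parallelize(codes).map(lambda c: resolve(c, names.value)).collect()
    finally:
        sc.stop()
```

Calling `run_with_spark(["CN", "XX", "DE"])` on a machine with PySpark runs the lookup inside Spark tasks, with each executor reading the table from its broadcast copy instead of receiving it with every task.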

 

 
 
 
  

Origin blog.csdn.net/qq_31780525/article/details/79535885