2-Spark-1 Performance tuning - Data skew 2 - Join/Broadcast usage scenarios

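Technical points: a join on RDDs can produce data skew. When the two RDDs are not very large (in practice, when one of them is small enough to fit in memory), a join-like operation can be performed with a Broadcast variable instead, which avoids the shuffle of a regular join: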

  A broadcast variable is process-level (one copy per executor process, shared by its tasks) and read-only.

  Broadcast is suited to a small table: the table is broadcast into the memory of each node, where it is managed by the BlockManager. The RDD partitions on that node are then processed with the mapPartitions method, which obtains the broadcast content through the BlockManager and performs the join-like operation on records with the same key.
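The map-side join described above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: partitions are modeled as lists, the broadcast table as an ordinary dict, and the names `small_table`, `large_partitions`, and `join_partition` are hypothetical. In Spark, the dict would be wrapped with `SparkContext.broadcast` and the per-partition function passed to `mapPartitions`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Small table: in Spark this would be wrapped in a broadcast variable.
small_table: Dict[int, str] = {1: "books", 2: "games"}

# Large dataset, modeled as a list of partitions of (key, value) records.
# Key 1 is deliberately over-represented to mimic a skewed key.
large_partitions: List[List[Tuple[int, float]]] = [
    [(1, 9.5), (2, 20.0), (3, 1.0)],
    [(1, 3.0), (1, 4.5)],
]


def join_partition(
    records: Iterable[Tuple[int, float]], lookup: Dict[int, str]
) -> Iterator[Tuple[int, Tuple[float, str]]]:
    """Inner-join one partition against the in-memory lookup table."""
    for key, value in records:
        if key in lookup:  # keys missing from the small table are dropped
            yield key, (value, lookup[key])


# Each partition is joined on its own; no record moves between partitions.
joined = [list(join_partition(part, small_table)) for part in large_partitions]
```

Because every partition only reads the local copy of the small table, the records of a hot key stay where they are instead of being shuffled to a single reducer.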

  The map method is applied to every record in each partition of the RDD, whereas mapPartitions is applied once to each partition of the RDD and receives all of that partition's records (as an iterator). It is effectively a batch operation, comparable to working on the array that backs each cached partition.
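The difference in call granularity can be made concrete with a small plain-Python model. This is a sketch under stated assumptions, not Spark's implementation: `map_records` and `map_partitions` are hypothetical stand-ins for `map` and `mapPartitions`, and the `setup_calls` counter stands in for any per-call setup cost, such as reading the broadcast value.

```python
from typing import Callable, Dict, Iterator, List

# Counts how often each style of function is entered.
setup_calls: Dict[str, int] = {"per_record": 0, "per_partition": 0}


def map_records(
    partitions: List[List[int]], fn: Callable[[int], int]
) -> List[List[int]]:
    """Model of map: fn is called once for every record."""
    return [[fn(record) for record in part] for part in partitions]


def map_partitions(
    partitions: List[List[int]], fn: Callable[[Iterator[int]], Iterator[int]]
) -> List[List[int]]:
    """Model of mapPartitions: fn is called once per partition with an iterator."""
    return [list(fn(iter(part))) for part in partitions]


def double_one(record: int) -> int:
    setup_calls["per_record"] += 1  # setup cost paid for every record
    return record * 2


def double_all(records: Iterator[int]) -> Iterator[int]:
    setup_calls["per_partition"] += 1  # setup cost paid once per partition
    for record in records:
        yield record * 2


partitions = [[1, 2, 3], [4, 5]]
by_record = map_records(partitions, double_one)
by_partition = map_partitions(partitions, double_all)
```

Both calls produce the same result, but the per-record function is entered five times and the per-partition function only twice, which is why a lookup against a broadcast table is usually placed in the per-partition form.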

Applicable scenario: the Broadcast approach suits a small table. It does not apply when the RDD to be broadcast is very large, because that may cause OOM and puts pressure on memory: a broadcast variable stays in memory for a long time, which is a relatively heavy burden on GC, and it very easily gets promoted to the JVM old generation, where it takes up a large share of the space.
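One way to act on this limit is to check the size of the small table before choosing the broadcast route. The following plain-Python sketch is only an illustration: the 10 MB limit, `serialized_size`, and `choose_join_strategy` are assumptions made for this example, and the pickled size is a rough proxy, since the in-memory footprint on the JVM can be considerably larger. Spark SQL applies a similar idea through its `spark.sql.autoBroadcastJoinThreshold` setting.

```python
import pickle
from typing import Any

# Hypothetical limit for this sketch; tune it to the executors' memory.
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024


def serialized_size(table: Any) -> int:
    """Rough size estimate: length of the pickled table in bytes."""
    return len(pickle.dumps(table))


def choose_join_strategy(table: Any, limit: int = BROADCAST_LIMIT_BYTES) -> str:
    """Return "broadcast" for small tables, "shuffle" otherwise."""
    return "broadcast" if serialized_size(table) <= limit else "shuffle"


# A small lookup table fits comfortably under the limit.
lookup = {i: f"category-{i}" for i in range(100)}
strategy = choose_join_strategy(lookup)
```

Lowering the limit, for example `choose_join_strategy(lookup, limit=10)`, makes the same table fall back to the regular shuffle join.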

 


Origin www.cnblogs.com/ywdjx/p/2-Spark-1-performance2.html