Let's go through the code block below and draw some conclusions:
val barr1 = sc.broadcast(arr1) // broadcast an array with 1M Int elements
// barr1 is an embedded broadcast: the map closure below captures it,
// so the handle travels with each task of the RDD.
val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1.value.size)
// Collect the small RDD so we can print the observed sizes locally.
observedSizes.collect().foreach(i => println(i))
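The point of broadcasting here is that only a tiny handle is serialized into each task closure, not the 1M-element array itself. The sketch below needs no Spark at all: `BroadcastHandle`, `TaskWithArray`, and `TaskWithHandle` are hypothetical stand-ins (not Spark classes) that contrast the serialized size of a closure embedding the raw array with one embedding only a broadcast-style handle:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-ins (NOT Spark classes) for what a task closure can carry:
// either the whole array, or just a small handle to a broadcast value.
case class BroadcastHandle(id: Long)
case class TaskWithArray(data: Array[Int])
case class TaskWithHandle(handle: BroadcastHandle)

object SizeDemo {
  // Java-serialize an object and return the resulting byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val arr1 = Array.fill(1000000)(1) // the 1M-Int array from the example

    val big   = serializedSize(TaskWithArray(arr1))                 // roughly 4 MB
    val small = serializedSize(TaskWithHandle(BroadcastHandle(0L))) // a few hundred bytes at most

    println(s"array embedded in closure: $big bytes")
    println(s"broadcast handle only:     $small bytes")
  }
}
```

With 10 partitions, the naive version would ship ~4 MB ten times; the broadcast version ships the handle ten times and the array's blocks only as each executor first reads `barr1.value`.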
note:
1. If an RDD's closure embeds a broadcast variable, the broadcast handle is deserialized by the same process that deserializes the RDD's tasks. (This procedure is not shown in the figure.)
2. As a result, a bottleneck can occur at the driver when all the executors try to fetch the block data from the driver simultaneously the first time the broadcast value is read.
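Note 2 is the motivation for Spark's TorrentBroadcast (the default in later versions), which splits the value into blocks and lets executors fetch blocks from one another instead of all pulling the full value from the driver. A toy back-of-the-envelope model, with entirely hypothetical numbers (100 executors, a 400 MB broadcast, a 1000 MB/s driver NIC):

```scala
// Toy model of the driver-side bottleneck; all numbers below are hypothetical.
object BottleneckDemo {
  def driverSeconds(numExecutors: Int, broadcastMB: Double, nicMBps: Double,
                    torrent: Boolean): Double = {
    // Naive scheme: every executor pulls the full value from the driver,
    // so the driver pushes numExecutors copies through its NIC.
    // Torrent-style scheme (rough model): once a block has left the driver,
    // executors re-serve it to each other, so the driver sends about one copy.
    val driverMB = if (torrent) broadcastMB else numExecutors * broadcastMB
    driverMB / nicMBps
  }

  def main(args: Array[String]): Unit = {
    val naive   = driverSeconds(100, 400.0, 1000.0, torrent = false)
    val torrent = driverSeconds(100, 400.0, 1000.0, torrent = true)
    println(f"naive broadcast:   driver busy ~$naive%.1f s")   // ~40 s
    println(f"torrent broadcast: driver busy ~$torrent%.1f s") // ~0.4 s
  }
}
```

The model ignores block scheduling and peer bandwidth, but it captures why the first simultaneous fetch is the painful moment and why spreading the serving load across executors removes it.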
refer:
Spark源码系列(五)分布式缓存 (Spark source code series, part 5: distributed cache)