shared variable
broadcast variable
Questions raised:
When a local list object is used together with an RDD, the list is sent to every partition on every Executor, so an Executor may end up holding multiple copies of the same data. But since an Executor is a process, and resources within a process are shared, a single copy per Executor is enough. We can achieve this with a broadcast variable. Specific operation:
Mark the local list object as a broadcast variable object:
```python
# local list object (named lst to avoid shadowing the built-in `list`)
lst = [1, 2, 3]
# wrap it as a broadcast variable object
broadcast = sc.broadcast(lst)
# use the broadcast variable: take the local list back out of it
value = broadcast.value
```
Because the local data object is wrapped in a broadcast variable before transmission, Spark sends only one copy to each Executor, and every thread (partition) inside that Executor shares it.
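The memory saving can be illustrated with a pure-Python analogy (no Spark required), using `pickle` to stand in for Spark's serialization. The names and the partition count of 4 here are illustrative, not Spark internals:

```python
import pickle

big_table = {i: i * i for i in range(1000)}  # a local collection on the Driver

# Without broadcast: each of the 4 partitions receives its own serialized copy.
per_partition = [pickle.loads(pickle.dumps(big_table)) for _ in range(4)]

# With broadcast: the Executor keeps a single deserialized copy,
# and its threads (partitions) all read that same object.
executor_copy = pickle.loads(pickle.dumps(big_table))
per_thread = [executor_copy for _ in range(4)]

print(sum(obj is not big_table for obj in per_partition))  # 4 independent copies
print(sum(obj is executor_copy for obj in per_thread))     # 4 refs, 1 shared copy
```

Four partitions without broadcast mean four deserialized copies in one process; with broadcast they are four references to the same object.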
Applicable scenario:
When a local collection object is used together with a distributed collection (RDD), mark the local collection as a broadcast variable. This reduces the number of IO transmissions and the Executor's memory usage.
Why use a local collection object at all?
When the collection is small, keeping it as a local collection improves performance and avoids a large amount of shuffle.
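The classic case is a map-side join: instead of shuffling a large RDD to join it with a small table, the small table is broadcast and looked up inside a map function. Below is a minimal pure-Python sketch of that idea (the `stu_info`/`scores` data is made up; in real PySpark the dict would be wrapped with `sc.broadcast(...)` and read via `.value` inside the map function):

```python
# small lookup table that would be broadcast to every Executor
stu_info = {1: "Alice", 2: "Bob", 3: "Carol"}

# large dataset of (student_id, score) records, here just a plain list
scores = [(1, 90), (2, 85), (1, 70), (3, 99)]

def map_join(record):
    sid, score = record
    # each record is enriched by a plain dict lookup: no shuffle needed
    return (stu_info[sid], score)

joined = [map_join(r) for r in scores]
print(joined)  # [('Alice', 90), ('Bob', 85), ('Alice', 70), ('Carol', 99)]
```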
accumulator
Question raised: how do we count and accumulate data inside a map operator computation?
Code:
```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test").setMaster("local[*]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

count = 0  # used for counting

def map_func(data):
    global count
    count += 1
    print(count)

rdd.map(map_func).collect()
print(count)
```
The final output is 0. This is because count is defined on the Driver. When the map operator needs the count object (for count += 1), the Driver makes a copy of it and sends the copy to each Executor (note: it sends a copy, not a memory address). So the count accumulated inside the Executors has nothing to do with the count on the Driver, and the final print still shows the Driver's object. How to solve this problem? Use an accumulator.
Just replace the count object in the above code with an accumulator object:
```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test").setMaster("local[*]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

# accumulator variable provided by Spark; the argument is the initial value
acmlt = sc.accumulator(0)

def map_func(data):
    global acmlt
    acmlt += 1

rdd.map(map_func).collect()
print(acmlt.value)  # 10
```
The accumulator object collects the increments from each Executor and applies them back to itself on the Driver (conceptually like a shared reference), so the Driver sees the accumulated value.
Precautions:
The accumulator code may run multiple times if the RDD lineage is rebuilt (for example, a second action triggers recomputation of an un-cached RDD), causing double counting.
Solution: use cache or checkpoint so the RDD is not recomputed.
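The double-counting effect can be sketched in plain Python, with a rebuilt list comprehension playing the role of an un-cached RDD lineage and a materialized list playing the role of cache (the names `lineage`/`counting_map` are illustrative):

```python
count = 0

def counting_map(x):
    global count
    count += 1
    return x * 2

def lineage():
    # like an un-cached RDD: every "action" rebuilds from the source
    return [counting_map(x) for x in range(5)]

lineage()               # first action: 5 increments
lineage()               # second action recomputes: 5 more increments
uncached_total = count
print(uncached_total)   # 10: the accumulator was double-counted

count = 0
cached = lineage()      # "cache": compute once and keep the materialized result
_ = cached              # later actions reuse the cached data, no recomputation
cached_total = count
print(cached_total)     # 5: each element counted exactly once
```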