Getting Started with PySpark Basics (4): RDD Shared Variables

Shared Variables

Broadcast Variables

The problem:

When a local list object is used together with an RDD (a distributed data object), the local list is shipped to every partition running on the Executors, so a single Executor may end up holding several copies of the same data. Since an Executor is a process and resources inside a process are shared, keeping just one copy per Executor is enough; a broadcast variable lets us achieve exactly that.

Specific operation:

Mark the local list object as a broadcast variable:

# local list object
local_list = [1, 2, 3]
# wrap it as a broadcast variable
broadcast = sc.broadcast(local_list)
# use the broadcast variable, i.e. take the local list back out of it
value = broadcast.value

Because the local data has been wrapped in a broadcast variable object before it is transmitted, Spark sends only one copy to each Executor, and every thread (partition) inside that Executor can share this single copy.

Applicable scenario:

When a local collection object needs to be used together with a distributed collection (RDD), mark the local collection as a broadcast variable. This reduces the number of IO transfers and the memory usage of each Executor.

Why use local collection objects?

When the collection is small enough to fit comfortably in memory, using a local collection object (instead of a second RDD) improves performance and avoids a large amount of shuffle.
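As a concrete illustration, here is a minimal sketch of this scenario, assuming a local Spark setup; the lookup table, the scores data, and the names id_to_name, bc_names, scores are invented for this example:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("broadcast_demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # small local lookup table: id -> name (illustrative data)
    id_to_name = {1: "zhangsan", 2: "lisi", 3: "wangwu"}
    bc_names = sc.broadcast(id_to_name)

    # distributed data: (id, score) pairs
    scores = sc.parallelize([(1, 90), (2, 80), (3, 70), (1, 60)], 2)

    # each task reads the single per-Executor copy through bc_names.value,
    # so no shuffle-based join is needed
    result = scores.map(lambda kv: (bc_names.value[kv[0]], kv[1])).collect()
    print(result)  # [('zhangsan', 90), ('lisi', 80), ('wangwu', 70), ('zhangsan', 60)]

Doing the same lookup by turning id_to_name into a second RDD and calling join would shuffle data across the cluster, which is exactly what the broadcast variable avoids.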

Accumulators

The problem: how can we count and accumulate the data processed inside a map operator?

Code:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

    count = 0  # used for counting

    def map_func(data):
        global count
        count += 1       # increments a copy of count held by the executor
        print(count)
        return data

    rdd.map(map_func).collect()
    print(count)         # prints 0 on the driver

The final output is 0. The reason is that count is defined on the driver. When the map operator needs the count object (for count += 1), the driver sends a copy of count to each executor (a copy is sent, not the memory address). The executors therefore increment their own copies, which has nothing to do with the count object on the driver, so the final print still shows the driver's unchanged value.

How do we solve this problem? Use an accumulator.

Just replace the count object in the above code with an accumulator object:

    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

    # accumulator variable provided by Spark; the argument is the initial value
    acmlt = sc.accumulator(0)

    def map_func(data):
        global acmlt
        acmlt += 1
        return data

    rdd.map(map_func).collect()
    print(acmlt.value)   # 10

The accumulator object collects the updates made on each executor and merges them back into itself on the driver (conceptually similar to a memory pointer rather than a copied value).

Precautions:

Because an RDD may be rebuilt (recomputed) when it is reused by a later action, the accumulator code inside the transformation may run more than once and over-count.

Solution: cache or checkpoint the RDD before reusing it (see the sketch below).
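Here is a minimal sketch of the pitfall and the fix; the names acmlt_demo, count_func, rdd2 and the sample data are made up for illustration:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("acc_recompute_demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    acmlt_demo = sc.accumulator(0)

    def count_func(data):
        acmlt_demo.add(1)   # Accumulator.add() avoids the need for `global`
        return data

    rdd2 = sc.parallelize([1, 2, 3, 4, 5], 2).map(count_func)

    rdd2.collect()
    rdd2.collect()              # without cache, map is recomputed and count_func runs again
    print(acmlt_demo.value)     # 10, not 5

    # caching the RDD before reusing it avoids the recomputation:
    # rdd2 = sc.parallelize([1, 2, 3, 4, 5], 2).map(count_func).cache()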


Origin blog.csdn.net/qq_51235856/article/details/130470524