PySpark
Note: If you find this blog useful, please like and bookmark it. I publish content on artificial intelligence and big data every week, most of it original: Python, Java, Scala, and SQL code; CV, NLP, and recommendation systems; Spark, Flink, Kafka, HBase, Hive, Flume, and more, plus interpretations of top-conference papers. Let's make progress together.
Today I continue sharing with you PySpark Basic Introduction 6.
#博学谷IT Learning Technical Support
Foreword
Today's topic is Spark RDD shared variables: broadcast variables and accumulators.
1. Broadcast variables
A broadcast variable is defined on the Driver side. Without broadcast variables, every Task that uses the variable must pull its own copy at runtime, which wastes network bandwidth and memory and hurts efficiency.
With broadcast variables, one copy of the variable is placed on each Executor, and every thread on that Executor reads it directly instead of pulling a separate copy into each Task. Fewer copies means less network and memory overhead, and therefore better efficiency.
Broadcast variables are read-only: each Task can read the value but cannot modify it.
Related APIs:
Create a broadcast variable: broadcast_var = sc.broadcast(value)
Read a broadcast variable: broadcast_var.value
1. How to use
from pyspark import SparkContext, SparkConf
import os

# Pin the remote environment so that the driver and executors are consistent
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("Demonstrating broadcast variable usage")

    # 1- Create the SparkContext object
    conf = SparkConf().setAppName('sougou').setMaster('local[*]')
    sc = SparkContext(conf=conf)

    # a = 100
    broadcast = sc.broadcast(100)

    # 2- Initialize the data
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])

    # 3- Process the data
    # Requirement: add a fixed value to every element
    def fn1(num):
        return num + broadcast.value

    rdd_res = rdd.map(fn1)
    print(rdd_res.collect())
2. Accumulator
Spark provides accumulators to implement global counting or summing, for example counting how many records have been processed across the whole job.
The Driver sets the accumulator's initial value, each Task performs the accumulation, and the final result is read back on the Driver.
Tasks can only add to the accumulator; they cannot read its value.
Related APIs:
1- Set the accumulator's initial value on the Driver side:
acc = sc.accumulator(initial value)
2- Inside a Task (RDD operation), perform the accumulation:
acc.add(accumulated value)
3- Read the value on the Driver:
acc.value
1. How to use
from pyspark import SparkContext, SparkConf
import os

# Pin the remote environment so that the driver and executors are consistent
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("Demonstrating accumulator operations:")

    # 1- Create the SparkContext object
    conf = SparkConf().setAppName('sougou').setMaster('local[*]')
    sc = SparkContext(conf=conf)

    # Define a variable to accumulate into
    # agg = 0
    acc = sc.accumulator(0)

    # 2- Initialize the data
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])

    # 3- Perform the operations
    # Requirement: return each element +1, and while doing so
    # count how many elements were incremented in total
    def fn1(num):
        acc.add(1)
        return num + 1

    rdd_res = rdd.map(fn1)

    # 4- Get the results
    print(rdd_res.collect())
    print(acc.value)
If multiple action operators are called afterwards, the accumulator will accumulate repeatedly.
Root cause: every action call triggers a new Job, and each Job recomputes the entire chain of RDDs it depends on, so the accumulator updates run again.
Solution:
Cache the RDD produced after the accumulation; this solves the problem. Note that setting a checkpoint alone does not help: checkpointing cannot be combined directly with accumulators, but the cache-plus-accumulator approach works.
Summary
Today I shared with you the two kinds of RDD shared variables: broadcast variables and accumulators.