The Python environment on a cluster usually lacks the packages a job needs. PySpark's SparkContext provides a pyFiles parameter for shipping third-party code to the executors; the entries can be .py files you wrote yourself or .whl files. For example, the parallel-computation test below needs the following three packages: add, mult, and pandas.
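The test code below imports two self-written modules, add and mult. Their exact contents are not shown here; a minimal sketch consistent with the calls add.add(len(x), 1) and mult.mult(a, 2) might look like this (the function bodies are assumptions):

# add.py -- hypothetical body; only the call add.add(a, b) appears in the test code
def add(a, b):
    return a + b

# mult.py -- hypothetical body; only the call mult.mult(a, b) appears in the test code
def mult(a, b):
    return a * b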
Pack the three directly into a single package.zip. Remember that the archive must be in zip format.
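Any zip tool will do; as a sketch, Python's standard zipfile module also works. The paths below mirror the D:/pysparktest layout used in the test code and are otherwise assumptions. The .py files should sit at the root of the archive so that import add and import mult resolve on the executors:

import zipfile

# Write each module at the archive root (arcname strips the folder);
# otherwise the zip entries would not be importable as top-level modules.
with zipfile.ZipFile("D:/pysparktest/package.zip", "w") as zf:
    zf.write("D:/pysparktest/add.py", arcname="add.py")
    zf.write("D:/pysparktest/mult.py", arcname="mult.py")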
Test code:
from pyspark import SparkConf, SparkContext
import add
import mult
import traceback
import os
import pandas as pd

def getResult(x):
    a = add.add(len(x), 1)  # custom function add used in the parallel computation
    b = mult.mult(a, 2)     # custom function mult used in the parallel computation
    b = pd.to_datetime(b)   # pandas used in the parallel computation
    return b
if __name__ == '__main__':
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"  # point PySpark on the cluster at Python 3
    appname = "test"
    master = "spark://XXX.XXX.XX.XX:XXXX"  # "spark://host:port"
    spark_driver_host = "XXX.XXX.XX.XX"  # IP of the local driver machine
    pyFiles = ["D:/pysparktest/package.zip"]  # path to the zipped package
    '''
    Alternatively:
    pyFiles = ["D:/pysparktest/add.py", "D:/pysparktest/mult.py", "D:/pysparktest/pandas.py"]
    '''
    sc = None  # so the except branch only stops the context if it was created
    try:
        conf = SparkConf().setAppName(appname).setMaster(master).set("spark.driver.host", spark_driver_host)
        sc = SparkContext(conf=conf, pyFiles=pyFiles)
        words = sc.parallelize(
            ["scala",
             "java",
             "hadoop",
             "spark",
             "akka",
             "spark vs hadoop",
             "pyspark",
             "pyspark and spark"])
        result = words.map(lambda x: getResult(x)).collect()
        print(result)
        sc.stop()
        print('Computation succeeded!')
    except Exception:
        if sc is not None:
            sc.stop()
        traceback.print_exc()  # print the error details
        print('Connection failed!')
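pyFiles is supplied when the SparkContext is constructed; if a dependency has to be shipped after the context already exists, SparkContext.addPyFile provides the same mechanism at runtime. A minimal sketch, reusing the package.zip path from above:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test").setMaster("spark://XXX.XXX.XX.XX:XXXX")
sc = SparkContext(conf=conf)
sc.addPyFile("D:/pysparktest/package.zip")  # distributed to the executors, like pyFiles

When submitting with spark-submit, the --py-files flag plays the same role.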
Run result: