Copyright notice: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/chao2016/article/details/82914754
1. Development Tools
- Java
- spark-2.3.0-bin-2.6.0-cdh5.7.0
- PyCharm
2. Spark Configuration
- spark-env.sh
JAVA_HOME=/Users/chao/.jenv/candidates/java/current/
- slaves
localhost
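Concretely, both settings go into files under Spark's conf/ directory. A minimal sketch of their contents, assuming the install layout used throughout this post:

```shell
# conf/spark-env.sh -- environment picked up by Spark's launch scripts
JAVA_HOME=/Users/chao/.jenv/candidates/java/current/

# conf/slaves -- one worker host per line; a single local worker here
localhost
```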
3. PyCharm Configuration
3.1 Set the Launch Parameters
- Create a new Python project and add a .py file to it
- Run -> Edit Configurations -> Configuration -> Environment Variables -> add the following:
PYTHONPATH=/Users/chao/Documents/app/spark-2.3.0-bin-2.6.0-cdh5.7.0/python
SPARK_HOME=/Users/chao/Documents/app/spark-2.3.0-bin-2.6.0-cdh5.7.0
3.2 Add the Spark Packages
PyCharm -> Preferences -> Project -> Project Structure -> Add Current Root
Add the two archives under the same directory:
/Users/chao/Documents/app/spark-2.3.0-bin-2.6.0-cdh5.7.0/python/lib/py4j-0.10.6-src.zip
/Users/chao/Documents/app/spark-2.3.0-bin-2.6.0-cdh5.7.0/python/lib/pyspark.zip
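If you prefer not to click through PyCharm's settings, the same wiring can be done at the top of the script itself. A minimal sketch, assuming the install path used above (adjust SPARK_HOME to your machine):

```python
import os
import sys

# Install location from this post's setup -- replace with your own.
SPARK_HOME = "/Users/chao/Documents/app/spark-2.3.0-bin-2.6.0-cdh5.7.0"

os.environ["SPARK_HOME"] = SPARK_HOME

# The same two archives PyCharm adds via Project Structure:
for archive in ("py4j-0.10.6-src.zip", "pyspark.zip"):
    path = os.path.join(SPARK_HOME, "python", "lib", archive)
    if path not in sys.path:
        sys.path.insert(0, path)
```

With this in place, `from pyspark import SparkConf, SparkContext` resolves without any IDE-specific configuration.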
4. Testing
from pyspark import SparkConf, SparkContext
# Create a SparkConf: holds the Spark-related configuration
conf = SparkConf().setMaster("local[2]").setAppName("spark0301")
# Create the SparkContext
sc = SparkContext(conf=conf)
# Business logic
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData.collect())
sc.stop()
Click Run; the output shows:
[1, 2, 3, 4, 5]
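For intuition: sc.parallelize splits the list into partitions that local[2]'s two worker threads process in parallel, and collect gathers them back. A plain-Python sketch of the contiguous, roughly equal slicing (an illustration, not Spark's actual implementation):

```python
def partition(data, num_slices):
    # Split a list into num_slices contiguous, roughly equal chunks,
    # mimicking how parallelize spreads elements over partitions.
    n = len(data)
    return [data[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]

# The demo list spread across the 2 local threads:
print(partition([1, 2, 3, 4, 5], 2))  # → [[1, 2], [3, 4, 5]]
```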
5. Running on a Cluster
- Simply pass the .py file to spark-submit (in place of a jar).
spark-submit --master local[2] --name spark0301 /root/script/spark0301.py