Copyright notice: this is an original article by the blogger; do not repost without the blogger's permission. https://blog.csdn.net/qq_41672744/article/details/79540142
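Notes on setting up a Python + Spark environment on Windows 7
1. Install Anaconda 3, Python 3.6, and Java
2. Install Spark by extracting it to d:\spark, and Hadoop to d:\hadoop
3. In Anaconda3, add the Spark environment
4. In My Computer > Advanced, add the environment variables
System variables: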
HADOOP_HOME=D:\HADOOP
SPARK_HOME=D:\SPARK
PATH=C:\ProgramData\Oracle\Java\javapath;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%\System32\WindowsPowerShell\v1.0\;%SPARK_HOME%\bin;%Path%;D:\Spark\bin
User variables:
path=%JAVA_HOME%\bin;%SPARK_HOME%\bin;%SPARK_HOME%\sbin;%HADOOP_HOME%\bin
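Before moving on to the test, it can help to confirm that the variables above are actually visible to Python. The following is a minimal sketch, not part of the original notes: check_spark_env is a hypothetical helper name, and JAVA_HOME is included only because the user path refers to it, although its value is not listed here. Open a new console (or restart Spyder) after editing the variables, since running programs keep the old values.

```python
import os


def check_spark_env(env, names=("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")):
    """Return {name: status}; status is 'missing', 'not a directory', or 'ok'."""
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "missing"          # variable is unset or empty
        elif not os.path.isdir(value):
            report[name] = "not a directory"  # set, but the folder does not exist
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    # Check the environment of the current process.
    for name, status in check_spark_env(os.environ).items():
        print(name, "->", status)
```

Any entry that is not "ok" points to a variable that needs fixing before Spark is started.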
Test:
Open Spyder and run:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("First_App")
sc = SparkContext(conf=conf)
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData)
Output:
Reloaded modules: pyspark, pyspark.conf, pyspark.context, pyspark.accumulators, pyspark.cloudpickle, pyspark.util, pyspark.serializers, pyspark.broadcast, pyspark.files, pyspark.java_gateway, pyspark.find_spark_home, pyspark.storagelevel, pyspark.rdd, pyspark.join, pyspark.resultiterable, pyspark.statcounter, pyspark.rddsampler, pyspark.shuffle, pyspark.heapq3, pyspark.traceback_utils, pyspark.status, pyspark.profiler, pyspark.taskcontext, pyspark.version, pyspark._globals, pyspark.sql, pyspark.sql.types, pyspark.sql.context, pyspark.sql.session, pyspark.sql.conf, pyspark.sql.dataframe, pyspark.sql.column, pyspark.sql.readwriter, pyspark.sql.utils, pyspark.sql.streaming, pyspark.sql.udf, pyspark.sql.catalog, pyspark.sql.group, pyspark.sql.window
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:175
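The last line of the output is the RDD's description, not its contents: printing an RDD does not show its elements, and an action such as collect() is needed to bring them back. A minimal sketch of that, assuming the same local setup as the test; describe_rdd and main are hypothetical helper names. SparkContext.getOrCreate is used so that re-running in the same Spyder console reuses the existing context instead of trying to create a second one.

```python
def describe_rdd(rdd):
    """Run a few Spark actions and return their results as a dict."""
    return {
        "items": rdd.collect(),  # all elements, returned to the driver as a list
        "count": rdd.count(),    # number of elements
        "sum": rdd.sum(),        # sum of the elements
    }


def main():
    # Imported here so describe_rdd can be loaded even where pyspark is absent.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("First_App")
    sc = SparkContext.getOrCreate(conf)  # reuse a context left over from an earlier run
    print(describe_rdd(sc.parallelize([1, 2, 3, 4, 5])))
```

Call main() from the Spyder console to run it against the local Spark installation.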
Initial success. There is still a Java version problem; to be continued.
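These notes do not record what the Java version problem was. One thing worth checking is which Java the PATH actually resolves to. Spark 2.x was documented as running on Java 8, so a newer JVM is one possible cause (an assumption, not something confirmed here). A minimal sketch follows; java_major_version and installed_java_major are hypothetical helper names.

```python
import re
import subprocess


def java_major_version(version_text):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Old scheme: "1.8.0_161" means Java 8; new scheme: "11.0.2" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` (it writes to stderr) and parse the result."""
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text; works on Python 3.6
    )
    return java_major_version(result.stderr)
```

Calling installed_java_major() in the Spyder console reports the version of whichever java executable is found first on the PATH.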
Follow-up fixes:
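1. Add Python to the environment variables: "C:\Users\...\AppData\Local\Programs\Python\Python36"
2. Install setuptools-28.6.0: open cmd in its directory and run python setup.py install
3. In the pyspark directory, open cmd and run python setup.py install
After this, import pyspark succeeds in Spyder.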
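As an alternative to the install in item 3, the copy of pyspark that ships inside the Spark folder can be put on sys.path at the top of a script. This is a sketch under the assumption that the Spark distribution has the usual python and python\lib\py4j-*.zip layout; pyspark_import_paths and add_pyspark_to_sys_path are hypothetical helper names.

```python
import glob
import os
import sys


def pyspark_import_paths(spark_home):
    """Return the folder and py4j zip(s) under spark_home needed to import pyspark."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib; the version in the file name varies.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips


def add_pyspark_to_sys_path(spark_home):
    """Prepend the paths to sys.path, skipping any that are already present."""
    for path in pyspark_import_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
```

For example, add_pyspark_to_sys_path(os.environ["SPARK_HOME"]) followed by import pyspark, run before any other Spark code in the session.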