First, the tools to prepare
1. jdk1.8
2. scala
3. anaconda3
4. spark-2.3.1-bin-hadoop2.7
5. hadoop-2.8.3
6. winutils
7. pycharm
Second, the installation
1. jdk installation
Download from the Oracle official website. After installation, configure JAVA_HOME and CLASSPATH, and add the bin directory to PATH. Note: under Windows 10 it is best to use absolute paths in PATH! The same applies below!
2. scala installation
Download from the official website. After installation, configure SCALA_HOME and add the bin directory to PATH
3. anaconda3 installation
Download from the official website; during installation, make sure to tick the "Add to PATH" checkbox
4. spark install
Download the compressed package from the official website; after decompression, configure SPARK_HOME and add the bin directory to PATH
5. hadoop installation
Download from the official website; the Hadoop version must be >= the version your Spark build corresponds to. After decompression, configure HADOOP_HOME and add the bin directory to PATH
6. winutils installation
Download from https://github.com/steveloughran/winutils , choosing the build that matches your Hadoop version
7. pycharm installation
Download the paid version and activate it with a lanyu registration code; note that, as the activation prompts say, the domain must be added to the hosts file
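After finishing the installation steps above, it is worth confirming that all the environment variables are actually set before moving on. A minimal sketch in Python (the variable names are the ones configured above; the example JDK path is hypothetical):

```python
import os

# Environment variables configured in the installation steps above
REQUIRED_VARS = ['JAVA_HOME', 'SCALA_HOME', 'SPARK_HOME', 'HADOOP_HOME']

def missing_vars(env=None):
    """Return the required variables that are not set in the given mapping."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_VARS if not env.get(v)]

# Hypothetical example: an environment where only JAVA_HOME is set
print(missing_vars({'JAVA_HOME': r'C:\Program Files\Java\jdk1.8.0_171'}))
# -> ['SCALA_HOME', 'SPARK_HOME', 'HADOOP_HOME']
```

Running `missing_vars()` with no argument checks the real environment; an empty list means all four variables are in place.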
Third, the related processing python
- Copy the pyspark folder (found in spark-2.3.1-bin-hadoop2.7\python) into the anaconda3\Lib\site-packages directory
- Replace the bin directory of the decompressed Hadoop with the bin directory of the matching winutils version
- conda install py4j
- Go to the hadoop\bin directory, open cmd as administrator, and run: winutils.exe chmod 777 c:\tmp\Hive. If an error is reported, check whether the Hive directory exists; if it does not, create it manually and run the command again
Fourth, verification
Open PyCharm, select the Python from Anaconda as the interpreter, then enter and run the following code:
```python
from pyspark import SparkContext

sc = SparkContext('local')
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for w, i in zip(words, range(len(words)))}
word_dict_b = sc.broadcast(word_dict)

def wordCountPerDoc(d):
    counts = {}
    wd = word_dict_b.value
    for w in d:
        if wd[w] in counts:
            counts[wd[w]] += 1
        else:
            counts[wd[w]] = 1
    return counts

print(doc.map(wordCountPerDoc).collect())
print("successful!")
```
Output:
[{0: 1, 1: 1, 2: 1}, {1: 1, 3: 2}]
successful!
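To see what the Spark job above computes, the same word-id counting logic can be reproduced in plain Python with no Spark installation; note that Spark's distinct() does not guarantee the first-seen ordering this sketch uses, so the integer ids in your output may be assigned differently:

```python
from itertools import chain

docs = [['a', 'b', 'c'], ['b', 'd', 'd']]

# Distinct words in first-seen order (flatMap + distinct in the Spark version)
words = list(dict.fromkeys(chain.from_iterable(docs)))
word_dict = {w: i for i, w in enumerate(words)}  # plays the role of the broadcast dict

def word_count_per_doc(d):
    # Count occurrences per document, keyed by the word's integer id
    counts = {}
    for w in d:
        counts[word_dict[w]] = counts.get(word_dict[w], 0) + 1
    return counts

print([word_count_per_doc(d) for d in docs])
# -> [{0: 1, 1: 1, 2: 1}, {1: 1, 3: 2}]
```

This makes it easy to verify the expected result before debugging the full PySpark setup.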
This article covers setting up a single-machine test environment with Win10 + PySpark + PyCharm + Anaconda.