win10 + pyspark + pycharm + anaconda: building a standalone environment

First, the tools to prepare

1. jdk1.8

2. scala

3. anaconda3

4. spark-2.3.1-bin-hadoop2.7

5. hadoop-2.8.3

6. winutils

7. pycharm

Second, the installation

1. JDK installation

Download from the Oracle website. After installation, configure JAVA_HOME and CLASSPATH, and add the bin directory to PATH. Note: under win10 it is best to use absolute paths in PATH! The same applies below.
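As a quick sanity check, something like the following sketch can be run from a freshly opened terminal once the variables are set (it only reads JAVA_HOME and calls java -version; the exact paths printed depend on where you installed the JDK):

import os
import subprocess

# JAVA_HOME should point at the JDK install directory chosen during installation
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))

# java should resolve through PATH; "java -version" prints the runtime version (to stderr)
subprocess.run(["java", "-version"])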

2. Scala installation

Download from the official website. After installation, configure SCALA_HOME and add the bin directory to PATH (SCALA_HOME, together with SPARK_HOME and HADOOP_HOME from the later steps, is checked in the sketch after step 5).

3. Anaconda3 installation

Download from the official website; during installation, make sure the "add to PATH" checkbox is ticked.
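Since PyCharm will later use the Anaconda interpreter, it is worth confirming which python launches by default; this minimal check only reports the interpreter path and version (the install path is whatever you chose during setup):

import sys

# The path should point into your Anaconda3 installation, not some other Python on the system
print(sys.executable)
print(sys.version)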

4. Spark installation

Download the compressed package from the official website, extract it, configure SPARK_HOME, and add the bin directory to PATH.

5. hadoop installation

Official website to download version> = spark hadoop version corresponds, after decompression HADOOP_HOME arranged, appended to the PATH bin directory (FIG comprising a)
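Once steps 2, 4 and 5 are done, the three HOME variables can be checked together. The sketch below only reads the environment and reports whether each directory exists; it assumes nothing beyond the variable names used above:

import os

# Each *_HOME variable should point at the extracted/installed directory,
# and its bin subdirectory should also have been appended to PATH.
for name in ("SCALA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    value = os.environ.get(name)
    exists = value is not None and os.path.isdir(value)
    print(f"{name} = {value!r} (directory exists: {exists})")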

6. winutils installation

Download from https://github.com/steveloughran/winutils , picking the build that matches your Hadoop version.

7. PyCharm installation

Download the paid (Professional) edition and activate it with the lanyu registration code; note that you need to add the relevant domain to your hosts file as prompted.

Third, Python-related processing

  1. Copy the pyspark folder (found in spark-2.3.1-bin-hadoop2.7\python) into the anaconda3\Lib\site-packages directory
  2. Replace the bin directory of the extracted Hadoop with the bin directory from the winutils download that matches your Hadoop version
  3. conda install py4j
  4. Go to the hadoop\bin directory, open cmd as administrator, and run: winutils.exe chmod 777 c:\tmp\Hive. If the command reports an error, check whether the c:\tmp\Hive directory exists; if not, create it manually and run the command again. (A combined check of these steps is sketched below.)
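A rough way to confirm the items above took effect: the snippet below tries to import py4j and pyspark, then checks that winutils.exe and the c:\tmp\Hive directory are where the steps above put them (the paths are the defaults from this guide; adjust them if yours differ):

import importlib
import os

# py4j comes from "conda install py4j"; pyspark from the folder copied into site-packages
for module in ("py4j", "pyspark"):
    try:
        importlib.import_module(module)
        print(module, "import OK")
    except ImportError as e:
        print(module, "import FAILED:", e)

# winutils.exe should now sit in %HADOOP_HOME%\bin after replacing the bin directory
hadoop_home = os.environ.get("HADOOP_HOME", "")
print("winutils.exe present:", os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")))

# The directory that "winutils.exe chmod 777 c:\tmp\Hive" operates on
print(r"c:\tmp\Hive exists:", os.path.isdir(r"c:\tmp\Hive"))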

Fourth, verification

Open PyCharm, set the Python from Anaconda as the interpreter, enter the following code and run it:

from pyspark import SparkContext

# Start a local, single-machine SparkContext
sc = SparkContext('local')

# Two small "documents", each represented as a list of words
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])

# Collect the distinct words and give each one an integer index
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for i, w in enumerate(words)}

# Broadcast the word -> index dictionary to the workers
word_dict_b = sc.broadcast(word_dict)

def wordCountPerDoc(d):
    # Count occurrences within one document, keyed by each word's index
    counts = {}
    wd = word_dict_b.value
    for w in d:
        if wd[w] in counts:
            counts[wd[w]] += 1
        else:
            counts[wd[w]] = 1
    return counts

print(doc.map(wordCountPerDoc).collect())
print("successful!")

sc.stop()

  Result of the run (each dictionary maps a word's index from the broadcast dictionary to its count in that document):

[{0: 1, 1: 1, 2: 1}, {1: 1, 3: 2}]
successful!
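If the script fails before printing the result, the environment variables from the installation section and the winutils / c:\tmp\Hive steps from the Python-related processing section are the first things to re-check.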

 

This article covers setting up a standalone win10 + pyspark + pycharm + anaconda test environment.

Origin www.cnblogs.com/tianqizhi/p/11271812.html