Building PySpark on Windows 10 (detailed guide)

Component versions

Java JDK: 1.8.0_144

spark-2.4.3-bin-hadoop2.7
hadoop-2.7.7

scala-2.12.8

hadooponwindows-master

Python3.7

Precautions:

Spark 2.4.3 runs on Java 8+, Python 2.7+/3.4+, and R 3.1+. For the Scala API, Spark 2.4.3 uses Scala 2.12, so you need a compatible Scala version (2.12.x).

1. Install the JDK

After downloading and installing, configure the environment variables: right-click Computer -> Properties -> Advanced system settings -> Environment Variables, set JAVA_HOME to the JDK install directory and add %JAVA_HOME%\bin to Path.

After the configuration, open a cmd window and run java -version to verify.
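A minimal sketch (not part of the original steps) to confirm from Python that JAVA_HOME is set and java is reachable on the PATH:

import os
import subprocess

# JAVA_HOME should point at the JDK install directory (jdk1.8.0_144 in this guide)
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))

# "java -version" reports to stderr; a zero return code means java is on the PATH
subprocess.run(["java", "-version"], check=True)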

2. Configure Scala

Download link:

https://www.scala-lang.org/download/2.12.8.html

After the installation completes, configure the environment variables (set SCALA_HOME to the Scala install directory and add %SCALA_HOME%\bin to Path).
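As a quick sanity check (a sketch, not from the original post), the Scala installation can also be verified from Python:

import subprocess

# scala on Windows is a .bat script, so run it through the shell;
# it should print something like "Scala code runner version 2.12.8"
subprocess.run("scala -version", shell=True)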

3. Install Spark

Download link:

http://spark.apache.org/downloads.html

After extracting, configure the environment variables (set SPARK_HOME to the extracted spark-2.4.3-bin-hadoop2.7 directory and add %SPARK_HOME%\bin to Path).
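A small sketch (assuming the usual setup, with SPARK_HOME pointing at the extracted folder) to confirm the variable is visible and the launch scripts are in place:

import os

# SPARK_HOME should be the extracted directory, e.g. D:\IT\bigdata\soft\spark-2.4.3-bin-hadoop2.7
spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME =", spark_home)

# On Windows the bin folder contains spark-submit.cmd, pyspark.cmd, etc.
if spark_home:
    print(os.listdir(os.path.join(spark_home, "bin")))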

4. Install Hadoop

Download link:

http://hadoop.apache.org/releases.html

After extracting, configure the environment variables (set HADOOP_HOME to the hadoop-2.7.7 directory and add %HADOOP_HOME%\bin to Path); a quick check of this appears after step 6 below.

5. Install Python 3.7

6. Overwrite the bin directory of hadoop-2.7.7 with the bin directory from hadooponwindows-master (copy its contents in, replacing the existing files).
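A minimal check (assuming HADOOP_HOME was set in step 4) that the overwrite worked and winutils.exe is now in Hadoop's bin directory:

import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

# After copying the hadooponwindows-master files, winutils.exe should be present here
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present:", os.path.exists(winutils))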

 

7. Python-related setup

1. Copy the pyspark folder from the python directory under Spark (mine is D:\IT\bigdata\soft\spark-2.4.3-bin-hadoop2.7\python) into Python's site-packages folder (mine is D:\IT\python\Python\Lib\site-packages).

2. Install the py4j library

Normally, running pip install py4j at the cmd command line is enough. If the pip path has not been added to Path, change directory into Python's Scripts folder first and then run pip install py4j to install the library. (A quick import check appears at the end of this step.)

3. Fix permissions

Put the winutils.exe file into Hadoop's bin directory (mine is E:\spark\spark-2.1.0-bin-hadoop2.7\bin), then open cmd as administrator, cd into Hadoop's bin directory, and run the following command:

winutils.exe chmod 777 c:\tmp\Hive
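With steps 1 and 2 above done, a quick import from any Python session confirms that the copied pyspark package and the pip-installed py4j are both visible (a sanity-check sketch, not part of the original post):

import py4j
import pyspark

# Both imports succeeding means site-packages now contains pyspark and py4j
print("pyspark", pyspark.__version__)
print("py4j imported from", py4j.__file__)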
8. Start
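To start, open a new cmd window and run pyspark. If the environment variables are picked up correctly, the interactive shell comes up with the Spark 2.4.3 banner and provides a ready-made SparkContext as sc (and a SparkSession as spark).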

9. Create a wordcount example in PyCharm

from pyspark import SparkConf, SparkContext
# Create the SparkConf and SparkContext
conf = SparkConf().setMaster("local").setAppName("lichao-wordcount")
sc = SparkContext(conf=conf)
# Input data
data = ["hello", "world", "hello", "word", "count", "count", "hello"]
# Turn the Python collection into a Spark RDD and operate on it
rdd = sc.parallelize(data)
resultRdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# Collect the RDD back to the driver and print each (word, count) pair
resultColl = resultRdd.collect()
for line in resultColl:
    print(line)
# Release the SparkContext when done
sc.stop()
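Running the script should print one (word, count) tuple per line, for example ('hello', 3), ('world', 1), ('word', 1), ('count', 2); the ordering of the tuples may vary.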

Setup complete!

Origin: www.cnblogs.com/yfb918/p/10978856.html