Installing Spark with Python on Windows

First you need to install Java
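
To check whether Java is already installed and visible on the PATH, you can run the following in a command prompt; the exact version string depends on the JDK you installed:

C:\>java -version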

Download, install, and configure Spark

Download the appropriate Spark version from the official Download Apache Spark™ page. Because Spark runs on Hadoop, you need the build matching the corresponding Hadoop version, and the page lists the Hadoop version each Spark release requires. Click the download link spark-2.3.1-bin-hadoop2.7.tgz to get the compressed package, which corresponds to Hadoop 2.7 and later.


Here it is extracted to D:\spark-2.3.1-bin-hadoop2.7. To simplify the following steps, rename the extracted folder to spark, so the final path is D:\spark.

Configure the environment variables

Right-click My Computer, then click Properties - Advanced System Settings - Environment Variables.

Create a new user variable SPARK_HOME with the value D:\spark


Find the system variable Path and click the New button, add the text %SPARK_HOME%\bin and press Enter; click New again, add the text %SPARK_HOME%\sbin and press Enter; then keep clicking OK to save the changes. This puts the programs in the bin and sbin folders on the system path.
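
To confirm the variables took effect, you can open a new command prompt and check them. This is a minimal sketch; if the Path entries are correct, where spark-submit should report files under D:\spark\bin:

C:\>echo %SPARK_HOME%

C:\>where spark-submit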


pyspark: this completes the Spark part of the configuration, but pyspark still needs to be configured; that is covered below, after Anaconda is installed. There are several ways to install pyspark: the extracted spark folder contains a pyspark library that can be installed into Python's libraries; if you would rather not copy it over, pyspark can be installed separately with pip; and there is also a standalone pyspark package that can be downloaded, extracted, and then installed into the Python libraries.


Install and configure Hadoop

The Spark download above specified a required Hadoop version, here 2.7 and later. Go to the official Apache Hadoop Releases page and download the binary of version 2.7.6 (the source download is the Hadoop source code). After downloading, extract it to D:\hadoop-2.7.6, and to simplify the following steps rename the extracted folder to hadoop, so the folder is D:\hadoop.


Configure the environment variables:

Right-click My Computer, then click Properties - Advanced System Settings - Environment Variables.


Create a new user variable HADOOP_HOME with the value D:\hadoop

Then find the system variable Path and click the New button, add the text %HADOOP_HOME%\bin and press Enter; click New again, add the text %HADOOP_HOME%\sbin and press Enter; then keep clicking OK to save the changes, so the programs in the bin and sbin folders are on the system path.


Click the link on the website to download the compressed package, then extract it and copy the winutils.exe and winutils.pdb files from it into the Hadoop installation folder, that is, into the directory D:\hadoop\bin.
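
As a quick sanity check, assuming the paths used above, a new command prompt should show the variable and the copied file:

C:\>echo %HADOOP_HOME%

C:\>dir D:\hadoop\bin\winutils.exe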

Once the Spark installation and configuration are complete, entering the pyspark command starts the interactive shell and prints the Spark startup banner.
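
In that shell a SparkContext is already available as sc, so a quick smoke test can look like the sketch below (the startup banner is omitted); summing the numbers 0 through 99 should print 4950:

>>> sc.parallelize(range(100)).sum()
4950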


Install and configure Anaconda


Download and install the appropriate version of Anaconda from the official Anaconda website; the installation path here is C:\Anaconda3.5.2.0. One thing to note: select the first option, which adds Anaconda to the environment variables, so that you do not have to add its path to the environment variables yourself afterwards.

Installing Anaconda is not required, but you must install Python; a standalone Python installation works as well. Anaconda, however, bundles many of the libraries you will need, so for convenience Anaconda is installed here.
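
If the option to add Anaconda to the environment variables was selected during installation, both of the following commands should print version information in a new command prompt:

C:\>python --version

C:\>conda --version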


Configure pyspark so that Anaconda's libraries include pyspark

When installing Spark earlier, it was mentioned that there are several ways to install the pyspark library: one is to install the pyspark library that ships with Spark directly into Python's libraries; one is to install it with the command pip install pyspark; and another is to download the pyspark package separately, extract it, and install it into the Python libraries. All of these methods are explained here.

 

Install the pyspark library that ships with Spark into Python:

Open cmd as administrator: press the Windows key on the keyboard, select Windows System, right-click Command Prompt, click More, and click Run as administrator.

 

Go into the python folder of the Spark installation directory: cd D:\spark\python

C:\>cd D:\spark\python

C:\>d:

 

D:\spark\python>

 

Enter the command python setup.py install and wait for the installation to complete.

D:\spark\python>python setup.py install

 

Once the installation output finishes without errors, pyspark has been installed.


Install from the command line with pip install pyspark:

Open cmd the same way as above, running it as administrator: press the Windows key, select Windows System, right-click Command Prompt, click More, and click Run as administrator.

Enter the command pip install pyspark and wait for the installation to complete. Note that the pyspark package itself is large, several hundred MB, and this method downloads pyspark online. If your network speed is decent, it is highly recommended: it is the simplest method and needs only a single command.

 

Download and install pyspark separately:

Go to the pyspark page on PyPI, click Download files on the left, download the pyspark package, and extract it; here the extraction path is D:\pyspark-2.3.1.

 

Open cmd the same way as above, running it as administrator: press the Windows key, select Windows System, right-click Command Prompt, click More, and click Run as administrator.

Change into the directory of the extracted folder.

Enter the command python setup.py install and wait for it to finish; pyspark is then installed.

D:\pyspark-2.3.1>python setup.py install

Any of the methods above will install pyspark; the most convenient is the command line pip install pyspark. The next part covers installing and configuring PyCharm and demonstrates an example of writing a Spark program in Python.
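
Whichever method you use, the installation can be verified from a command prompt; the version printed should match the package you installed (2.3.1 in this walkthrough):

C:\>python -c "import pyspark; print(pyspark.__version__)"
2.3.1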


Install and configure PyCharm

Download the Community edition of PyCharm from the official PyCharm website; this edition is free, and installing with the default settings is fine.

After installation, open PyCharm and configure the interface to your liking. At this step you can also install some plugins; here the Markdown plugin is installed.

 

From the opening screen, open Settings.

Select Project Interpreter, click the drop-down on the right, and then click Show All.


Click the + sign to add a project interpreter, select Conda Environment, then click Existing environment. Click the selection button on the right, go to the directory C:\Anaconda3.5.2.0, select the python.exe file there, and then keep clicking OK.


After the libraries finish loading, click OK and the Project Interpreter configuration is complete. Wait for the update to finish, or let it run in the background.


This configures the Project Interpreter at the very start; once inside the main interface, it can also be set under File-Settings or File-Default Settings.

Set your preferred font under File-Settings-Editor-Font.

 

Workflow for writing a Spark WordCount program in Python

Create a new project. After setting the directory where the project will be stored, be careful to choose Existing interpreter rather than New interpreter; the previous step configured the Project Interpreter, so click and select the interpreter that has already been set up. You can also create a new project by clicking File-New Project.

Wait for PyCharm to finish configuring; a prompt appears in the lower-right corner. Once that task completes, you can create new Python files.

Click Create and the project is created. Put the mouse over the project on the left, right-click, then click New-Python File to create a Python file named WordCount.py.

Open WordCount.py and write the following code. It is a Chinese-text version of WordCount, the classic distributed program; it uses the Chinese word-segmentation library jieba and removes stop words before counting.

Create the two input files: the text to be counted (D:\WordCount.txt) and the Chinese stop-word list (d:\中文停用词库.txt) used in the code below.


jieba word segmentation: https://pypi.org/project/jieba/#files


After downloading, import it into the project.
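
Alternatively, jieba can also be installed directly from the command line with pip, instead of importing the downloaded package into the project manually:

C:\>pip install jieba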


from pyspark.context import SparkContext
import jieba

sc = SparkContext("local", "WordCount")  # initialize the Spark context

data = sc.textFile(r"D:\WordCount.txt")  # read the UTF-8 encoded input file

# load the Chinese stop-word list
with open(r'd:\中文停用词库.txt', 'r', encoding='utf-8') as f:
    x = f.readlines()

stop = [i.replace('\n', '') for i in x]
print(stop)

# also filter out punctuation and other unwanted tokens
stop.extend([',','的','我','他','','。',' ','\n','?',';',':','-','(',')','!','1909','1920','325','B612','II','III','IV','V','VI','—','‘','’','“','”','…','、'])

# segment each line with jieba, drop stop words, then count words and sort by frequency
data = data.flatMap(lambda line: jieba.cut(line, cut_all=False)).filter(lambda w: w not in stop).\
    map(lambda w: (w, 1)).reduceByKey(lambda w0, w1: w0 + w1).sortBy(lambda x: x[1], ascending=False)

print(data.take(100))
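
To run the finished program, you can run WordCount.py inside PyCharm, or submit it with spark-submit from a command prompt opened in the directory that contains the script; it prints the 100 most frequent words:

D:\>spark-submit WordCount.py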




Source: https://www.jianshu.com/p/c5190d4e8aaa
