Spark is already deployed on a Linux virtual machine, but testing every Spark Python script on the VM is tedious. Instead, we can build a local test environment on Windows 7 with the PyCharm IDE. Running Spark Python scripts locally of course requires the Spark runtime, so the prerequisite is to set up Spark under the local Win7 system. The steps are as follows:
1. Build a local Spark environment for testing.
2. Point PyCharm at the Spark environment, so that scripts can successfully load the Spark packages at run time.
3. Write the program and run it.
[1 Build a spark environment for local testing]
To avoid problems caused by version mismatches, install the same Spark version locally on Windows as the one running on the virtual machine.
Spark depends on Scala, and Scala depends on the JDK, so both the JDK and Scala must be installed and configured.
On the virtual machine:
jdk1.7
scala 2.11.8
spark-2.0.1
It is best to install the same versions on the local Windows machine.
Download the Windows builds of the corresponding versions. Note that jdk1.7 for Linux and for Windows are different packages, while the Scala and Spark packages are identical on both systems: the JDK is what provides platform independence, so it is OS-specific, whereas Scala and Spark run on top of the JDK and are therefore already platform independent. Then configure the environment variables for the JDK, Scala, and Spark.
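Before moving on, it is worth confirming the variables actually took effect. A minimal sketch of such a check, assuming the conventional variable names JAVA_HOME, SCALA_HOME, and SPARK_HOME (adjust the list to whatever names you configured):

```python
import os

# Conventional variable names assumed here; edit to match your setup.
REQUIRED_VARS = ["JAVA_HOME", "SCALA_HOME", "SPARK_HOME"]

def missing_env_vars(names):
    """Return the names in `names` that are not set (or empty) in the environment."""
    return [n for n in names if not os.environ.get(n)]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All Spark-related environment variables are set")
```

Run it from a fresh console so that newly added variables are picked up.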
[2 Install the python environment]
Different Spark releases support different Python versions. You can see the requirement in the `bin/pyspark` script under the Spark directory:
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
The latest version, Python 3.6, does not work well: under a Python 3.6 environment the following error is reported:
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
Since Python 2.7 or later is required, I chose Python 3.5. After configuring the Python 3.5 environment variables, check that the interpreter runs normally.
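A quick way to check is to let Python itself report whether the interpreter version falls in the range this article found workable (2.7, or 3.x below 3.6, which hits the namedtuple() error above). A small sketch, not part of the original setup:

```python
import sys

def python_version_ok(version_info=sys.version_info):
    """True if the interpreter version is 2.7+, or 3.x below 3.6."""
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor >= 7
    if major == 3:
        return minor < 6
    return False

if __name__ == "__main__":
    verdict = "ok" if python_version_ok() else "unsupported"
    print(sys.version.split()[0], verdict)
```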
[3 Install and configure PyCharm]
Download and install PyCharm (activation methods can be found via Baidu; with a school email address you can register for free). Create a new Python project named testspark and write a test.py script as follows:
import os
import sys
import re

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
Select test.py, right-click, and choose [run 'test.py']. It outputs:
Can not import Spark Modules No module named 'pyspark'
This means the try block in the code failed: the Spark library files could not be loaded by
from pyspark import SparkContext
from pyspark import SparkConf
You need to tell the run configuration where the Spark library files are. Configure as follows: Run menu bar -> Edit Configurations -> select the script file test.py -> Environment variables -> add the Spark environment variables -> apply, save, and run again.
// Location of the python directory and of py4j under the local Spark directory
PYTHONPATH=C:\Java\spark-2.0.1-bin-hadoop2.7\python;C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip
// Location of the Spark directory
SPARK_HOME=C:\Java\spark-2.0.1-bin-hadoop2.7
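As an alternative to editing the Run Configuration, the same two variables can be set at the top of the script itself, before pyspark is imported. A sketch, assuming the install paths used in this article (adjust them to your own machine):

```python
import os
import sys

# Path to the local Spark installation from this article; adjust as needed.
SPARK_HOME = r"C:\Java\spark-2.0.1-bin-hadoop2.7"

# Equivalent to setting SPARK_HOME and PYTHONPATH in the Run Configuration:
# make the Spark python package and the bundled py4j importable.
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.3-src.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
```

Because this runs before the `from pyspark import ...` lines, the import can now succeed without any per-script IDE configuration.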
Execute test.py again: this time the modules load successfully, and you can test Spark code locally. Note: the configured environment variables apply only to the current script's run configuration, so after creating a new .py file you need to add the PYTHONPATH and SPARK_HOME environment variables again.
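Once the import works, a small smoke test confirms the whole stack runs. The sketch below reuses the article's guarded-import pattern; the job itself (doubling a tiny RDD on a local master) is an illustrative example, not part of the original article:

```python
# Guarded import, as in test.py above, so the script degrades gracefully.
try:
    from pyspark import SparkConf, SparkContext
    HAVE_PYSPARK = True
except ImportError:
    HAVE_PYSPARK = False

def run_smoke_test():
    """Create a local SparkContext, double a small RDD, and return the sum."""
    conf = SparkConf().setMaster("local[2]").setAppName("testspark")
    sc = SparkContext(conf=conf)
    try:
        return sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).sum()
    finally:
        sc.stop()

if __name__ == "__main__":
    if HAVE_PYSPARK:
        print("sum of doubled values:", run_smoke_test())
    else:
        print("pyspark not importable -- check PYTHONPATH and SPARK_HOME")
```

If the script prints the sum, the local Win7 + PyCharm + Spark setup is complete.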