Configure PySpark in PyCharm on Windows 7 to build a local Spark test environment

Spark is already deployed on a Linux virtual machine, but testing every Spark Python script on the VM is tedious. It is therefore convenient to build a local test environment on the Windows 7 host with the PyCharm development tool. Running Spark Python scripts locally still requires the Spark runtime, so the prerequisite is a local Spark installation under Windows 7. The steps are as follows:



 1. Build a local Spark environment for testing.
 2. Configure PyCharm to point at that Spark environment, so that scripts can call the Spark libraries at runtime.
 3. Write the program and run it.

[1 Build a Spark environment for local testing]
To avoid problems caused by version mismatches, install the same Spark version under local Windows as on the virtual machine.
Spark depends on Scala, and Scala depends on the JDK, so the JDK and Scala must both be installed and configured as well.
The virtual machine runs:

    jdk1.7
    scala   2.11.8
    spark-2.0.1



It is best to install the same versions on the local Windows machine.


Download the Windows builds of each (the JDK 1.7 builds differ between Linux and Windows, while Scala and Spark are identical on both systems: the JDK is what provides platform independence, so it is OS-specific, whereas Scala and Spark run on top of the JDK and are therefore already platform independent). Then configure the environment variables for the JDK, Scala, and Spark.


[2 Install the Python environment]


Different Spark releases support different Python versions. You can see the requirement in the bin/pyspark script in the Spark directory:

if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi

The latest release, Python 3.6, does not work well here; under a Python 3.6 environment the following error is reported:
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

So, needing version 2.7 or above, I chose Python 3.5. After configuring the Python 3.5 environment variables, verify that the interpreter runs normally.
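A quick sanity check of which interpreter PyCharm will pick up and whether its version is in range (the 2.7 floor comes from the bin/pyspark snippet above; 3.6 triggers the namedtuple error):

```python
import sys

# Spark 2.0.x works with Python 2.7 and 3.4/3.5; Python 3.6 raises
# the namedtuple() TypeError shown above.
print(sys.executable)            # path of the interpreter in use
print(sys.version_info[:2])      # (major, minor)
assert sys.version_info[:2] >= (2, 7), "Python 2.7 or newer required"
```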


[3 Install and configure PyCharm]
Download and install PyCharm (license workarounds can be found via Baidu; with a school email address you can register for a free license). Create a new Python project named testspark, and write the following test.py script:

import os
import sys
import re
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)

Select test.py, right-click, and choose [run 'test.py']; it outputs

Can not import Spark Modules No module named 'pyspark'

This means the import in the try block failed: the Spark library files could not be loaded for

   from pyspark import SparkContext
   from pyspark import SparkConf

You need to tell the run configuration where the Spark libraries live. Configure it as follows: Run menu -> Edit Configurations -> select the script file test.py -> Environment variables -> add the Spark environment variables below -> Apply, save, and run again.


// location of the python directory and of py4j under the local Spark directory

PYTHONPATH=C:\Java\spark-2.0.1-bin-hadoop2.7\python;C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip

// Spark directory location
SPARK_HOME=C:\Java\spark-2.0.1-bin-hadoop2.7

Execute test.py again


Loaded successfully. You can now test Spark code locally. Note: the environment variables you configured apply only to the current script's run configuration, so after creating a new .py file you must add the PYTHONPATH and SPARK_HOME variables for it again.
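As an alternative to re-adding the variables in every run configuration, the same paths can be set at the top of the script itself before importing pyspark. A minimal sketch, assuming this article's install paths (adjust them to your own layout); the small parallelize job at the end is just one way to verify the setup end to end:

```python
import os
import sys

# Paths from this article's local Windows install; change to match yours.
SPARK_HOME = r"C:\Java\spark-2.0.1-bin-hadoop2.7"
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.3-src.zip"))

try:
    # With the paths in place, the imports from test.py succeed,
    # and a tiny local job confirms Spark actually runs.
    from pyspark import SparkContext
    sc = SparkContext("local[2]", "testspark")
    print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).sum())  # 1+4+9+16
    sc.stop()
except ImportError as e:
    print("Can not import Spark Modules", e)
```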

