Solutions to errors when running a PySpark program remotely

Scenario description

The setup: an SSH remote interpreter is configured in the local PyCharm, which is equivalent to developing locally while the code is synchronized to and executed on the remote server, which is very convenient. While following a teaching video to build a small demo, I ran into several problems that did not appear in the video, and I am recording them here. I verified that both the pyspark interactive shell and the spark-submit command run normally on the remote server.
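Before chasing Spark-specific errors, it helps to confirm that the SSH remote interpreter itself is working. Here is a minimal sanity check of my own (not from the video) that just prints which Python interpreter PyCharm is actually running on the server:

import sys

# If the remote interpreter is configured correctly, this prints the
# server-side interpreter path and version, not the local one.
print(sys.executable)
print(sys.version)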

1. Question 1

Error: JAVA_HOME not set.

At first I thought JAVA_HOME had to be configured in the .bashrc of the user the interpreter runs as, but that did not work. Another solution I saw online is to configure JAVA_HOME in sbin/spark-config.sh under the Spark installation directory; that did not work either. The solution that did work, also found online, is to add the following code to the local py file (set JAVA_HOME to the JDK installation path on the remote server).

import os
# Point JAVA_HOME at the JDK installation on the remote server
os.environ['JAVA_HOME'] = "/export/server/jdk1.8.0_241"
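As a quick way to verify the setting took effect, a small check of my own (using the JDK path assumed above) confirms that the variable is visible to the process and points at a real directory on the server:

import os

# JAVA_HOME must point at an existing JDK directory on the remote server
java_home = os.environ.get('JAVA_HOME', '')
print(java_home, os.path.isdir(java_home))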

2. Question 2

The program exits immediately after running the Spark initialization code below, and none of the code after it gets executed.

from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # Initialize the execution environment and build the SparkContext object
    conf = SparkConf().setAppName("test").setMaster("local[3]")
    sc = SparkContext(conf=conf)

Comparing with the output in the video, I found that the line "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties" was missing. From this I guessed that the problem was that the location of Spark could not be found. Online information also suggested setting SPARK_HOME, and this worked: add the following code to the local py file (set SPARK_HOME to the Spark installation path).

# Point SPARK_HOME at the Spark installation on the remote server
os.environ['SPARK_HOME'] = "/export/server/spark-3.2.0-bin-hadoop3.2"
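A similar check of my own (assuming the same installation path) confirms that SPARK_HOME points at a real Spark installation; a binary Spark distribution should contain a jars/ directory:

import os

# SPARK_HOME should point at an existing Spark installation on the server
spark_home = os.environ.get('SPARK_HOME', '')
print(spark_home, os.path.isdir(os.path.join(spark_home, 'jars')))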

3. Question 3

Error: Exception: Python in worker has different version 3.6 than that in driver 3.8.

Noticing that this is a version problem, I first thought I had not specified the Python version when creating the virtual environment, so I recreated the virtual environment with an explicit Python version, but the error still occurred. The solution that worked, again found online, is to add the following code to the local py file (set PYSPARK_PYTHON to the Python interpreter path inside the virtual environment).

import os
# Make the workers use the Python interpreter from the virtual environment
os.environ['PYSPARK_PYTHON'] = '/export/software/anaconda3/envs/pyspark/bin/python'
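To double-check that the mismatch is gone, here is a sketch of my own (assuming the virtual-environment path above) that compares the driver's Python version with the version of the interpreter the workers will be launched with:

import os
import subprocess
import sys

# Version of the driver-side interpreter
print("driver:", sys.version.split()[0])
# Version of the interpreter the workers will use (set via PYSPARK_PYTHON above)
worker_python = os.environ.get('PYSPARK_PYTHON', sys.executable)
print("worker:", subprocess.check_output([worker_python, '--version']).decode().strip())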

4. Question 4

After solving the above three problems, I encountered the following error.
[Screenshot: Question 4 error message]

I could not find this problem online, which surprised me at first because I can usually find something. Comparing with the video, I then noticed that the video installs pyspark 3.2.0, which matches the Spark version, whereas pip had automatically installed the latest pyspark, 3.3.0, for me, causing the inconsistency. Reinstalling pyspark 3.2.0 solved the problem.
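A quick way to confirm such a mismatch is to print the installed pyspark version and compare it with the Spark build on the server (here spark-3.2.0-bin-hadoop3.2, so 3.2.0 is expected); the reinstall itself can be done with something like pip install pyspark==3.2.0 in the virtual environment:

import pyspark

# The library version should match the Spark version installed on the server
print(pyspark.__version__)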

Thoughts

A very specific problem like Question 4 is most likely version-related; even a small version difference can cause errors, so the software version (e.g. Spark) and the library version (e.g. pyspark) need to be consistent. From the first three problems it can be seen that this remote code submission method apparently cannot access the environment variables set on the server, so they have to be specified explicitly in the program.
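Putting the three fixes together, the top of the local py file ends up looking like the sketch below; the paths are the ones from my server, so adjust them to your own layout. The key point is that the environment variables are set before the SparkContext is created:

import os
from pyspark import SparkConf, SparkContext

# Explicitly set the server-side environment variables that the remote
# submission cannot see (paths are examples from this post).
os.environ['JAVA_HOME'] = "/export/server/jdk1.8.0_241"
os.environ['SPARK_HOME'] = "/export/server/spark-3.2.0-bin-hadoop3.2"
os.environ['PYSPARK_PYTHON'] = '/export/software/anaconda3/envs/pyspark/bin/python'

if __name__ == '__main__':
    # Initialize the execution environment and build the SparkContext object
    conf = SparkConf().setAppName("test").setMaster("local[3]")
    sc = SparkContext(conf=conf)
    print(sc.version)
    sc.stop()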

Origin blog.csdn.net/CloudInSky1/article/details/127246089