Are you still installing your PySpark environment with pip?

Foreword:
Before reading this article, you may have been installing pyspark with pip install and occasionally running into errors caused by inconsistent installed versions. This article describes how to set up a Python development environment for Spark. It assumes that Spark is already installed; if it is not, see the blogger's previous article, "Spark pseudo-distributed construction". Now let's get to the main topic:

1. Use the local shell

This is very simple. After installing Spark, run the $SPARK_HOME/bin/pyspark script to start a pyspark shell. It resembles the IPython interactive shell and is usually used for quick tests, but it is laborious for real development.
All it requires is that $SPARK_HOME/bin is on your PATH, configured in ~/.bashrc or /etc/profile. Then enter the following command:

pyspark
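If the command is not found, the usual cause is that $SPARK_HOME/bin is missing from PATH. The following is a minimal sketch for checking this from Python; the helper name find_pyspark_launcher is hypothetical and only wraps the standard library's shutil.which.

```python
import shutil


def find_pyspark_launcher(search_path=None):
    """Return the full path of the pyspark launcher, or None if it is not found."""
    # search_path=None makes shutil.which fall back to the PATH environment variable.
    return shutil.which("pyspark", path=search_path)


launcher = find_pyspark_launcher()
print(launcher or "pyspark not found on PATH; check that $SPARK_HOME/bin was added")
```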

When the shell starts, it automatically creates a variable named spark for us, of type SparkSession. It also creates others, such as sc (the SparkContext) and sql (spark.sql). It is easy to see how this is implemented: open the pyspark file in the bin directory.

The script is quite simple. The core of starting the pyspark shell is its reference to the shell.py file. Opening shell.py at that path shows the code that creates these common variables. Once you understand how it is implemented, it feels less magical. This is Spark's interactive Python development environment.
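To illustrate the idea, here is a minimal sketch of how such variables can be created by hand. It is not the actual content of shell.py, which differs between Spark versions. The function name create_shell_variables and the default application name are hypothetical, and the sketch assumes pyspark is importable when the function is called.

```python
def create_shell_variables(app_name="demo-shell"):
    """Create variables similar to the ones the pyspark shell predefines."""
    # Imported lazily so this module can be loaded even where pyspark is missing.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    sc = spark.sparkContext  # the underlying SparkContext
    sql = spark.sql          # shortcut for running SQL queries
    return spark, sc, sql
```

In a plain Python session, spark, sc, sql = create_shell_variables() gives you roughly what the shell provides on startup.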
Next, let's configure a development environment in an IDE.

2. Use PySpark in PyCharm

  • If you have just installed Spark and have not installed pyspark separately, entering the following code in PyCharm fails:
import pyspark

PyCharm reports "No module named 'pyspark'", meaning the package cannot be found. At this point you might search Baidu, where many answers tell you to simply pip install it, but that can cause a lot of dependency problems. As we saw above, the Spark installation already comes with its own pyspark package, so why not use it directly? The Spark development team was considerate enough to bundle the pyspark package in the distribution, and we can use it as is.
Before using it, it helps to understand one concept: what happens when we install a Python package with pip install?
Usually we install a package with pip install package_name==version. pip searches the remote repository for the package and, if it exists, downloads it and then runs the package's setup.py script to install it. The install directory is the site-packages directory under $PYTHON_HOME.
With this concept in mind, the following steps should not be difficult to understand.
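To see where your own interpreter keeps installed packages, and where it would load a given package from, a small sketch using only the standard library may help. The helper names site_packages_dir and locate_package are hypothetical.

```python
import importlib.util
import sysconfig


def site_packages_dir():
    """Return the directory where pure-Python packages are installed."""
    return sysconfig.get_paths()["purelib"]


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("site-packages:", site_packages_dir())
print("pyspark found at:", locate_package("pyspark"))
```

If the second line prints None, the interpreter cannot see pyspark yet, which matches the import error described above.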
There are two ways to configure:

  1. Add the following code to the program:
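import os
import sys

# Tell pyspark where Spark is installed (replace with your own path)
os.environ['SPARK_HOME'] = '/xxx/xxxx/spark'
# Make the pyspark package bundled with Spark importable
sys.path.append('/xxxx/xxxx/spark/python')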
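A slightly more defensive variant of the same idea is sketched here. Spark distributions ship py4j as a zip file under python/lib, and pyspark needs it at import time, so this version adds it as well. The function name add_spark_to_path is hypothetical, and the Spark path you pass in is your own.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Point this process at the pyspark package bundled with Spark."""
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")
    # py4j is shipped as a zip archive; zip files can be placed on sys.path.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added
```

Call add_spark_to_path with your Spark installation directory before the first import pyspark in the program.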

Needless to say, this method is simple and crude, and it treats the symptoms but not the root cause.
  2. Configure it in PyCharm, which is the method we will focus on today:

(1) Click Run in the toolbar, and then click 'Edit Configurations'.
(2) In the dialog that pops up, click the $ icon to edit the runtime environment variables.

(3) Add the two variables PYTHONPATH and SPARK_HOME in this box.
(4) Save and exit, then return to the code editor. The program now runs without errors, but there is another problem: code completion does not work. This is where the earlier point about site-packages comes in. Copy the two required directories under $SPARK_HOME/python (the two highlighted in the original screenshot) into the site-packages directory of your Python installation.
(5) When the copy is complete, check Python's site-packages directory to confirm that the directories are there. I use an Anaconda environment, so the location is a little different.
Finally, restart PyCharm and you will find that code completion works. It is still very simple.
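The copy in steps (4) and (5) can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later. The function name copy_packages is hypothetical, and you must pass in the directory names yourself, since they depend on your Spark distribution.

```python
import os
import shutil
import sysconfig


def copy_packages(source_root, names, dest=None):
    """Copy the directories in names from source_root into dest.

    dest defaults to the current interpreter's site-packages directory.
    """
    if dest is None:
        dest = sysconfig.get_paths()["purelib"]
    copied = []
    for name in names:
        src = os.path.join(source_root, name)
        if not os.path.isdir(src):
            raise FileNotFoundError(f"missing directory: {src}")
        target = os.path.join(dest, name)
        # dirs_exist_ok (Python 3.8+) lets a re-run overwrite an earlier copy.
        shutil.copytree(src, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

Run it with the interpreter that PyCharm uses for the project, so that the default destination is the site-packages directory PyCharm indexes.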
If you have any questions, please comment and discuss


Origin blog.csdn.net/qq_42359956/article/details/105764568