Building a Spark Python programming environment on Linux

Spark programming environment

Spark installation

Go to the Spark download page and select the latest version; at the time of writing, the latest version is 2.4.2. After downloading, extract the archive to an installation folder of your choice; here we install it to the /opt directory.

tar -xzf spark-2.4.2-bin-hadoop2.7.tgz
mv spark-2.4.2-bin-hadoop2.7 /opt/spark-2.4.2

To enter the Spark shell environment directly from a terminal, you need to configure the corresponding environment variables. Since I use zsh, I add the configuration to ~/.zshrc.

If you do not have zsh installed, configure it in ~/.bashrc instead.

# Edit the zshrc file
sudo gedit ~/.zshrc

# Add the following:
export SPARK_HOME=/opt/spark-2.4.2
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

After the configuration is complete, typing spark-shell or pyspark in the shell takes you into Spark's interactive programming environment: the former opens a Scala interactive environment, the latter a Python interactive environment.
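For example, once the pyspark shell is open, a quick check might look like the following (a minimal sketch; sc is the SparkContext that the pyspark shell creates for you automatically):

# Inside the pyspark interactive shell, `sc` is already defined
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).sum())  # should print 55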

Configuring Python programming environment

Two programming environments are introduced here: Jupyter and Visual Studio Code. The former is convenient for interactive programming, the latter for final integrated development.

PySpark in Jupyter

First, how to use Spark in Jupyter. Note that the approach is the same for Jupyter notebook and Jupyter lab; the configuration below uses Jupyter lab as the example:

There are two ways to use PySpark in Jupyter lab:

  • Configure the PySpark launcher to be Jupyter lab, so that running pyspark automatically opens a Jupyter lab;
  • Open a normal Jupyter lab and use the findspark package to load PySpark.

The first option is quicker, but it is specific to Jupyter; the second is a more general approach that makes PySpark available in any IDE you like, and it is the method strongly recommended here.

Method one: configure the PySpark launcher

Update the PySpark launcher environment variables by adding the following to the ~/.zshrc file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'

If you are using Jupyter notebook, change the value of the second variable (PYSPARK_DRIVER_PYTHON_OPTS) to notebook.

Reload the environment variables (or restart the machine) and run the pyspark command; this will directly open a Jupyter lab with Spark already started.

pyspark

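In a notebook opened this way, the pyspark startup script normally defines the SparkContext for you, so sc should already be available. A minimal sanity check, assuming that default setup:

print(sc.version)                          # should print the Spark version, e.g. 2.4.2
print(sc.parallelize(range(10)).count())   # should print 10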

Method two: the findspark package

Another, more general way to use PySpark in Jupyter lab is to use the findspark package to provide the Spark context from within your code.

The findspark package is not specific to Jupyter lab; you can use this method in other IDEs as well, which makes it more general, and it is the recommended approach.

First install findspark:

pip install findspark

After opening a Jupyter lab, import and initialize the findspark package before doing any Spark programming, for example:

# Import and initialize findspark
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
import random

# Configure and start the Spark context
conf = SparkConf().setMaster("local[*]").setAppName("Pi")
sc = SparkContext(conf=conf)

# Estimate pi with a Monte Carlo simulation
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

Run the example; the printed value should be close to π (approximately 3.14159).



PySpark in VS Code

As an excellent editor, Visual Studio Code is very convenient for Python development. First, some extensions I personally use and recommend:

  • Python: a must-install extension that provides Python language support;
  • Code Runner: supports running whole files or selected code fragments;

In addition, when using Spark in VS Code you do not need the findspark package; you can write Spark code directly:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf)

logFile = "file:///opt/spark-2.4.2/README.md"
logData = sc.textFile(logFile, 2).cache()

numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print("Lines with a: {0}, Lines with b: {1}".format(numAs, numBs))
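With the Code Runner extension installed you can run this file straight from the editor; alternatively, run it from a terminal with python, or submit it with spark-submit, assuming the SPARK_HOME and PYTHONPATH variables configured earlier are in effect.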
