Spark programming environment
Spark installation
Visit the Spark download page and select the latest version to download; at the time of writing, the latest version is 2.4.2. After downloading, extract the archive to an installation folder of your choice; here we install it under /opt.
```shell
tar -xzf spark-2.4.2-bin-hadoop2.7.tgz
mv spark-2.4.2-bin-hadoop2.7 /opt/spark-2.4.2
```
To open the Spark shell environment directly from the terminal, you need to configure the corresponding environment variables. Since I use zsh, I add the configuration to ~/.zshrc. If you do not use zsh, add the same configuration to ~/.bashrc instead.
```shell
# Edit the zshrc file
sudo gedit ~/.zshrc

# Add the following:
export SPARK_HOME=/opt/spark-2.4.2
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
```
After the configuration is complete, typing spark-shell or pyspark in the shell enters Spark's interactive programming environment: the former opens a Scala interactive environment, the latter a Python one.
Configuring the Python programming environment
This section introduces two programming environments, Jupyter and Visual Studio Code. The former is convenient for interactive programming; the latter is convenient for eventual integrated development.
PySpark in Jupyter
First, how to use Spark in Jupyter. Note that the approach is the same for Jupyter notebook and Jupyter lab; the configuration below uses Jupyter lab as the example.
There are two ways to use PySpark in Jupyter lab:

- Configure the PySpark launcher so that running pyspark automatically opens a Jupyter lab;
- Open a normal Jupyter lab and use the findspark package to load PySpark.
The first option is quicker, but it is specific to Jupyter; the second option is more general and makes PySpark available in any IDE you like, so the second method is strongly recommended.
Method 1: Configure the PySpark launcher
Update the PySpark launcher's environment variables by adding the following to the ~/.zshrc file:
```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
```
If you are using Jupyter notebook, change the second variable's value to notebook.
After refreshing the environment variables (or restarting the machine), running the pyspark command directly opens a Jupyter lab with Spark already started:

```shell
pyspark
```
Method 2: The findspark package
There is another, more general way to use PySpark in Jupyter lab: use the findspark package to provide the Spark context in your code. The findspark package is not specific to Jupyter lab; you can use this approach in other IDEs as well, which is why it is more general and the recommended method.
First install findspark:
```shell
pip install findspark
```
After opening a Jupyter lab, import and initialize the findspark package before doing any Spark programming, for example:
```python
# Import findspark and initialize it
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
import random

# Configure Spark
conf = SparkConf().setMaster("local[*]").setAppName("Pi")
# Start the Spark context
sc = SparkContext(conf=conf)

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```
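The example estimates π by Monte Carlo sampling: the fraction of random points in the unit square that fall inside the quarter circle approaches π/4. The same logic can be sanity-checked in plain Python without Spark; the sketch below uses a fixed seed and a smaller sample count (both choices are mine, not from the original):

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

def inside(_):
    # A point (x, y) in the unit square lies inside the quarter
    # circle when x^2 + y^2 < 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

num_samples = 100000
count = sum(1 for i in range(num_samples) if inside(i))
pi = 4 * count / num_samples
print(pi)  # roughly 3.14
```

With 100,000 samples the estimate is only accurate to about two decimal places; the Spark version uses far more samples precisely because the work is distributed.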
Run the sample:
PySpark in VS Code
Visual Studio Code is an excellent editor and is very convenient for Python development. First, some plugins I personally use and recommend:

- Python: the must-install plugin, providing Python language support;
- Code Runner: supports running files or code fragments;
In addition, using Spark in VS Code does not require the findspark package; you can program against PySpark directly:
```python
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf)

logFile = "file:///opt/spark-2.4.2/README.md"
logData = sc.textFile(logFile, 2).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print("Lines with a: {0}, Lines with b: {1}".format(numAs, numBs))
```
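The two filter calls simply count lines containing the character 'a' or 'b'. A Spark-free sketch of the same counting logic, over a few made-up sample lines (the lines are hypothetical, not from the Spark README), makes clear what the two numbers measure:

```python
# Spark-free sketch of the same line-counting logic, over made-up lines.
lines = [
    "Apache Spark is a unified analytics engine",
    "it can run on a laptop or a big cluster",
    "fast and general-purpose",
]

# Same predicates as the lambdas passed to logData.filter above
numAs = len([line for line in lines if 'a' in line])
numBs = len([line for line in lines if 'b' in line])
print("Lines with a: {0}, Lines with b: {1}".format(numAs, numBs))
```

In the Spark version the only difference is that the filtering and counting are distributed over the RDD's partitions instead of running in a list comprehension.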