0485 - How to specify the PySpark Python runtime environment in code

Fayson's GitHub:
https://github.com/fayson/cdhproject

1 Purpose of this document

In the previous article "0483 - How to specify the PySpark Python runtime environment", Fayson described how to specify the Python runtime environment when submitting a job with spark2-submit. Some users need to specify the Python runtime environment directly in PySpark code, so this article describes how to do that.

  • Test environment

1. RedHat 7.2

2. CM and CDH version 5.15.0

3. Python 2.7.5 and Python 3.6

2 Prepare the Python environment

Here Fayson prepares two environments, Python2 and Python3; the preparation steps are as follows:

1. Download the Python2 and Python3 installation packages from the Anaconda official website; Fayson does not cover the installation process here.

The two installation packages are Anaconda2-5.3.1-Linux-x86_64.sh and Anaconda3-5.2.0-Linux-x86_64.sh.

2. Package the Python2 and Python3 environments from inside their respective installation directories.

Use the zip command to package the two environments separately:

[root@cdh05 anaconda2]# cd /opt/cloudera/anaconda2
[root@cdh05 anaconda2]# zip -r /data/disk1/anaconda2.zip ./*

[root@cdh05 anaconda3]# cd /opt/cloudera/anaconda3
[root@cdh05 anaconda3]# zip -r /data/disk1/anaconda3.zip ./*

Note: the archives are created from inside the Python installation directories, so they do not include the parent directory itself.
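
Because the job configuration later points at python/bin/python3.6 inside the unpacked archive, the interpreter must sit at the top level of the zip. A quick sanity check, as an illustrative sketch that is not part of the original article, using Python's zipfile module:

import zipfile

# Path of the archive produced by the packaging step above
with zipfile.ZipFile("/data/disk1/anaconda3.zip") as zf:
    names = zf.namelist()

# Expect True: the interpreter is at the top level, not nested under a parent directory
print("bin/python3.6" in names)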

3. Upload the prepared Python2 and Python3 packages to HDFS

[root@cdh05 disk1]# hadoop fs -put anaconda2.zip /tmp
[root@cdh05 disk1]# hadoop fs -put anaconda3.zip /tmp
[root@cdh05 disk1]# hadoop fs -ls /tmp/anaconda*

After completing these steps, the PySpark runtime environments are ready; the next step is to specify the runtime environment in the code when submitting the job.

3 Prepare the PySpark sample job

Here a simple PySpark job that computes PI serves as the example. Compared with the sample code in the previous article, it adds the configuration that specifies the Python runtime environment. The sample code is as follows:

from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

# spark.pyspark.python: the Python the executors run; "python/bin/python3.6" is a path
#   inside the archive that is unpacked on each container under the alias "python".
# spark.pyspark.driver.python: the Python the driver uses on the local node.
# spark.yarn.dist.archives: anaconda3.zip on HDFS is shipped to the containers and
#   unpacked under the alias given after the "#".
spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .config("spark.pyspark.python", "python/bin/python3.6") \
    .config("spark.pyspark.driver.python", "python3.6") \
    .config("spark.yarn.dist.archives", "hdfs://nameservice1/tmp/anaconda3.zip#python") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# To point the driver at a specific local interpreter instead, use for example:
# .config("spark.pyspark.driver.python", "/opt/cloudera/anaconda2/bin/python2.7")

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
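
For reference, the Python2 environment prepared earlier can be selected in the same way. The following is a hedged sketch, not part of the original article, of the corresponding configuration, assuming anaconda2.zip was uploaded to /tmp as shown above:

from pyspark.sql import SparkSession

# Counterpart configuration for the Python2 environment (illustrative sketch)
spark = SparkSession \
    .builder \
    .appName("PythonPiPython2") \
    .config("spark.pyspark.python", "python/bin/python2.7") \
    .config("spark.pyspark.driver.python", "/opt/cloudera/anaconda2/bin/python2.7") \
    .config("spark.yarn.dist.archives", "hdfs://nameservice1/tmp/anaconda2.zip#python") \
    .getOrCreate()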

4 Run the example

Before running, first load the Spark and PySpark environment variables; otherwise executing the Python code fails with an error that the "SparkSession" module cannot be found. Also make sure the node where the Python code runs has the Spark2 Gateway client configuration deployed.

1. Run the following commands on the command line to load the Spark and Python environment variables

export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
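
As an alternative that is not shown in the original article, the same paths can be set from inside the Python script before pyspark is imported, for example:

import os
import sys

# Assumed paths, mirroring the exports above
spark_home = "/opt/cloudera/parcels/SPARK2/lib/spark2"
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, "python/lib/pyspark.zip"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.4-src.zip"))
sys.path.insert(0, os.path.join(spark_home, "python"))

from pyspark.sql import SparkSession  # import only after the paths are set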

2. Run the pi_test.py code from the command line using the python command

[root@cdh05 ~]# python pi_test.py 

The job is submitted successfully.

3. The job runs successfully.

4. Check the Python environment used by the job.
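
To confirm which interpreters are actually in use, a few illustrative lines, not part of the original article, could be added to pi_test.py before spark.stop():

def python_version(_):
    # Runs inside an executor task and reports its interpreter version
    import sys
    return sys.version

print("Driver Python:", sys.version)
print("Executor Python:", spark.sparkContext.parallelize(range(4), 2).map(python_version).distinct().collect())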

5 Summary

When using the python command to execute PySpark code, you need to make sure the Spark environment variables are set on the node where the code runs.

You need to set the SPARK_HOME and PYTHONPATH environment variables before running the code, so that the Python libraries bundled with Spark are loaded onto the Python path.

After the Python2 and Python3 PySpark runtime environments are packaged and placed on HDFS, job startup is somewhat slower than before, because the Python environment has to be fetched from HDFS first.



Origin: blog.csdn.net/Hadoop_SC/article/details/103945719