Spark 2.x PySpark cluster environment setup (CentOS 6, Python 3)

PySpark cluster deployment

1. Install Spark 2.1.0 (omitted)

2. Install the Python 3 environment

Download Anaconda3-4.2.0-Linux-x86_64.sh. Download address: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-4.2.0-Linux-x86_64.sh
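
For example, on a node with internet access (downloading into /tmp is an assumption):

cd /tmp
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-4.2.0-Linux-x86_64.sh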

Install Anaconda3. Switch to a user that has HDFS read/write permissions and can execute Spark programs (for example, the hadoop user), then run:

sh Anaconda3-4.2.0-Linux-x86_64.sh

Press Enter to continue.

Type yes to accept the license agreement.

The bundled Python version is Python 3.5.2.

When prompted, enter the installation path (do not use the current user's home directory; here the path is /opt/py3/anaconda3).
Note: before confirming, change the owner of the /opt/py3 directory to the current user.

Answer no when asked whether to add Anaconda to the global environment variables (they are configured manually in step 4).
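
For non-interactive installs, the Anaconda installer also supports batch mode; a sketch assuming the same prefix (batch mode implicitly accepts the license and does not touch the shell profile):

sh Anaconda3-4.2.0-Linux-x86_64.sh -b -p /opt/py3/anaconda3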
3. Fix exceptions

If the Hive environment is configured in the $SPARK_HOME/conf directory (hive-env.sh and hive-site.xml are present), the Oracle JDBC driver package needs to be added to the Spark classpath:

Add ojdbc14-10.2.0.4.0.jar to the $SPARK_HOME/jars directory.
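
For example (the source location of the driver jar on the node is an assumption):

cp /opt/oracle/ojdbc14-10.2.0.4.0.jar $SPARK_HOME/jars/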

ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

Add the following to spark-env.sh:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
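
The HADOOP_LZO parcel version in these paths must match what is actually installed; a quick way to confirm the referenced directories exist on the node:

ls /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/
ls /opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native/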

4. Configure environment variables

#spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME

#python3
export PYSPARK_PYTHON=/opt/py3/anaconda3/bin/python
export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
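
These variables can be appended to the hadoop user's shell profile (using ~/.bashrc here is an assumption); a quick sanity check that the Anaconda interpreter and the pyspark module both resolve:

source ~/.bashrc
$PYSPARK_PYTHON --version
$PYSPARK_PYTHON -c "import pyspark; print(pyspark.__file__)"

The first command should report Python 3.5.2, and the second should print a path under $SPARK_HOME/python/pyspark.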

5. Distribute the installation to the other nodes in the cluster (details omitted; a minimal sketch follows)
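
A minimal sketch of the distribution step, assuming worker hostnames node2 and node3 (placeholders) and identical directory layouts on every node:

for host in node2 node3; do
    # /opt/py3 must already exist on the target host, owned by the hadoop user
    rsync -a /opt/py3/anaconda3/ $host:/opt/py3/anaconda3/
    rsync -a /usr/local/spark/ $host:/usr/local/spark/
done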

6. Test and verify

Write a small Python program, e.g. test.py:

from pyspark import SparkConf, SparkContext

# The master is supplied by spark-submit (--master yarn), so it is not hard-coded here
conf = SparkConf().setAppName("First_App")
sc = SparkContext(conf=conf)
# Count the lines of a file that exists on HDFS (replace /tmp/xxx测试路径 with a real path)
print(sc.textFile("/tmp/xxx测试路径").count())
sc.stop()

Submit it with spark-submit:

 /usr/local/spark/bin/spark-submit \
   --master yarn \
   --deploy-mode client \
   --executor-memory 2G \
   --num-executors 5 \
   --executor-cores 2 \
   --driver-memory 2G \
   --queue spark \
   /home/hadoop/test.py \
   env=dev  \
   log.level=info

If there is no exception, the deployment is successful!
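
Alternatively, a quicker interactive check is the PySpark shell, which should start with the Anaconda interpreter configured above:

/usr/local/spark/bin/pyspark --master yarn --deploy-mode client
# inside the shell, sc is predefined:
# sc.parallelize(range(100)).count()   # should return 100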
