PySpark cluster deployment
1. Install Spark 2.1.0 (omitted)
2. Install the Python3 environment
Download Anaconda3-4.2.0-Linux-x86_64.sh
Download address: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-4.2.0-Linux-x86_64.sh
Install Anaconda3
Switch to a user that has HDFS read/write permission and can run Spark programs (for example, the hadoop user), then run:
sh Anaconda3-4.2.0-Linux-x86_64.sh
Press Enter to start, then choose yes to accept the license agreement.
The bundled Python version is Python 3.5.2.
When prompted, enter an installation path outside the current user's home directory (my installation path is /opt/py3/anaconda3).
Note: before entering the path, change the owner of the /opt/py3 directory to the current user.
Choose no when asked whether to configure the global environment variables (they are set manually in step 4).
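The install steps above can be sketched as a shell session. The /opt/py3 path and the hadoop user are the ones assumed in this guide; adjust them to your environment:

```shell
# Prepare the install directory so the installing user owns it
sudo mkdir -p /opt/py3
sudo chown hadoop:hadoop /opt/py3

# Run the installer as the hadoop user: accept the license,
# enter /opt/py3/anaconda3 as the install prefix, and answer
# "no" when asked to modify the global environment variables
su - hadoop
sh Anaconda3-4.2.0-Linux-x86_64.sh
```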
3. Fix exceptions
If a Hive environment is configured in the $SPARK_HOME/conf directory (hive-env.sh and hive-site.xml are present), the Oracle JDBC driver must be added to the Spark classpath: put ojdbc14-10.2.0.4.0.jar into the $SPARK_HOME/jars directory.
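A sketch of that copy, assuming the jar has already been downloaded to the current directory:

```shell
# Copy the Oracle JDBC driver into Spark's jar directory so it
# lands on the classpath of the driver and executors
cp ojdbc14-10.2.0.4.0.jar $SPARK_HOME/jars/
```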
ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
Add the following to spark-env.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/cloudera/parcels/HADOOP_LZO-0.4.15-1.gplextras.p0.123/lib/hadoop/lib/native:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
4. Configure environment variables
#spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#python3
export PYSPARK_PYTHON=/opt/py3/anaconda3/bin/python
export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
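After reloading the shell profile, a quick sanity check (using the paths chosen above) confirms which interpreter Spark will pick up:

```shell
# Reload the profile so the exports above take effect
source ~/.bashrc

# The Anaconda build installed earlier reports Python 3.5.2
$PYSPARK_PYTHON --version

# PYTHONPATH should include Spark's python dir and the py4j-*-src.zip entry
echo $PYTHONPATH
```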
5. Distribute replication to other nodes in the cluster (omitted)
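The distribution step can be sketched with rsync. The hostnames node2 and node3 are placeholders for your worker nodes, and the paths are the ones used throughout this guide:

```shell
# Copy the Python environment, the Spark configuration, and the
# shell profile to each worker node so every node resolves the
# same PYSPARK_PYTHON and PYTHONPATH
for host in node2 node3; do
  rsync -a /opt/py3/anaconda3/ ${host}:/opt/py3/anaconda3/
  rsync -a /usr/local/spark/conf/ ${host}:/usr/local/spark/conf/
  scp ~/.bashrc ${host}:~/
done
```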
6. Test verification
Write a small Python program, e.g. test.py:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("First_App")
sc = SparkContext(conf=conf)
# Replace with an HDFS path that exists in your cluster
print(sc.textFile("/tmp/xxx_test_path").count())
sc.stop()
Submit to run
/usr/local/spark/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 2G \
--num-executors 5 \
--executor-cores 2 \
--driver-memory 2G \
--queue spark \
/home/hadoop/test.py \
env=dev \
log.level=info
If no exception is thrown, the deployment succeeded!