First, background
This article helps answer the following questions:
- What is Spark on YARN?
- How does a PySpark application work?
- How does a PySpark application run on YARN?
Second, practice
- Package the Python environment
cd path_to_python
* Note: change into the Python directory before packaging. Otherwise the entire path to the Python installation is stored in the archive, and after it is unpacked on the workers there will be a long prefix in front of the python path, so the interpreter cannot be located correctly.
zip -r path_to_pythonzip/python_user.zip ./*
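The note above can be demonstrated with a short sketch (the paths here are temporary and hypothetical, standing in for path_to_python): adding entries with paths relative to the environment directory is equivalent to running `zip -r ... ./*` from inside it, so the archive unpacks to bin/..., lib/... with no long prefix.

```python
import os
import tempfile
import zipfile

# Build a toy "Python installation" to package (hypothetical layout).
tmp = tempfile.mkdtemp()
bindir = os.path.join(tmp, "bin")
os.makedirs(bindir)
open(os.path.join(bindir, "python"), "w").close()

archive = os.path.join(tmp, "python_user.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for root, _, files in os.walk(bindir):
        for name in files:
            full = os.path.join(root, name)
            # arcname relative to tmp == zipping with ./* from inside tmp,
            # so the entry is stored as bin/python, not /long/prefix/bin/python.
            zf.write(full, arcname=os.path.relpath(full, tmp))

with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())  # relative entries only
```

If the archive were created from outside the directory, every entry would carry the absolute prefix, which is exactly the error the note warns about.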
- Upload the Python environment to HDFS
hadoop fs -put python_user.zip
- Modify the Spark configuration file
Copy spark-defaults.conf and register the archive, so that during spark-submit the Python bundle is automatically distributed to every worker node:
cp spark-defaults.conf spark-user.conf
# modify the relevant configuration
spark.yarn.dist.archives path_to_hdfs/python_user.zip#python
* Note: the trailing #python must not be deleted. It is the alias under which the zip is unpacked on each node, so the unpacked directory is named python. The python path configured for PySpark relies on this name to find the correct interpreter.
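Putting these pieces together, spark-user.conf might contain lines like the following sketch. The HDFS path is the placeholder from above; spark.pyspark.python is a standard Spark configuration key (Spark 2.1+), and the ./python prefix is assumed to match the #python alias of the unpacked archive:

spark.yarn.dist.archives   path_to_hdfs/python_user.zip#python
spark.pyspark.python       ./python/bin/python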
- Modify the submission script
#!/bin/bash
spark-submit --master yarn \
    --driver-memory 4G \
    --executor-memory 12G \
    --properties-file conf/spark-user.conf \
    --py-files other_dependence.py \
    main.py
Third, the results
A simple way to verify the setup is to run the job and print the gensim version from the shipped Python environment.
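As a sketch, main.py could perform this verification as follows. The structure is hypothetical (the source does not show main.py); the idea is to report the interpreter path and gensim version from inside an executor task, which only reflects the bundled environment when the job is submitted with the configuration described above.

```python
# Hypothetical sketch of main.py: confirm which Python the executors use
# and whether gensim is importable there.
def env_report():
    import sys
    try:
        import gensim
        version = gensim.__version__
    except ImportError:
        version = None  # gensim missing: the bundled environment was not used
    return sys.executable, version

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        # pyspark is put on the path by spark-submit on the cluster;
        # it may not be importable on a local machine.
        print("pyspark not available here; submit this script with spark-submit")
    else:
        spark = SparkSession.builder.appName("env_check").getOrCreate()
        # Run the check in a single executor task, not on the driver.
        print(spark.sparkContext.parallelize([0], 1)
                   .map(lambda _: env_report()).collect())
        spark.stop()
```

If everything is wired up correctly, the reported interpreter path ends in ./python/bin/python (the alias from the archives setting) and the gensim version string is printed.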