A guide to submitting PySpark jobs to a cluster with spark-submit and bringing in virtual-environment dependencies

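There are plenty of guides online for submitting Scala Spark jobs, and the official documentation covers spark-submit in considerable detail. Submitting Python jobs gets far less attention, and with so little material to consult there are many pitfalls to run into.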

The official documentation has the following passage on submitting jobs, but a first-time user will still be left confused:

Bundling Your Application’s Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

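As you can see, with a JVM language such as Java or Scala we can easily package the dependencies into a .jar and then submit to the cluster the way the official documentation recommends. For example: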

sudo -u hdfs spark-submit \
  --class "Excellent" \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 1 \
  /home/zhizhizhi/sparktry_2.11-0.1.jar

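The main class is Excellent, YARN does the scheduling, and the job runs in cluster mode. The remaining options set the driver memory, the executor memory, and the number of cores per executor.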

Submitting a Python job to YARN looks much the same. We can use:

/etc/alternatives/spark-submit \
--master yarn \
--deploy-mode cluster \
--name md_day_dump_user \
--conf "spark.pyspark.driver.python=/home/uther/miniconda2/envs/uther/bin/python2.7" \
--conf "spark.pyspark.python=/home/uther/miniconda2/envs/uther/bin/python2.7" \
--py-files /home/uther/uther/uther.zip \
/home/uther/uther/spark_run/md_day_dump_users.py

Now let's talk about the pitfalls in this command.

First, note that I explicitly invoke /etc/alternatives/spark-submit here. Without the explicit path, the shell would by default pick up the spark-submit that came bundled when pyspark was installed.

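This matters a lot. Our cluster is deployed with CDH, so CDH has already configured many of the environment variables and dependencies for us. If you use your own spark-submit you have to configure all of that yourself, which can cause many problems, for example not being able to connect directly to the target Hive.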

By default the shell resolves to:

(uther) [uther@zed-2 ~]$ which spark-submit
~/miniconda2/envs/uther/bin/spark-submit

This is very hard to notice, so be careful.
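One way to catch this before submitting is to check which spark-submit the current PATH resolves to. The following is only a minimal sketch: the function name check_spark_submit is hypothetical, and /etc/alternatives/spark-submit is simply the path used in this post, so it is an assumption for any other cluster.

```python
import os
import shutil


def check_spark_submit(expected, search_path=None):
    """Return (resolved, matches) for the spark-submit found on the search path.

    `expected` is the launcher you intend to use. `search_path` defaults to
    the current PATH. Symlinks are resolved on both sides before comparing.
    """
    resolved = shutil.which("spark-submit", path=search_path)
    if resolved is None:
        return None, False
    matches = os.path.realpath(resolved) == os.path.realpath(expected)
    return resolved, matches


if __name__ == "__main__":
    # Path taken from this post; adjust it for your own cluster (assumption).
    found, ok = check_spark_submit("/etc/alternatives/spark-submit")
    print("resolved:", found, "| matches expected:", ok)
```

If the second value is False while a conda environment is active, the shell is most likely resolving to the spark-submit installed alongside pyspark.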

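Running in cluster mode seems to have a pitfall too. From the material I have read, with cluster scheduling the application master may well be allocated on a different machine, which means that machine needs the same environment as well, otherwise the job may fail. It may report something like the error below, but I am not sure about this, because in my recent runs the job seems to have been assigned to the submitting node every time. I will keep observing.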

thread "main" java.io.FileNotFoundException: File
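If the environment does have to exist on every node, a small check run on each machine can confirm that the interpreter path is present and executable there. This is a sketch under that assumption, not something the original setup includes; the function name interpreter_available is hypothetical, and how the script reaches each node (ssh, a config-management tool, and so on) is left open.

```python
import os
import sys


def interpreter_available(path):
    """Return True if `path` is an existing file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Pass the interpreter path as the first argument; without one, the
    # interpreter running this script is checked instead.
    target = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    print(target, "OK" if interpreter_available(target) else "MISSING")
```

Run as the user that YARN executes containers under, this also surfaces permission problems on the environment directory.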

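The other step, and the most critical one, is specifying the virtual environment, which can be done with options like these: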

--conf "spark.pyspark.driver.python=/home/uther/miniconda2/envs/uther/bin/python2.7" \
--conf "spark.pyspark.python=/home/uther/miniconda2/envs/uther/bin/python2.7" \

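These two options tell the cluster which Python environment to use when running your program. Since a job will usually depend on a lot of packages, this approach is fairly elegant.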
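When the submission is driven from a script, the same command can be assembled as an argument list so that both interpreter settings always point at the same environment. This is a minimal sketch: build_submit_command is a hypothetical helper, and the paths in the example are the ones from this post. It only uses the spark-submit options already shown above.

```python
import shlex


def build_submit_command(spark_submit, app, python_bin, py_files=None,
                         name=None, master="yarn", deploy_mode="cluster"):
    """Build the spark-submit argument list for a PySpark application."""
    cmd = [spark_submit, "--master", master, "--deploy-mode", deploy_mode]
    if name:
        cmd += ["--name", name]
    # Use the same interpreter for the driver and for the executors.
    cmd += ["--conf", "spark.pyspark.driver.python=" + python_bin,
            "--conf", "spark.pyspark.python=" + python_bin]
    if py_files:
        # --py-files takes a comma-separated list.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    cmd = build_submit_command(
        "/etc/alternatives/spark-submit",
        "/home/uther/uther/spark_run/md_day_dump_users.py",
        "/home/uther/miniconda2/envs/uther/bin/python2.7",
        py_files=["/home/uther/uther/uther.zip"],
        name="md_day_dump_user",
    )
    # Print a shell-quoted form for inspection instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the printed command looks right.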

Spark also provides another option for bringing in packages:

--py-files /home/uther/uther/uther.zip

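This option means roughly the following: my program contains statements such as import uther.xxxx.xxx or from uther.xx.xx import xxx, so uther needs to be treated as an egg or zip package for import, just as a third-party package would be.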

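Once it is specified, Spark can locate the corresponding modules when it runs your code. This places some requirements on how your PySpark code is structured: try to package your logic as a Python package so that it is easy to reference.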
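For import uther to work from the archive, the uther directory itself has to sit at the root of the zip. A minimal sketch of building such an archive follows; zip_package is a hypothetical helper, and it assumes the package is a plain directory of .py files with __init__.py files in place.

```python
import os
import zipfile


def zip_package(package_dir, out_zip):
    """Zip `package_dir` so that the package directory sits at the archive root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(package_dir):
            # Skip bytecode caches; only source files are shipped.
            dirs[:] = [d for d in dirs if d != "__pycache__"]
            for fname in files:
                if fname.endswith(".py"):
                    full = os.path.join(root, fname)
                    # Store paths relative to the parent, e.g. uther/xx/xx.py.
                    zf.write(full, os.path.relpath(full, parent))
    return out_zip
```

With the layout used in this post, zip_package("/home/uther/uther/uther", "/home/uther/uther/uther.zip") would produce the file passed to --py-files, assuming the package lives in that directory.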

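It is also worth mentioning that submitting code can throw all sorts of strange errors, but they basically fall into two groups: permission problems and environment-variable problems.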

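For example, if you lack read/write permission on HDFS, you need to patiently go and add the relevant permissions on HDFS and study how those operations work.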

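If YARN is to invoke the relevant program, also remember to add the yarn user to the group of the program being invoked, then check the relevant permissions carefully.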

Reference:

https://zhuanlan.zhihu.com/p/43434216 Spark Python version dependencies and approaches to third-party modules

https://spark.apache.org/docs/2.2.0/submitting-applications.html Official Submitting Applications documentation
