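Troubleshooting a Spark job whose results could not be written to local disk

I. Symptoms

Today an offline (batch) Spark job scheduled by Azkaban failed with the following log: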

08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: ACCEPTED)
08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: RUNNING)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: 
08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 	 client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 diagnostics: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/

08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: FINISHED)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: 
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 diagnostics: User class threw exception: java.io.FileNotFoundException: /data/reports/xxx平台xx报表.xls (Permission denied)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.open0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.open(FileOutputStream.java:270)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$.do_business(DailyReport2Excel.scala:409)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$$anonfun$main$1.apply$mcVI$sp(DailyReport2Excel.scala:56)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$.main(DailyReport2Excel.scala:49)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel.main(DailyReport2Excel.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.lang.reflect.Method.invoke(Method.java:497)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 user: hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Exception in thread "main" org.apache.spark.SparkException: Application application_1640678855326_133429 finished with failed status
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Shutdown hook called
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0f892c24-67c2-424a-a4fd-f24853d3eef3
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-07eb9a13-8f22-4a6a-aac1-b5daa87980cd
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Process completed unsuccessfully in 152 seconds.
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
	at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:304)
	at azkaban.execapp.JobRunner.runJob(JobRunner.java:786)
	at azkaban.execapp.JobRunner.doRun(JobRunner.java:601)
	at azkaban.execapp.JobRunner.run(JobRunner.java:562)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
	at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:125)
	at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:296)
	... 8 more
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1 cause: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Finishing job DailyReport2Excel at 1644275372628 with status FAILED

II. Investigation

Reviewing recent operations on the cluster turned up the following facts:

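① The Spark job runs as the user hdfs. Note that "queue: root.users.hdfs" in the log is the YARN queue the job was submitted to, not a Linux group.
② Three new nodes were recently added to the CDH cluster, and the server 111.111.111.131 shown in the Azkaban log is one of them.
③ The historical Azkaban logs show that whenever YARN scheduled a Spark on YARN task onto one of these three servers, the task failed with this same error.
④ The job first writes its result to the /data directory on the local Linux filesystem and then uploads it to HDFS.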

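III. Hypothesis

When YARN schedules the Spark task onto one of the three newly added nodes, the task tries to write its result under the local Linux /data directory. On those nodes the hdfs user lacks write and execute permission on that directory, so the write fails.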
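
This hypothesis can be checked directly on a node before the job runs. The following is a minimal sketch, not part of the original job: can_write_dir is a hypothetical helper name, and /data/reports is the directory taken from the exception message. It should be run as the job's user (for example through sudo -u hdfs), because root bypasses ordinary permission checks.

```shell
# Hypothetical helper: succeeds only if the current user can create files in "$1".
# Creating a file in a directory needs both write (w) and search/execute (x)
# permission on that directory.
can_write_dir() {
    local dir="$1"
    [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}

# Report the result for the directory named in the exception message.
target="/data/reports"
if can_write_dir "$target"; then
    echo "OK: $target is writable by the current user"
else
    echo "FAIL: $target is not writable by the current user" >&2
fi
```

Running the same check on an old node and on a new node should show the difference without having to wait for the next scheduled run.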

IV. Verification

4.1 On an existing cluster node

Taking node cdh01 as an example:

[root@cdh01 ~]# id hdfs
uid=996(hdfs) gid=993(hdfs) groups=993(hdfs),0(root)

[root@cdh01 ~]# id hadoop
id: hadoop: no such user

4.2 On a newly added cluster node

Taking node cdh31 as an example:

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),993(hadoop)

[root@cdh31 ~]# id hadoop
id: hadoop: no such user

[root@cdh31 ~]# groups hdfs
hdfs : hdfs hadoop

As shown, the hdfs user on the new node is not a member of the root group, unlike on the old node, which is why it cannot write files under /data.
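
The comparison above was done by eye. As a sketch, it can also be scripted by parsing the groups= field of the id output. The helper names groups_from_id and missing_groups are hypothetical, and the parsing assumes the id output format shown above.

```shell
# Hypothetical helper: print the group names found in one line of `id` output,
# sorted, one per line.
groups_from_id() {
    printf '%s\n' "$1" |
        sed -n 's/.*groups=\([^ ]*\).*/\1/p' |
        tr ',' '\n' |
        sed 's/^[0-9]*(\(.*\))$/\1/' |
        sort
}

# Hypothetical helper: print the groups present in id output "$1" but absent from "$2".
missing_groups() {
    local other
    other="$(groups_from_id "$2")"
    groups_from_id "$1" | while IFS= read -r g; do
        printf '%s\n' "$other" | grep -qxF -- "$g" || printf '%s\n' "$g"
    done
}

# The two id outputs recorded above (cdh01 is an old node, cdh31 a new one).
old_node='uid=996(hdfs) gid=993(hdfs) groups=993(hdfs),0(root)'
new_node='uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),993(hadoop)'

missing_groups "$old_node" "$new_node"   # prints: root
```

Swapping the two arguments lists the groups that exist only on the new node.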

V. Solution

5.1 Add the hdfs user to the root group

[root@cdh31 ~]# usermod -a -G root hdfs

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root),993(hadoop)

5.2 Remove the hdfs user from the hadoop group

[root@cdh31 ~]# gpasswd -d hdfs hadoop
Removing user hdfs from group hadoop

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)

5.3 Check the resulting user-to-group mapping for hdfs

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)

[root@cdh31 data]# groups hdfs
hdfs : hdfs root

VI. Further notes

6.1 Appending a user to a group

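When adding a user to a group, never use -G on its own:

usermod -G groupA user
This replaces the user's supplementary group list, so the user is removed from every other supplementary group and remains a member of groupA only.
Add the -a option instead:

usermod -a -G groupA user
(FC4: usermod -G groupA,groupB,groupC user)
-a stands for append: the user is added to groupA without leaving any other group.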
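
As a sketch of how this can be made safe to repeat, the helper below appends a user to a group only when the user is not already a member. The names list_has_group and ensure_in_group are hypothetical; id -nG and usermod -a -G are the standard commands, and usermod must be run as root.

```shell
# Hypothetical helper: succeeds if group "$2" appears in the space-separated
# group list "$1" (the format printed by `id -nG`).
list_has_group() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Hypothetical helper: add user "$1" to supplementary group "$2" if needed.
ensure_in_group() {
    local user="$1" group="$2"
    if list_has_group "$(id -nG "$user")" "$group"; then
        echo "$user is already in $group"
    else
        # -a appends; without it, -G would replace the whole supplementary list.
        usermod -a -G "$group" "$user"
    fi
}

# Usage on a new node, as root:
#   ensure_in_group hdfs root
```

The membership test matches whole names only, so a group called had is not mistaken for hadoop.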

The options of usermod and their meanings:
Options:

  -c, --comment COMMENT         new value of the GECOS field
  -d, --home HOME_DIR           new home directory for the user account
  -e, --expiredate EXPIRE_DATE  set account expiration date to EXPIRE_DATE
  -f, --inactive INACTIVE       set password inactive after expiration
                                to INACTIVE
  -g, --gid GROUP               force use GROUP as new primary group
  -G, --groups GROUPS           new list of supplementary GROUPS
  -a, --append                  append the user to the supplemental GROUPS
                                mentioned by the -G option without removing
                                him/her from other groups
  -h, --help                    display this help message and exit
  -l, --login NEW_LOGIN         new value of the login name
  -L, --lock                    lock the user account
  -m, --move-home               move contents of the home directory to the new
                                location (use only with -d)
  -o, --non-unique              allow using duplicate (non-unique) UID
  -p, --password PASSWORD       use encrypted password for the new password
  -s, --shell SHELL             new login shell for the user account
  -u, --uid UID                 new UID for the user account
  -U, --unlock                  unlock the user account

To see which groups a user belongs to, use:

$ groups user
or inspect the group file:

$ cat /etc/group

6.2 How do I remove a user from a group?

gpasswd -d userName groupName

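Note: deluser (a Debian/Ubuntu tool) can also do this, but take care. Called with only a user name (deluser USER) it deletes the account itself, which is not what this scenario needs. Only the two-argument form removes the user from a group:

deluser USER GROUP  remove the user from a group
  Example: deluser mike students
Common options:
  --quiet | -q       do not print progress information to stdout
  --help | -h        show help
  --version | -v     show version and copyright information
  --conf | -c FILE   use the specified FILE as the configuration file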
Reposted from blog.csdn.net/liuwei0376/article/details/122819978