Troubleshooting notes: a Spark job's results could not be written to local disk

1. Symptom

An Azkaban-scheduled offline Spark job failed today; the error log is as follows:

08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: ACCEPTED)
08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: RUNNING)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: 
08-02-2022 07:09:32 CST DailyReport2Excel  INFO - 	 client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 diagnostics: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/

08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: FINISHED)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: 
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 diagnostics: User class threw exception: java.io.FileNotFoundException: /data/reports/xxx平台xx报表.xls (Permission denied)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.open0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.open(FileOutputStream.java:270)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$.do_business(DailyReport2Excel.scala:409)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$$anonfun$main$1.apply$mcVI$sp(DailyReport2Excel.scala:56)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel$.main(DailyReport2Excel.scala:49)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at com.david.report.DailyReport2Excel.main(DailyReport2Excel.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at java.lang.reflect.Method.invoke(Method.java:497)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	 user: hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Exception in thread "main" org.apache.spark.SparkException: Application application_1640678855326_133429 finished with failed status
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Shutdown hook called
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0f892c24-67c2-424a-a4fd-f24853d3eef3
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-07eb9a13-8f22-4a6a-aac1-b5daa87980cd
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Process completed unsuccessfully in 152 seconds.
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
	at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:304)
	at azkaban.execapp.JobRunner.runJob(JobRunner.java:786)
	at azkaban.execapp.JobRunner.doRun(JobRunner.java:601)
	at azkaban.execapp.JobRunner.run(JobRunner.java:562)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
	at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:125)
	at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:296)
	... 8 more
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1 cause: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Finishing job DailyReport2Excel at 1644275372628 with status FAILED

2. Investigation

Reviewing recent operations turned up the following facts:

① The Spark job runs as user hdfs, in the YARN queue root.users.hdfs;
② Three new nodes were recently added to the CDH cluster, and the server 111.111.111.131 shown in the Azkaban log is one of them;
③ The historical Azkaban execution logs show that any Spark-on-YARN task scheduled onto these three servers fails with the same error;
④ The job first writes its results to the local Linux /data directory, then uploads them to HDFS.
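Given these facts, the first thing to inspect on each node is the ownership and mode of the local output directory. A quick sketch (the actual mode of /data on these hosts is not shown in the log above):

```shell
# Print owner, group, mode, and name of the local output directory.
# The job writes its .xls files under /data before uploading them to HDFS.
stat -c '%U:%G %a %n' /data
```

Comparing this output between an existing node and a new node shows immediately whether the directory permissions differ.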

3. Inference

When YARN schedules the Spark job onto one of the three new nodes and the job tries to write its results to the local /data directory, the hdfs user lacks write and execute permission on that directory, so the write fails.
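This hypothesis can be tested directly by running a small check as the job user (for example `sudo -u hdfs bash check.sh /data`); `check_writable` is a helper introduced here for illustration, not part of the original job:

```shell
# Report whether the invoking user can create files in a directory:
# the directory must exist and grant both the write and the traverse
# (execute) permission bits to this user.
check_writable() {
  local dir=$1
  if [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "writable"
  else
    echo "not writable"
  fi
}

check_writable /data
```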

4. Verification

4.1 On an existing cluster node

Take node cdh01 as an example:

[root@cdh01 ~]# id hdfs
uid=996(hdfs) gid=993(hdfs) groups=993(hdfs),0(root)

[root@cdh01 ~]# id hadoop
id: hadoop: no such user

4.2 On the newly added cluster node

Take node cdh31 as an example:

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),993(hadoop)

[root@cdh31 ~]# id hadoop
id: hadoop: no such user

[root@cdh31 ~]# groups hdfs
hdfs : hdfs hadoop

As shown above, the hdfs user on the new nodes is not in the root group, so it cannot write files into the /data directory.
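The difference between the two nodes can also be computed mechanically from the `id` output captured above; `missing_groups` below is a throwaway bash helper (it uses process substitution), fed the exact strings from sections 4.1 and 4.2:

```shell
# Print the groups present in the first `id` output but absent from the
# second -- e.g. compare an existing node's hdfs user against a new node's.
missing_groups() {
  _names() { sed 's/.*groups=//' <<<"$1" | tr ',' '\n' | sed 's/^[0-9]*(\(.*\))$/\1/' | sort; }
  comm -23 <(_names "$1") <(_names "$2")
}

old='uid=996(hdfs) gid=993(hdfs) groups=993(hdfs),0(root)'
new='uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),993(hadoop)'
missing_groups "$old" "$new"   # prints "root"
```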

5. Solution

5.1 Add the hdfs user to the root group

[root@cdh31 ~]# usermod -a -G root hdfs

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root),993(hadoop)

5.2 Remove the hdfs user from the hadoop group

[root@cdh31 ~]# gpasswd -d hdfs hadoop
Removing user hdfs from group hadoop

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)

5.3 Confirm the updated group membership of the hdfs user

[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)

[root@cdh31 data]# groups hdfs
hdfs : hdfs root
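After applying the change, a small sanity check can confirm the membership on each node before re-running the job. `has_group` is an ad-hoc helper introduced here (it does a loose textual match on the `id` output, which is sufficient for this check):

```shell
# Succeed (exit 0) iff `id` reports the user with the given group name.
has_group() {
  id "$1" 2>/dev/null | grep -q "($2)"
}

if has_group hdfs root; then
  echo "hdfs is in the root group"
fi
```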

6. Additional notes

6.1 Adding a user to a group

When adding a user to a group, never use the bare form:

usermod -G groupA user

This replaces the user's entire supplementary-group list, leaving the user a member of groupA only. Always add the -a option:

usermod -a -G groupA user
(on Fedora Core 4: usermod -G groupA,groupB,groupC user)

-a stands for append: the user is added to groupA without being removed from any other group.

All of the command's options and their meanings:

Options:
  -c, --comment COMMENT         new value of the GECOS field
  -d, --home HOME_DIR           new home directory for the user account
  -e, --expiredate EXPIRE_DATE  set account expiration date to EXPIRE_DATE
  -f, --inactive INACTIVE       set password inactive after expiration
                                to INACTIVE
  -g, --gid GROUP               force use GROUP as new primary group
  -G, --groups GROUPS           new list of supplementary GROUPS
  -a, --append                  append the user to the supplemental GROUPS
                                mentioned by the -G option without removing
                                him/her from other groups
  -h, --help                    display this help message and exit
  -l, --login NEW_LOGIN         new value of the login name
  -L, --lock                    lock the user account
  -m, --move-home               move contents of the home directory to the new
                                location (use only with -d)
  -o, --non-unique              allow using duplicate (non-unique) UID
  -p, --password PASSWORD       use encrypted password for the new password
  -s, --shell SHELL             new login shell for the user account
  -u, --uid UID                 new UID for the user account
  -U, --unlock                  unlock the user account

To view the groups a user belongs to, use:

$ groups user

or inspect the group file directly:

$ cat /etc/group
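For auditing many nodes at once, group membership can also be read straight from copies of each node's /etc/group; `groups_of` below is an illustrative helper, not a standard command (note it lists only supplementary groups, not the primary group from /etc/passwd):

```shell
# List every group that names the user as a supplementary member in an
# /etc/group-style file (fields: name:passwd:gid:member,member,...).
groups_of() {
  local user=$1 file=${2:-/etc/group}
  awk -F: -v u="$user" '{
    n = split($4, m, ",")
    for (i = 1; i <= n; i++) if (m[i] == u) print $1
  }' "$file"
}
```

Fetching /etc/group from each node and running `groups_of hdfs <file>` on the copies gives a per-node membership report without logging in interactively.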

6.2 How do I remove a user from a group?

gpasswd -d userName groupName

Note: deluser (Debian/Ubuntu) removes the user from a group when given both a user and a group name, but called with only a user name it deletes the account entirely; that is not what this scenario calls for, so use it with care:

deluser USER GROUP  remove the user from a group
  e.g.: deluser mike students
Common options:
  --quiet | -q      do not send process information to stdout
  --help | -h       display help information
  --version | -v    display version and copyright
  --conf | -c FILE  use FILE as the configuration file

Origin blog.csdn.net/liuwei0376/article/details/122819978