When running Spark on YARN, every job submission uploads Spark's jar files to HDFS and the client logs a warning. Let's analyze the problem.
1. Log phenomenon
Spark on YARN client log output:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
INFO yarn.Client: Uploading resource file:/tmp/spark-a7f4a566-9d21-43cf-8388-41698316838e/__spark_libs__855332709144911684.zip ->
hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
INFO yarn.Client: Uploading resource file:/tmp/spark-a7f4a566-9d21-43cf-8388-41698316838e/__spark_conf__9173892461601825482.zip ->
hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip
The message says that because neither spark.yarn.jars nor spark.yarn.archive is set, the YARN client packages the jars under SPARK_HOME, uploads the resulting archive to the application's staging directory on HDFS, and deletes it again when the application finishes. These files are relatively large, so repeating the upload on every submission wastes both time and resources.
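For reference, the two properties mentioned in the warning differ: spark.yarn.jars takes a list of jar locations (globs are allowed), while spark.yarn.archive takes a single archive whose root directory contains the jars. A spark-defaults.conf sketch showing the two alternatives (set only one; the HDFS paths here are assumptions):

```
# Option A: individual jars on HDFS (glob allowed)
spark.yarn.jars        hdfs://mycluster/lib/jars/*.jar
# Option B: a single archive with all jars at its root
spark.yarn.archive     hdfs://mycluster/lib/sparkjars.zip
```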
We check the size of these two files:
hadoop fs -du -s -h hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
246.6 M 739.7 M hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
hadoop fs -du -s -h hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip
228.7 K 686.0 K hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip
2. Solution:
1. Package the jars under $SPARK_HOME/jars into a single zip archive (many small files would hurt HDFS performance)
cd $SPARK_HOME/jars
zip sparkjars.zip *.jar
2. Upload the archive to HDFS
hadoop fs -put sparkjars.zip /lib
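The upload can be checked with the same hadoop fs -du command used earlier (the size will vary by Spark version):

```shell
hadoop fs -du -s -h /lib/sparkjars.zip
```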
3. Specify the archive location in spark-defaults.conf, or pass it with --conf at submission time
vi spark-defaults.conf
spark.yarn.archive hdfs://mycluster/lib/sparkjars.zip
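The same setting can also be passed per job with --conf instead of editing spark-defaults.conf; a sketch using Spark's bundled SparkPi example (the example jar path may differ by version):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs://mycluster/lib/sparkjars.zip \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
```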
3. Result verification
INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://mycluster/lib/sparkjars.zip
The log shows that the archive is not uploaded again: since the source and destination file systems are the same, YARN references the jars already on HDFS directly.