Spark on YARN uploads a large number of JARs on every submission

When using Spark on YARN, the Spark JAR packages are uploaded to HDFS again for every application, and a warning is printed. Let's analyze the problem.



1. The symptom in the logs

Spark on YARN client log output:

WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
INFO yarn.Client: Uploading resource file:/tmp/spark-a7f4a566-9d21-43cf-8388-41698316838e/__spark_libs__855332709144911684.zip -> 
hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
INFO yarn.Client: Uploading resource file:/tmp/spark-a7f4a566-9d21-43cf-8388-41698316838e/__spark_conf__9173892461601825482.zip -> 
hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip

The warning says that neither spark.yarn.jars nor spark.yarn.archive is set, so yarn.Client falls back to zipping the libraries under $SPARK_HOME, uploading the archive to the application's staging directory on HDFS, and cleaning it up again once the application finishes. These files are fairly large, so every submission wastes both time and cluster resources.
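For context, nothing special in the submission triggers this; an ordinary client-mode job like the sketch below produces the same warning as long as neither property is set (the class and application JAR names are placeholders):

# Hypothetical submission; com.example.MyApp and my-app.jar are placeholders
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar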

We check the size of these two files:

hadoop fs -du -s -h hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
246.6 M  739.7 M  hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_libs__855332709144911684.zip
 
hadoop fs -du -s -h hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip
228.7 K  686.0 K  hdfs://mycluster/user/root/.sparkStaging/application_1588347637771_0011/__spark_conf__.zip

2. Solution

1. Package the JARs under spark/jars into a single zip archive (a single archive avoids putting many small files on HDFS, which would hurt its performance)

cd $SPARK_HOME/jars
zip sparkjars.zip *.jar
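Optionally, sanity-check the archive before uploading it; a minimal check, assuming the zip was created in the current directory:

# List the archive summary to confirm the jars were actually packed
unzip -l sparkjars.zip | tail -n 3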

2. Upload the archive to HDFS

hadoop fs -mkdir -p /lib
hadoop fs -put sparkjars.zip /lib
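To confirm the upload, list the file on HDFS (the exact size depends on your Spark distribution):

# Verify the archive is on HDFS and roughly the expected size
hadoop fs -ls -h /lib/sparkjars.zip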

3. Point spark.yarn.archive at the uploaded archive in spark-defaults.conf, or pass it with --conf when submitting (an example of the --conf form follows the config line below)

vi $SPARK_HOME/conf/spark-defaults.conf

spark.yarn.archive                  hdfs://mycluster/lib/sparkjars.zip
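Equivalently, if you would rather not edit spark-defaults.conf, the same property can be supplied per job with --conf; a sketch with placeholder class and JAR names:

# One-off override instead of spark-defaults.conf; MyApp/my-app.jar are placeholders
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs://mycluster/lib/sparkjars.zip \
  --class com.example.MyApp \
  my-app.jar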

3. Verifying the result

INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://mycluster/lib/sparkjars.zip

The log shows that the Spark libraries are no longer copied into the staging directory: because the source and destination file systems are the same, the archive already sitting on HDFS is referenced directly, and the repeated upload disappears.
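As an aside, the other property mentioned in the warning, spark.yarn.jars, achieves the same effect with individual JARs instead of one archive. A sketch, assuming the JARs are uploaded unpacked to a hypothetical /lib/spark-jars directory:

# Alternative to the archive: upload the jars individually
hadoop fs -mkdir -p /lib/spark-jars
hadoop fs -put $SPARK_HOME/jars/*.jar /lib/spark-jars/
# then in spark-defaults.conf:
# spark.yarn.jars    hdfs://mycluster/lib/spark-jars/*.jar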


Origin blog.csdn.net/qq_43081842/article/details/109556624