Spark optimization

1: "Make the best use of everything", but after allocating multiple machines to spark, you need to configure the spark- submit shell as follows:

/usr/local/spark/bin/spark-submit \
--class com.spark.test.Top3UV \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /usr/local/hive/conf/hive-site.xml \
--driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.17.jar \
/usr/local/jars/spTest-1.0-SNAPSHOT-jar-with-dependencies.jar \

With 3 CPU cores per executor and 3 executors, the total number of CPU cores is 9. It is recommended to set the task parallelism to 2 to 3 times the total number of CPU cores, because it is difficult to ensure that all tasks complete at the same time; the extra tasks keep every core busy while stragglers finish.
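For example, with 9 total cores a parallelism of 18 to 27 fits the 2x-3x rule. A minimal sketch of setting this in application code, assuming a Scala Spark program (the object name and the value 27 are illustrative, not from the original article):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    // 3 executors x 3 cores = 9 total cores; 3x that is 27 tasks
    val conf = new SparkConf()
      .setAppName("ParallelismExample")
      .set("spark.default.parallelism", "27")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

The same value can also be passed from the shell script above by adding --conf spark.default.parallelism=27 to the spark-submit command line.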
2: RDDs that are reused should be cached (persisted), for example with StorageLevel.MEMORY_AND_DISK_SER_2.

Choose from:
(1) memory-only cache; (2) memory-and-disk cache; (3) serialized cache; (4) replicated cache, marked by the _2 suffix, which keeps a second copy of each partition in case of data loss. See the sketch below.
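A minimal sketch of caching a reused RDD with this storage level, assuming a Scala program (the input path and the ERROR filter are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheExample"))
    val lines = sc.textFile("hdfs:///data/access.log") // hypothetical path

    // Persist because the RDD feeds two actions: serialized to save memory,
    // spills to disk when memory is full, and the _2 suffix keeps a second
    // replica so a lost partition does not have to be recomputed.
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

    val total  = lines.count()                             // first action
    val errors = lines.filter(_.contains("ERROR")).count() // second action, served from cache
    println(s"total=$total errors=$errors")

    sc.stop()
  }
}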

 



