Spark - small tips for performance optimization

1. Spark now mainly promotes the Dataset API, and more and more operators are available on Datasets. Datasets go through Spark's built-in optimization engine (Catalyst), but they are less flexible to manipulate than RDDs, so power users may still prefer to drop down to the RDD API.
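For illustration, a minimal sketch of the two APIs, assuming Spark 2.x with Scala; the Sale case class and the sample data are made up for this example:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record type, not from the original post.
case class Sale(product: String, amount: Double)

object DatasetVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 1.0), Sale("b", 2.0), Sale("a", 3.0)).toDS()

    // Dataset operations are planned by the Catalyst optimizer.
    sales.groupBy("product").sum("amount").show()

    // Escape hatch for power users: drop to the underlying RDD when the
    // Dataset API is not flexible enough.
    val pairs = sales.rdd.map(s => (s.product, s.amount))
    println(pairs.reduceByKey(_ + _).collect().toSeq)

    spark.stop()
  }
}
```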

2. reduceByKey vs. groupByKey: reduceByKey performs the reduce aggregation on the map side first, while groupByKey does no pre-aggregation, so every record is transmitted to the reduce side and far more data is shuffled.
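A minimal sketch of the difference, with made-up data:

```scala
import org.apache.spark.sql.SparkSession

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reduce-vs-group").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // reduceByKey combines values per key on the map side first,
    // so only partial sums cross the shuffle.
    val viaReduce = pairs.reduceByKey(_ + _)

    // groupByKey ships every (key, value) pair across the shuffle
    // before anything is aggregated.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    println(viaReduce.collect().toSeq) // e.g. (a,4), (b,6); order may vary
    println(viaGroup.collect().toSeq)

    spark.stop()
  }
}
```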

3. coalesce: use coalesce to reset the number of partitions. After a filter operator, the amount of data per partition can vary widely and many partitions end up nearly empty (data fragmentation). Use coalesce(numPartitions, shuffle = true) to set a new, generally smaller, partition count.
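A minimal sketch of compacting partitions after a selective filter; the data size and partition counts are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceAfterFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 200 partitions to start with.
    val nums = sc.parallelize(1 to 1000000, 200)

    // The filter keeps about 1% of the rows, leaving 200 nearly empty partitions.
    val filtered = nums.filter(_ % 100 == 0)

    // Rebalance into fewer partitions; shuffle = true redistributes the rows
    // evenly (equivalent to repartition(20)).
    val compacted = filtered.coalesce(20, shuffle = true)

    println(s"partitions before: ${filtered.getNumPartitions}, after: ${compacted.getNumPartitions}")
    spark.stop()
  }
}
```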

4. Spark memory parameter settings:

  spark.shuffle.memoryFraction: defaults to 20% of executor memory; if the computation is shuffle-heavy, it can be increased.

  spark.storage.memoryFraction: defaults to 60%, used to cache RDD data; if the computation depends on cached data, the proportion can be increased (a configuration sketch follows below).
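A minimal configuration sketch, with illustrative values. Note that these two keys belong to the legacy static memory manager (Spark versions before 1.6); later releases use the unified spark.memory.fraction setting instead, unless legacy mode is explicitly enabled:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MemoryFractions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-fractions")
      // Shuffle-heavy job: raise the shuffle fraction above its 0.2 default
      // and lower the storage fraction from its 0.6 default. Values are illustrative.
      .set("spark.shuffle.memoryFraction", "0.3")
      .set("spark.storage.memoryFraction", "0.5")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... shuffle-heavy job here ...
    spark.stop()
  }
}
```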

5. Parallelism setting:

   spark.default.parallelism: it is recommended to set this to at least 100, ideally around 700. executor-cores determines how many tasks run in parallel within each executor; if the default parallelism is too small, the executors' parallel capacity cannot be fully utilized.

  Official recommendation: set the number of tasks to 2-3 times the total number of CPU cores in the Spark application. For example, if the application can use 150 CPU cores in total, the number of tasks should be set to roughly 300-500.

  In addition, if machine memory is plentiful but CPU is tight, parallelism can be set lower; if CPU is plentiful but memory is tight, parallelism can be set higher, because executor memory is shared among its concurrently running tasks, and with higher parallelism each task processes less data and therefore needs less memory (see the sketch after this item).
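A minimal sketch tying parallelism to the cluster's core count; the executor numbers (50 executors x 3 cores = 150 cores) mirror the example above and are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ParallelismTuning {
  def main(args: Array[String]): Unit = {
    // Assumed cluster shape: num-executors * executor-cores = 50 * 3 = 150 cores.
    val totalCores = 50 * 3

    val conf = new SparkConf()
      .setAppName("parallelism-tuning")
      // 2-3x the total core count keeps every core busy while tasks finish unevenly.
      .set("spark.default.parallelism", (totalCores * 2).toString)

    // Master/deploy settings are expected to come from spark-submit.
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    // Operators that do not pass an explicit numPartitions pick up this default.
    println(s"default parallelism: ${sc.defaultParallelism}")
    spark.stop()
  }
}
```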



To be continued…

