Spark Learning from 0 to 1 (10) - Spark Tuning (2): Data Locality

1. Data locality

1.1 Data locality levels

1.1.1 PROCESS_LOCAL (process-local)

The data the task needs to compute is in the memory of the same process (Executor) that runs the task.


1.1.2 NODE_LOCAL (node-local)

  1. The data the task computes is on the disk of the node where the task runs.
  2. The data the task computes is in the memory of another Executor process on the same node.


1.1.3 NO_PREF (no preference)

The data the task computes has no locality preference, for example when it lives in an external relational database such as MySQL.


1.1.4 RACK_LOCAL (rack-local)

The data the task computes is on the disk, or in the memory of an Executor process, of a different node in the same rack.


1.1.5 ANY

The data the task computes is on a different rack, so access spans racks.
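The five levels above form a strict ordering from best (no data movement) to worst (cross-rack transfer). A minimal Python sketch of that ordering (an illustration of the concept, not Spark's actual scheduler code):

```python
# Illustrative sketch: the five data-locality levels, ordered from best
# (data already in the Executor's memory) to worst (cross-rack access).
from enum import IntEnum

class TaskLocality(IntEnum):
    PROCESS_LOCAL = 0  # data in the memory of the task's own Executor
    NODE_LOCAL = 1     # data on the same node (disk, or another Executor)
    NO_PREF = 2        # no locality preference, e.g. data in MySQL
    RACK_LOCAL = 3     # data on another node in the same rack
    ANY = 4            # data on a different rack

def better(a: TaskLocality, b: TaskLocality) -> bool:
    """Lower value means better locality (less data movement)."""
    return a < b

print(better(TaskLocality.PROCESS_LOCAL, TaskLocality.RACK_LOCAL))  # True
```

Spark's scheduler prefers the lowest level it can satisfy and only falls back to worse levels when it has to, as described in the next section.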

2. Spark data locality tuning

(Figure: Spark data-locality fallback process, spark_data_local.png; image missing from the original post.)

When scheduling tasks in Spark, the TaskScheduler distributes them according to where the data is located, preferably to the node that holds the data. If a dispatched task has not started within the default 3 s wait, the TaskScheduler resends it to the same Executor, retrying up to 5 times. If the task still cannot run, the TaskScheduler lowers the data-locality level by one and dispatches the task again.

As the figure above shows, the scheduler first tries the PROCESS_LOCAL level. If it retries 5 times, waiting 3 seconds each time, and still fails, it assumes that Executor's computing resources are full and drops one level to NODE_LOCAL. If 5 more retries with 3-second waits still fail, it drops another level to RACK_LOCAL. At that point the data must be transferred over the network, which reduces execution efficiency.
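The retry-then-downgrade behaviour described above can be sketched as a toy simulation (a simplified model of the description in this post, not Spark's real scheduler; `schedule` and `task_runs_at` are made-up names):

```python
# Toy simulation of the fallback process: try each locality level up to
# 5 times with a 3 s wait per attempt, downgrading one level after the
# retries at the current level are exhausted.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]
MAX_RETRIES = 5
WAIT_SECONDS = 3

def schedule(task_runs_at):
    """Return (level_used, seconds_waited). task_runs_at(level) reports
    whether the task could be launched at that locality level."""
    waited = 0
    for level in LEVELS:
        for _ in range(MAX_RETRIES):
            if task_runs_at(level):
                return level, waited
            waited += WAIT_SECONDS  # wait 3 s, then retry this level
    return "ANY", waited  # last resort: run anywhere

# Example: the Executor holding the data is busy, but the node is free.
level, waited = schedule(lambda lvl: lvl != "PROCESS_LOCAL")
print(level, waited)  # NODE_LOCAL 15  (5 retries * 3 s before downgrading)
```

This makes the trade-off visible: every downgrade avoided costs up to `MAX_RETRIES * WAIT_SECONDS` seconds of waiting, which is exactly what the tuning knobs in the next section control.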

2.1 How to raise the data locality level?

You can increase the wait time before each downgrade (the default is 3 s), for example to some multiple of 3 s, via the following properties, and observe the effect in the Web UI:

  • spark.locality.wait
  • spark.locality.wait.process
  • spark.locality.wait.node
  • spark.locality.wait.rack

Note: the wait time must not be set too large, or tuning data locality becomes a case of putting the cart before the horse: every task may reach the highest locality level, yet the total execution time of the Application is lengthened.
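For illustration, here is one way to assemble these properties into `spark-submit --conf` flags. The `6s` values are made-up examples, not recommendations; tune them for your own cluster:

```python
# Build spark-submit --conf flags that raise the locality waits above
# the 3 s default. The values below are illustrative only.
locality_conf = {
    "spark.locality.wait": "6s",          # global default for all levels
    "spark.locality.wait.process": "6s",  # wait before leaving PROCESS_LOCAL
    "spark.locality.wait.node": "6s",     # wait before leaving NODE_LOCAL
    "spark.locality.wait.rack": "6s",     # wait before leaving RACK_LOCAL
}
args = [f"--conf {key}={value}" for key, value in locality_conf.items()]
print(" ".join(args))
```

The per-level properties override `spark.locality.wait` for their level, so a common approach is to set only the global value first and refine per level if the Web UI shows tasks stuck at a poor locality level.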

2.2 How to check the data locality level?

Through the task logs or the Web UI, which shows each task's locality level on the stage detail page.


Source: blog.csdn.net/dwjf321/article/details/109056176