Spark data locality

Scenario:

        Before assigning tasks for each stage of the application, the Driver calculates which shard of data each task will process, i.e., which partition of the RDD. Spark's task assignment algorithm prefers to place each task exactly on the node that holds the data it needs to compute; in that case no data has to travel over the network. In practice, though, things do not always work out: a task may not get the chance to run on the node where its data lives, for example because that node's computing resources are already fully occupied. In that case Spark usually waits for a while, by default 3 seconds (not an absolute value; different locality levels can be given different waiting times), and retries, by default 5 times. If in the end it really cannot wait any longer, it falls back to a poorer locality level, for example assigning the task to a node close to the one holding the data, and the computation runs there.
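
        To make the wait-then-fall-back idea concrete, here is a toy sketch of the decision described above. It is purely illustrative: the object name, the polling loop, and the simplified structure are my own, and Spark's real scheduler is event-driven rather than a polling loop like this.

    // Toy illustration only: not Spark's actual scheduler code.
    object LocalityFallbackSketch {
      // Locality levels from best to worst, as described in the text.
      val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
      val waitPerLevelMs = 3000L  // mirrors the 3s default of spark.locality.wait

      // Pick the level a task would be launched at, given a function that says
      // whether a free slot currently exists at a given locality level.
      def chooseLevel(hasFreeSlotAt: String => Boolean): String =
        levels.find { level =>
          val deadline = System.currentTimeMillis() + waitPerLevelMs
          var found = false
          while (!found && System.currentTimeMillis() < deadline) {
            found = hasFreeSlotAt(level)
            if (!found) Thread.sleep(100)  // poll; the real scheduler reacts to events
          }
          found
        }.getOrElse("ANY")  // give up on locality entirely
    }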

        In the second case, however, data transfer generally has to happen. The task asks the BlockManager on its own node for the data; the BlockManager finds that the data is not present locally, and through a getRemote() call on its TransferService (the network data-transfer component) it fetches the data from the BlockManager of the node that holds it, transmitting it back over the network to the node where the task is running.
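
        As a rough sketch of that read path (check the local BlockManager first, otherwise pull the block over the network), something like the following captures the idea. The trait and method names are hypothetical stand-ins, not Spark's internal BlockManager or transfer-service API.

    // Conceptual sketch only; the types below are hypothetical stand-ins.
    trait BlockStore {
      def getLocal(blockId: String): Option[Array[Byte]]   // local memory/disk
      def getRemote(blockId: String): Option[Array[Byte]]  // fetch over the network
    }

    def readBlock(store: BlockStore, blockId: String): Option[Array[Byte]] =
      store.getLocal(blockId) match {
        case some @ Some(_) => some                       // best case: no network I/O at all
        case None           => store.getRemote(blockId)   // falls back to a network transfer
      }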

        Naturally, we want to avoid anything like the second situation. The ideal case is that the task and the data are on the same node and the data is read directly from the local executor's BlockManager: pure memory, or at worst a little disk I/O. Once data has to be moved over the network, performance inevitably drops; heavy network transfer and disk I/O are both performance killers.

        If the data can be read right where it resides, that is the best situation: task and data sit inside the same executor process and the data is read at memory speed. If the resources of the machine holding the data stay occupied for longer than the wait time (3 seconds by default), the task is placed on another machine close to the data. The task then asks its own local BlockManager for the data; if it is not there, the local BlockManager fetches it from the BlockManager of the machine that holds the data, which may require a network transfer if the two are on different nodes. If the two executors happen to be on the same node, things are not too bad: the data only has to be passed between processes on one machine.

        There is one more situation, and it is the worst: pulling data across racks. That is very slow and has a considerable impact on performance.

 What are the data locality levels in Spark?

  • PROCESS_LOCAL: process-local. The code and the data are in the same process, that is, in the same executor; the task that computes the data runs in the executor whose BlockManager holds that data. This gives the best performance.
  • NODE_LOCAL: node-local. The code and the data are on the same node; for example, the data is an HDFS block on that node and the task runs in an executor on the same node, or the data and the task are in different executors on the same node, in which case the data has to be passed between processes.
  • NO_PREF: no preference. For the task it makes no difference where the data is read from, so no level is better or worse, for example when the data comes from a database.
  • RACK_LOCAL: rack-local. The data and the task are on two different nodes in the same rack, and the data has to be transferred between nodes over the network.
  • ANY: the data and the task may be anywhere in the cluster, not even in the same rack. This gives the worst performance. (A way to inspect an RDD's preferred locations is sketched just below.)
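
        These levels come from each partition's preferred locations. If you want to see what the preferred hosts of your own RDD look like, something along the following lines works in the spark-shell; it assumes a live SparkContext named sc and uses a placeholder HDFS path.

    // Assumes a running SparkContext `sc`; the input path is a placeholder.
    val rdd = sc.textFile("hdfs:///path/to/input")
    rdd.partitions.foreach { p =>
      // preferredLocations returns the hosts the scheduler would like to run
      // the task for this partition on (e.g. the HDFS block locations).
      val hosts = rdd.preferredLocations(p)
      println(s"partition ${p.index}: preferred hosts = ${hosts.mkString(", ")}")
    }
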
spark.locality.wait, the default is 3s

When should we adjust this parameter?

        Observe the Spark job's run log. When testing, it is recommended to use client mode first, so that the complete log can be seen directly on the local machine. The log contains lines such as "Starting task ..., PROCESS_LOCAL" or "NODE_LOCAL"; look at the data locality level of the majority of tasks.
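
        Besides reading the log lines by eye, the locality level of each task can also be tallied with a SparkListener, since it is exposed on the task's TaskInfo. A minimal sketch, assuming a live SparkContext named sc (the variable names are my own):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable

    // Count how many tasks finished at each locality level.
    val localityCounts = mutable.Map.empty[String, Int].withDefaultValue(0)

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        localityCounts.synchronized {
          // taskLocality is e.g. PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY
          localityCounts(taskEnd.taskInfo.taskLocality.toString) += 1
        }
    })

    // ... run the job, then print the tally:
    // localityCounts.foreach { case (level, n) => println(s"$level: $n") }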

        If most of them are PROCESS_LOCAL, there is no need to adjust anything. If many tasks turn out to be RACK_LOCAL or ANY, it is worth adjusting the data-locality wait time. Adjust it repeatedly: after each change, run the job again and check the logs to see whether the locality level of most tasks has improved, and whether the total running time of the Spark job has become shorter. Do not put the cart before the horse: if the locality level improves but the job runs longer because of the extra waiting, then do not keep that adjustment.

How to adjust?

  1. spark.locality.wait, the default is 3s; try 6s or 10s (a combined example is sketched after this list)

  2. By default, the waiting time for the following three parameters is the same as the one above, i.e. 3s

  3. spark.locality.wait.process

  4. spark.locality.wait.node

  5. spark.locality.wait.rack

  6. new SparkConf().set("spark.locality.wait", "10s")
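
        Putting these parameters together, a minimal sketch of setting them in code before creating the context might look like the following; the 6s value is only a starting point to experiment with, and giving the values an explicit unit avoids any ambiguity between milliseconds and seconds.

    import org.apache.spark.{SparkConf, SparkContext}

    // Example only: raise the locality waits, then tune from there.
    val conf = new SparkConf()
      .setAppName("locality-tuning-example")     // app name is arbitrary
      .set("spark.locality.wait", "6s")          // overall default wait
      .set("spark.locality.wait.process", "6s")  // wait for PROCESS_LOCAL
      .set("spark.locality.wait.node", "6s")     // wait for NODE_LOCAL
      .set("spark.locality.wait.rack", "6s")     // wait for RACK_LOCAL

    val sc = new SparkContext(conf)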

 


Origin blog.csdn.net/qq_32445015/article/details/101979094