【Spark】 In-Depth Understanding of Spark Data Locality

   

spark.locality.wait (default: 3s): How long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait is used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the wait for each level by setting spark.locality.wait.node, etc. You should increase this setting if your tasks are long and see poor locality, but the default usually works well.
spark.locality.wait.node (default: spark.locality.wait): Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).
spark.locality.wait.process (default: spark.locality.wait): Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.
spark.locality.wait.rack (default: spark.locality.wait): Customize the locality wait for rack locality.
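
As a concrete illustration, here is a minimal sketch of setting these four properties on a SparkConf. The app name and values are placeholders, not recommendations; each per-level property falls back to spark.locality.wait when it is not set explicitly.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-wait-demo")          // hypothetical app name
  .set("spark.locality.wait", "3s")          // base wait used by every level
  .set("spark.locality.wait.node", "0")      // per the docs: skip node locality
  .set("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL slots
  .set("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL slots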
Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data because code size is much smaller than data. Spark builds its scheduling around this general principle of data locality.

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible (see the caching sketch after this list)
NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
NO_PREF data is accessed equally quickly from anywhere and has no locality preference
RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
ANY data is elsewhere on the network and not in the same rack
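
To make PROCESS_LOCAL concrete: caching an RDD places its partitions in executor memory, so later actions can be scheduled where the cached blocks live. A minimal sketch, assuming an existing SparkContext sc and a hypothetical HDFS path:

val lines = sc.textFile("hdfs:///path/to/input")  // hypothetical input path
lines.cache()     // keep partitions in executor memory after first use
lines.count()     // first action reads from HDFS (NODE_LOCAL at best)
lines.count()     // later actions can hit the cache (PROCESS_LOCAL)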
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.

What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback between each level can be configured individually or all together in one parameter; see the spark.locality parameters on the configuration page for details. You should increase these settings if your tasks are long and see poor locality, but the default usually works well.
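
The fallback behaviour can be pictured with a small sketch. This is not Spark's actual scheduler code, just the idea that each expired wait interval relaxes the allowed locality by one level:

val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

// How far out the scheduler may place a task after `idleMs` milliseconds
// have passed with a per-level wait of `waitMs` milliseconds.
def allowedLevel(idleMs: Long, waitMs: Long): String = {
  val jumps = (idleMs / waitMs).toInt
  levels(math.min(jumps, levels.size - 1))
}

allowedLevel(0L, 3000L)     // "PROCESS_LOCAL": still waiting for the best slot
allowedLevel(7000L, 3000L)  // "RACK_LOCAL": two full waits have expired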

Adjusting the data locality wait time

Before assigning the tasks of each stage of the application, Spark's driver works out which piece of data, i.e. which partition of the RDD, each task needs to compute. Spark's task placement algorithm first tries to assign every task to the node holding the data it will compute, so that no data needs to be moved across the network.

In practice, however, things do not always go as hoped: a task may not get the chance to run on the node holding its data, for example because that node's computing resources and capacity are already fully used. In that case Spark waits for a while, 3s by default (and not as a single absolute timeout: the wait is applied per locality level), and if it ultimately cannot wait any longer it picks a worse locality level, for example assigning the task to a node close to the one holding the data, and runs the computation there.

In that second case, data transfer usually has to happen. The task asks the BlockManager on its own node for the data; the BlockManager finds the data is not present locally and, through a getRemote() call (using the BlockTransferService network component), fetches it from the BlockManager on the node that holds the data, transferring it back to the task's node over the network.
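
Conceptually the lookup looks like the sketch below. These are stand-in stubs, not the real BlockManager API, but they capture the local-first, remote-fallback order:

// Stubs standing in for the local read and the network transfer:
def getLocal(blockId: String): Option[Array[Byte]] = None   // local miss here
def getRemote(blockId: String): Option[Array[Byte]] = None  // network fetch

// Local-first lookup; only a miss pays the network cost of getRemote.
def getBlock(blockId: String): Option[Array[Byte]] =
  getLocal(blockId).orElse(getRemote(blockId))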

For us, the second case is of course the one to avoid. Ideally the task and its data sit on the same node and the data is read straight from the local executor's BlockManager: pure memory access, or at worst a little disk I/O. Once data has to be shipped over the network, performance will certainly drop; heavy network transfer and disk I/O are both performance killers.

When should we adjust this parameter?

Observe the Spark job's running log. When testing, it is recommended to use client mode first, so that the full log can be seen directly on the local machine.

The log will show lines like "Starting task ..." that include the locality level, e.g. PROCESS_LOCAL or NODE_LOCAL.

Observe the data locality level of the majority of tasks.

If most of them are PROCESS_LOCAL, there is no need to adjust anything.

If you find that many tasks run at NODE_LOCAL or ANY, it is worth adjusting the data locality wait time.

Tune iteratively: after each adjustment, run the job again and observe the log.

Check whether the locality level of most tasks has improved, and whether the total running time of the Spark job has been shortened.

Be careful not to get this backwards: if the locality level improves but the job's running time increases because of all the added waiting, the adjustment is not worth keeping.
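
A simple way to compare runs is to time the same action under different wait settings. A minimal sketch, where lines stands for whatever RDD your job operates on:

def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000L)  // elapsed milliseconds
}

val (_, elapsedMs) = timed { lines.count() }  // rerun per candidate wait value
println(s"job took $elapsedMs ms")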

How to adjust?

spark.locality.wait defaults to 3s; typical values to try are 6s or 10s.

By default, the wait times of the following three parameters are the same as the value above, i.e. all 3s:

spark.locality.wait.process

spark.locality.wait.node

spark.locality.wait.rack

val conf = new SparkConf()
  .set("spark.locality.wait", "10s")  // give the value a unit; a bare "10" is read as milliseconds

 

https://blog.csdn.net/xueyao0201/article/details/79670633

 
