How to configure resource parameters for big data jobs

In big data work, the most troublesome part is configuring runtime parameters: OOM and similar issues keep cropping up one after another, and many people are not clear on how resources should be sized. Let me walk you through it.

In today's big data projects, the core of the data cluster is Hadoop + Hive. However, we usually cannot use the native components directly when building a project; different vendors package these native components into their own products under their own names. But the packaging is inseparable from the original: no matter how it changes, the underlying technology is still the same native stack, so whether you are configuring it or using it, it is essentially the same. Of course, each vendor adds some usage conventions that fit its own product, but those are surface-level things and do not replace the core.

I will use the currently most popular batch-stream integrated Hive + Hadoop + Spark on YARN framework as the example. First of all, a production environment will almost certainly be fully distributed, unless it serves some special purpose. That usually means both the Hadoop cluster and the Spark cluster have at least three nodes, so the resources will certainly not be as small as in our own test environments.

For native Spark, a submitted job carries the following five parameter values by default. These five are the core; every other configuration either optimizes around these five or extends from them.

```
spark.executor.memory      default: 1g
spark.executor.cores       default: 1 core
spark.executor.instances   default: 2 instances
spark.yarn.am.memory       default: 512m
spark.yarn.am.cores        default: 1 core
```

These are the core settings. From top to bottom: each executor uses 1g of memory and 1 core, one job gets 2 executor instances, and the ApplicationMaster uses 512m of memory and 1 core.
From this configuration, you can first try to calculate for yourself how much resource the job will occupy in total.
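Before checking the answer, here is a minimal sketch of how these five values can be set explicitly when building a PySpark session against YARN. The values shown are just the stock defaults spelled out; in practice you would raise them per job:

```python
from pyspark.sql import SparkSession

# Spell out the five core defaults explicitly when submitting to YARN.
# Assumes a working YARN cluster and a pyspark installation.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("resource-demo")
    .config("spark.executor.memory", "1g")      # heap per executor
    .config("spark.executor.cores", "1")        # cores per executor
    .config("spark.executor.instances", "2")    # executors per job
    .config("spark.yarn.am.memory", "512m")     # ApplicationMaster heap
    .config("spark.yarn.am.cores", "1")         # ApplicationMaster cores
    .getOrCreate()
)
```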

If my guess is right, following that line of thinking most of you will arrive at 3 cores and 2.5g (2 executors × 1g + 512m for the ApplicationMaster ≈ 2.5g, and 2 + 1 = 3 cores). In fact, let me give you the correct answer first: once this job runs, it will occupy 3 cores and 7g of YARN resources. Below I will explain why.

Reason one: when running with the defaults above, each executor, on top of the 1g reserved for the tasks it will run, takes a bit more memory for its own operation and initialization, which means it actually applies for more than 1g.
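The article does not name the mechanism, but in stock Spark on YARN this extra memory is the per-container overhead, governed by spark.executor.memoryOverhead and defaulting to the larger of 384 MB or 10% of the executor memory. A sketch of that arithmetic, assuming those defaults:

```python
def executor_request_mb(executor_memory_mb: int) -> int:
    """Executor heap plus the default Spark-on-YARN container overhead."""
    overhead = max(384, int(executor_memory_mb * 0.10))  # default overhead rule
    return executor_memory_mb + overhead

print(executor_request_mb(1024))  # 1024 + 384 = 1408 MB, already more than 1g
```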

Reason two: if you are familiar with how MR works, you will know that the ApplicationMaster is not generated out of thin air; it is brought up in the executor of the first worker node that accepts the job. That is to say, the default configuration gives a job two executors, but counting the executor used to bootstrap the ApplicationMaster, a total of three executors are actually started during the job.

Reason three: in YARN, in order to make full use of resources, there are two related properties. One is the minimum resource allocation per request, which defaults to 1024 MB; the other is the normalization factor, which by default equals the minimum allocation. The job of the normalization factor is to carve resources into portions sized by the minimum allocatable amount, rounding every request up to a whole multiple of it.
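In stock YARN these two knobs correspond to yarn.scheduler.minimum-allocation-mb (default 1024 MB) and the normalization increment (a separate yarn.scheduler.increment-allocation-mb under the Fair Scheduler; the Capacity Scheduler simply rounds up to multiples of the minimum). A sketch of the rounding, assuming the defaults:

```python
import math

def normalize_mb(requested_mb: int, increment_mb: int = 1024) -> int:
    """Round a container request up to the next multiple of the increment,
    the way the YARN scheduler normalizes allocations."""
    return math.ceil(requested_mb / increment_mb) * increment_mb

print(normalize_mb(1408))  # -> 2048: a 1g executor plus overhead becomes a 2g container
print(normalize_mb(896))   # -> 1024: the 512m AM plus overhead becomes a 1g container
```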

It is precisely because of these three reasons that the job actually runs three executors, each of which requests more than 1g at instantiation; and because of the normalization factor, the requests get rounded up, so the executors finally come to 2 + 2 + 2 = 6g.

However, these 6g are not requested all at once. If you are interested, open the YARN UI right after submitting a job; if the job happens not to have started yet, you will see that in the initialization phase it first requests only the memory of one executor plus the ApplicationMaster's resources. Note that the first executor's cores are not requested during initialization, because they are not needed yet, and the ApplicationMaster asks for only 512m of memory, which the normalization factor rounds straight up to 1g. The result is that with the default configuration, the first resources requested in the initialization phase come to 1 executor + 1 core + 3g of memory.

After the ApplicationMaster finishes initializing, the executor it occupies is not killed just because its bootstrapping is done; it stays up to run the ApplicationMaster's necessary functions, while the other two executors that actually run tasks are created through the ApplicationMaster's negotiation with YARN. Each of them, because of the normalization factor, occupies 2g + 1 core, which brings the job's final footprint to 1 + 2 + 2 + 2 = 7g and 1 + 1 + 1 = 3 cores.
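Putting the three reasons together, here is a sketch reproducing the article's accounting for both the initialization phase and the final footprint. Note that it follows the author's model, in which a third executor hosts the ApplicationMaster, and assumes the default overhead and normalization rules described above:

```python
import math

MIN_ALLOC_MB = 1024  # yarn.scheduler.minimum-allocation-mb default

def normalize(mb: int) -> int:
    """Round a request up to the next multiple of the minimum allocation."""
    return math.ceil(mb / MIN_ALLOC_MB) * MIN_ALLOC_MB

def container(heap_mb: int) -> int:
    """Container size: heap plus default overhead, then normalized."""
    return normalize(heap_mb + max(384, int(heap_mb * 0.10)))

am_mb = container(512)     # 512 + 384 = 896   -> 1024 MB (1g)
exec_mb = container(1024)  # 1024 + 384 = 1408 -> 2048 MB (2g)

init_mb = am_mb + exec_mb              # initialization phase: 3072 MB = 3g, 1 core
final_mb = am_mb + 3 * exec_mb         # final: 1 + 2 + 2 + 2 = 7g
final_cores = 3                        # one core per executor
print(init_mb, final_mb, final_cores)  # 3072 7168 3
```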

This post has now covered how the core resource occupancy is calculated. As I said at the beginning, everything else is optimized around this idea, and I encourage you to broaden your configuration knowledge along the same lines. For example, Spark can actually run with an elastic number of executors, and you can also control the maximum and minimum that this elasticity will spawn.
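The article does not name the feature, but this elasticity presumably refers to Spark's dynamic allocation. A hedged sketch of the usual knobs, with illustrative values rather than recommendations:

```python
from pyspark.sql import SparkSession

# Dynamic allocation: let Spark grow and shrink the executor count
# between a configured floor and ceiling.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("elastic-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")  # floor
    .config("spark.dynamicAllocation.maxExecutors", "6")  # ceiling
    .config("spark.shuffle.service.enabled", "true")      # required on older Spark versions
    .getOrCreate()
)
```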

Although I used the Spark on YARN framework here, all frameworks are really the same. Before configuring anything, first ask yourself what each configuration item is actually for; once you know what it does, it is easy to configure.

And when configuring jobs, try not to use 100% of the resources or run at full load, or problems will occur sooner or later. For example, if you only have three worker nodes, then when a job runs it is best to have at most two task-running executors plus one running ApplicationMaster, one per node. Unless there is a special need, do not make one worker node carry several of them; once tasks pile up, the cluster cannot take the load, and if it crashes you are done for.

Finally, a bit of experience. Much of the time at work your reasoning is actually correct, yet after your job is submitted you find the displayed resource usage is wrong, or at least different from what you expected, even on a well-tuned cluster. Do not assume you are mistaken; this situation really does happen, and there are generally three causes.

The first is the resource scheduler's own control scheme. No scheduler's placement is 100% appropriate, and resources cannot be squeezed perfectly every single time. This is a matter of probability that we have to allow for rather than obsess over; all we can say is: leave margin and think things through as much as possible when configuring.

The second cause is the most common, and it appears in all kinds of non-native tools. We develop with vendor tools nowadays, and these tools generally ship with built-in optimization strategies; the parameters you submit may be deemed unsuitable by the internal strategy and silently "optimized" for you. If this happens, find the actual command that triggered the job and compare it with your configuration; most likely it was rewritten. Also, some vendor products limit which parameters are configurable, and only a few take effect, so the other custom settings you made simply never apply.
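One quick way to check what actually took effect is to dump the effective configuration from inside the running job and compare it with what you submitted. This sketch assumes the `spark` session from the earlier examples:

```python
# Print the configuration the running job actually ended up with,
# to compare against what was submitted.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.executor", "spark.yarn", "spark.dynamicAllocation")):
        print(key, "=", value)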

The third is the most helpless one. You submit a job, check it in YARN, and find that the initialization resources match your configuration, but they diverge as soon as the job starts running. This kind of problem means the cluster itself needs certain optimizations; otherwise such a cluster will certainly waste resources when running jobs, for example the initialization phase taking up too many resources that then go under-used.
