Impala source code: resource management and resource isolation


 

Foreword

 

Impala is a query system with an MPP architecture. To offer it as a platform service, the first problem to solve is resource isolation: multiple products should affect each other as little as possible, ideally not at all. The cleanest isolation is physical isolation on separate machines: product A uses one set of machines, product B uses another, and the frontend routes queries to the appropriate cluster by product. This achieves ideal resource isolation, but it greatly increases the difficulty of deployment and operations, and it rules out resource sharing: even when product A has no tasks running, product B cannot use product A's resources, which is plainly wasteful. Chairman Mao taught us that waste is shameful, so we have to find a way to isolate resources between products while still making full use of them, which turns out to be a genuinely hard job.

 

YARN

 

In the big data ecosystem, the first thing that comes to mind for resource management and resource isolation is YARN, the resource management system in use since Hadoop 2.0. YARN manages all the resources in the system (mainly CPU and memory) through a centralized service, the ResourceManager, and defines a queue for each product or business line; the queue definition caps the resources that tasks submitted to it may request. When a task is submitted to the ResourceManager, an ApplicationMaster is started to handle that task's resource requests and scheduling. It requests resources from the ResourceManager according to the task's needs, and the ResourceManager decides whether to grant them based on the queue's remaining resources; if the queue's resources are exhausted, the task must wait until resources are released. Once the ApplicationMaster has obtained resources, it asks the NodeManager to start a Container holding those resources; Containers are implemented with the lightweight cgroups isolation mechanism. To suit different usage scenarios, YARN also ships with different allocation and scheduling policies, typically the Capacity Scheduler and the Fair Scheduler.
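As a rough mental model of this grant-or-wait decision, here is a toy Python sketch (this is not YARN's actual code; all names are invented for illustration):

# Toy model of a YARN-style per-queue admission check (illustrative only).
class Queue:
    def __init__(self, max_mem_mb, max_vcores):
        self.max_mem_mb = max_mem_mb
        self.max_vcores = max_vcores
        self.used_mem_mb = 0
        self.used_vcores = 0

    def try_allocate(self, mem_mb, vcores):
        # Grant a container only if the queue still has headroom;
        # otherwise the ApplicationMaster must wait for releases.
        if (self.used_mem_mb + mem_mb <= self.max_mem_mb
                and self.used_vcores + vcores <= self.max_vcores):
            self.used_mem_mb += mem_mb
            self.used_vcores += vcores
            return True
        return False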

 

This client-to-YARN submission flow is the same one used for MR and Spark jobs. For a system with an MPP architecture, however, query response time is determined by the slowest node, so to improve query performance as many nodes as possible should participate in the computation, and every task on YARN starts a new process. Process startup time is acceptable for batch tasks, which run for a long time anyway, but for ad-hoc queries that pursue low latency the cost is rather high: process startup plus initialization may well exceed the actual execution time.

 

Besides native YARN scheduling, Impala also experimented with a service called Llama (Long-Lived Application Master) for resource management and scheduling. Llama is essentially an ApplicationMaster on YARN that coordinates between Impala and YARN: when Impala receives a query, it requests resources from Llama according to the query's estimated demand, and Llama in turn requests available resources from YARN's ResourceManager. But as mentioned earlier, to guarantee query speed Impala needs to obtain all of a query's resources at once before execution can proceed, so Llama implements exactly this kind of batch allocation: a query can proceed only once the whole batch of resources has arrived. Llama also caches the resources it has requested. Even so, Llama ultimately still has to request resources from YARN and start processes, so the problem of relatively high latency remains, and Impala dropped support for Llama after version 2.3.

 

Impala resource isolation

 

At present Impala is still deployed as long-running processes, with resources allocated per query. Newer versions (2.6.0 and later) add a feature called Admission Control, which achieves resource isolation in a certain sense. Let's take a deeper look at this mechanism and see how it isolates and controls resources.

First of all, by Impala's architecture, every SQL query, from parsing through execution-plan generation to execution, runs on an impalad node. To enable Admission Control, the following two parameters need to be configured on impalad:

1. --fair_scheduler_allocation_path specifies the path of the fair-scheduler.xml configuration file, which is similar to YARN's fair-scheduler.xml; the specific contents are described in detail below;

2. --llama_site_path specifies Llama's configuration file, llama-site.xml. Didn't we just say that new versions no longer use Llama, so why configure it? In fact, the configuration items read from this file are all legacy items.

Next, let's look at how to configure these two files in detail. For the first file, fair-scheduler.xml, anyone familiar with YARN knows that it configures the fair scheduler. I won't go into how YARN's fair scheduling is implemented; essentially, the file configures the resource allocation of each queue. The following is a configuration example:

 

<queue name="sample_queue">
  <minResources>10000 mb,0vcores</minResources>
  <maxResources>90000 mb,0vcores</maxResources>
  <maxRunningApps>50</maxRunningApps>
  <weight>2.0</weight>
  <schedulingPolicy>fair</schedulingPolicy>
  <aclSubmitApps>charlie</aclSubmitApps>
</queue>

 

Reading the Impala source code, however, reveals that Impala uses only two of each queue's settings: aclSubmitApps and maxResources. The former determines which users may submit tasks to the queue; if no queue is specified, the request is submitted to the default queue, and if the default queue does not exist or the user has no permission to submit to it, the request is rejected. The latter determines the maximum resources the queue may use across the entire cluster; currently the only resource Impala cares about is memory. In the example above, the sample_queue queue may use at most 90GB of memory cluster-wide, and only the user charlie may submit to it.

Since only these two settings are used, why doesn't Impala define its own configuration format instead of reusing fair-scheduler.xml directly? I suspect it is partly to avoid writing a new parsing class, since the YARN interfaces can be used as-is, and partly to leave room for fuller integration later. Now let's look at what is used from the Llama configuration. A configuration example follows:

 

<property>
  <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
  <value>10</value>
</property>
<property>
  <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
  <value>50</value>
</property>
<property>
  <name>impala.admission-control.pool-default-query-options.root.default</name>
  <value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
</property>
<property>
  <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
  <value>30000</value>
</property>

 

 

These settings mean the following; each concrete configuration item is one of the keys below with the queue name appended:

 

// Maximum number of queries running concurrently in the queue; default: unlimited
llama.am.throttling.maximum.placed.reservations
// Maximum number of queries queued (blocked) in the queue; default: 200
llama.am.throttling.maximum.queued.reservations
// Maximum time a blocked query may wait in the queue; default: 60s
impala.admission-control.pool-queue-timeout-ms
// Default query options applied to queries submitted to the queue
impala.admission-control.pool-default-query-options

 

Impala Implementation

 

 

Now that we have analyzed the configuration items Admission Control uses: after creating a session, a user can direct its queries to a queue by setting REQUEST_POOL=pool_name (for example, SET REQUEST_POOL=sample_queue); of course, if the user has no permission to submit to that queue, execution will fail later. Let's follow the query flow to see how Impala uses these parameters to achieve resource isolation. When impalad receives a query request, the request contains not only the query SQL but also a batch of query options; what concerns us here is the queue the request is submitted to (the REQUEST_POOL option). Impalad first determines the queue the query should be submitted to based on the submitting user and the queue option. The rules for selecting the queue are as follows:

1. If neither fair-scheduler.xml nor llama-site.xml is configured on the server, the resource control service is not enabled, and all requests are submitted to a single default queue named default-pool;

2. If the query does not specify REQUEST_POOL, REQUEST_POOL is set to the YARN default queue, default.

Impalad then checks whether the queue name exists, and whether the submitting user has permission to submit tasks to that queue. If the queue name does not exist or the user has no permission to submit, the query fails. A sketch of this selection logic follows.
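To make these rules concrete, here is a minimal Python sketch of the selection logic described above (function and field names are mine, not Impala's; the real code is C++ inside impalad):

# Illustrative sketch of Admission Control queue selection (not Impala's code).
DEFAULT_POOL = "default-pool"   # used when no scheduler configs are present
YARN_DEFAULT = "default"        # YARN's default queue name

def resolve_pool(user, request_pool, configured, pools):
    # pools: queue name -> {"acl": set of users allowed to submit}
    if not configured:                    # no fair-scheduler.xml / llama-site.xml
        return DEFAULT_POOL               # rule 1: everything shares one pool
    pool = request_pool or YARN_DEFAULT   # rule 2: fall back to "default"
    if pool not in pools:
        raise ValueError("unknown pool: %s" % pool)
    if user not in pools[pool]["acl"]:
        raise PermissionError("user %r may not submit to %r" % (user, pool))
    return pool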

After the query's initialization work is done (queue selection is only one part of it), the GetExecRequest interface of the FE is called to generate the execution plan. Plan generation is roughly divided into three phases:

1. Parse the SQL, generate the logical execution plan, and preprocess it;

2. Generate a single-node execution plan from the logical plan;

3. Convert the single-node plan into a distributed physical execution plan (this will be introduced separately later).

 

The impalad node then judges, based on the execution plan, whether the query can proceed immediately. The query needs to queue only in the following situations:

1. There are already queries queued in the current queue; since the blocking queue is scheduled FIFO, new queries must go straight to the back of it;

2. The queue has reached its configured limit on concurrently running queries;

3. The memory required by the current query cannot be satisfied.

 

The first two conditions are easy to check. For the third, Impala needs to know the memory required by the current query and the memory currently available, and the check is made at two levels: the queue's remaining memory across the cluster, and the remaining memory on each individual node. Impala first checks whether admitting the query's cluster-wide memory requirement would exceed the queue's configured memory cap; it then checks, for each impalad node, whether the memory the query needs on that node plus the node's current usage would exceed the node's memory limit (set by the impalad parameter mem_limit). Then the question arises again: how much memory does the query need in the whole cluster, and how much on each node? The cluster-wide requirement is simply the per-node requirement multiplied by the number of participating nodes, so the core problem is estimating the memory the query needs on each node.
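Putting the three conditions together, the admission check can be sketched like this (again a simplification with invented names, continuing the sketch above):

# Illustrative admission check; the real logic lives in impalad's admission controller.
def have_resources(query, pool):
    if len(pool.running) >= pool.max_requests:        # condition 2: concurrency cap
        return False
    cluster_need = query.per_node_mem * len(query.nodes)
    if pool.cluster_mem_used + cluster_need > pool.max_mem:   # cluster-wide cap
        return False
    for node in query.nodes:                          # per-node limit (mem_limit)
        if node.mem_used + query.per_node_mem > node.mem_limit:
            return False
    return True

def can_admit_immediately(query, pool):
    # Condition 1: a new query runs at once only if the FIFO queue is empty.
    return not pool.queued and have_resources(query, pool)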

 

You might assume, as I did, that the memory each node consumes for a query is estimated from the query plan, but doing that accurately is very difficult, so let's look at what Impala currently does. The single-node memory estimate is computed according to the following priorities (a sketch follows the list):

1. First check whether the query option rm_initial_mem is set (it can be set via SET rm_initial_mem=xxx); if so, that value is used directly as the estimate;

2. Otherwise, check whether impalad was started with rm_always_use_defaults=true; if so, use the memory size configured in rm_default_memory;

3. Otherwise, check whether the session has mem_limit set (via SET mem_limit=xxx; note the difference from the mem_limit configured when impalad starts); if so, use that value;

4. Otherwise, check whether the execution plan has computed the memory to allocate on each node, and use that;

5. If none of the above applies, fall back to the rm_default_memory configuration (an impalad startup parameter), whose default value is 4GB.
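In sketch form, the priority chain looks like this (names illustrative):

# Illustrative priority chain for the per-node memory estimate.
def per_node_mem_estimate(opts, flags, plan):
    if opts.get("rm_initial_mem"):           # 1. explicit query option
        return opts["rm_initial_mem"]
    if flags.rm_always_use_defaults:         # 2. impalad forces the default
        return flags.rm_default_memory
    if opts.get("mem_limit"):                # 3. session-level mem_limit
        return opts["mem_limit"]
    if plan.per_node_mem is not None:        # 4. planner's per-node estimate
        return plan.per_node_mem
    return flags.rm_default_memory           # 5. fallback, default 4GB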

 

As this logic shows, impalad usually ends up determining each node's allocation from the estimate in the execution plan; and an estimate based only on statistics is inevitably inaccurate: for some complex queries the error can be very large.

With the above analysis, the whole query admission flow has been sorted out. But what if current resources cannot satisfy the query? In that case the query is placed in the queue, where it blocks until one of the following two conditions is met (sketched below):

1. The queue frees up enough of the resources the query requires;

2. The query times out in the queue. The timeout is set per queue by the queue_timeout_ms parameter; if the queue does not set it, it is determined by the impalad startup parameter queue_wait_timeout_ms, which defaults to 60s.
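Continuing the sketch, a queued query's wait would look roughly like this (the real impalad is event-driven rather than polling):

import time

def wait_in_pool(query, pool, timeout_ms):
    # Block until the query reaches the head of the queue and resources free up,
    # or until the pool's timeout expires.
    deadline = time.time() + timeout_ms / 1000.0
    pool.queued.append(query)
    while time.time() < deadline:
        if pool.queued[0] is query and have_resources(query, pool):
            pool.queued.pop(0)
            pool.running.append(query)
            return                        # admitted: execution can start
        time.sleep(0.1)
    pool.queued.remove(query)
    raise TimeoutError("admission timed out after %d ms" % timeout_ms)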

 

Statestore revisited

 

We won't go into how the memory estimate in the execution plan is calculated here, but there is a more important question: each impalad works independently, and only contacts the other impalads when it needs to hand out tasks, so how does an impalad learn the resource state of the other impalad nodes, including the memory used by each queue and by each node? This relies on the statestored service introduced in our previous article. When impalad starts, an impala-request-queue topic is registered with statestored. Every impalad is both a publisher and a subscriber of this topic: it periodically publishes the current node's memory usage, and then updates its view of cluster-wide resource usage from the latest information in the topic. This kind of interaction is indeed very convenient, but it brings some uncertainty; for example, network jitter between an impalad and statestored may leave the node unable to obtain the latest resource usage.
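A toy model of this exchange (invented names; the real mechanism is statestored's topic publish/subscribe protocol):

# Toy statestored: one topic, every impalad both publishes and subscribes.
class Statestore:
    def __init__(self):
        self.topic = {}                      # impalad address -> latest stats

    def publish(self, node_addr, stats):
        self.topic[node_addr] = stats        # e.g. {"mem_used": ..., "pools": ...}

    def snapshot(self):
        return dict(self.topic)              # subscribers pull the latest view

# Each impalad periodically does roughly:
#   statestore.publish(my_addr, local_usage())
#   cluster_view = statestore.snapshot()     # may be stale under network jitter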

 

Hard limits

 

With Admission Control, impalad achieves resource isolation in a certain sense, but it is soft isolation after all, not isolation by starting new processes under cgroups as YARN does, so a single relatively heavy query could still bring down the whole cluster. As a query platform, we must guarantee that even if some queries get errors back, the service of the cluster as a whole is never affected. Here we turn to a key Impala query parameter: mem_limit. Each impalad can be started with a mem_limit parameter, which is the maximum memory that execution node may allocate (managed by tcmalloc); each query can also set mem_limit, which is the maximum memory the query may allocate on each node. When an impalad executes its part of a query (a fragment, in Impala terms), it allocates a block pool of size mem_limit, and the memory the fragment needs is allocated from that pool. If allocations would exceed the pool's size, some block is selected and spilled to external storage; the details of this path are complex enough to deserve a separate discussion. Spilled blocks are read back from local disk into memory when needed. Queries that have to spill are undoubtedly slower, but this preserves the availability and stability of the whole system well.
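A minimal sketch of the spill-to-disk idea (a strong simplification of the block-manager behaviour described above; names are illustrative):

# Illustrative per-query block pool capped at mem_limit, spilling when full.
class BlockPool:
    def __init__(self, mem_limit, block_size):
        self.mem_limit = mem_limit
        self.block_size = block_size
        self.in_mem = []                     # blocks resident in memory
        self.spilled = []                    # blocks evicted to scratch disk

    def allocate_block(self):
        # Evict (spill) a block when a new one would exceed mem_limit.
        if (len(self.in_mem) + 1) * self.block_size > self.mem_limit:
            victim = self.in_mem.pop(0)      # in reality the victim is chosen
            self.spilled.append(victim)      # carefully and written to disk
        block = bytearray(self.block_size)
        self.in_mem.append(block)
        return block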

 

Summary

 

In actual deployments, we assign queues according to each user's data volume and business type, and even assign different queues to the same business in different time windows; we also set the queues' default query options carefully, especially the mem_limit parameter and the maximum concurrency, which does a good job of limiting interference between tenants. To prevent abuse, users can be barred from setting the MEM_LIMIT parameter themselves, protecting the stability of the cluster as much as possible.

 
