Impala Series: Impala Query Optimization


===========================
Understanding the mem_limit parameter
===========================
set mem_limit=-1b; -- Remove the memory limit
set mem_limit=1gb; -- Set the per-node memory limit to 1 GB; note that the limit applies to each node, not to the whole cluster
set mem_limit=1mb; -- Set the per-node memory limit to 1 MB; again, this is a per-node limit
If a per-node mem_limit is set, Impala skips the query memory estimation step and directly checks the remaining memory in the pool. If there is enough, the query executes immediately. If there is not, the query is placed in the queue according to the policy configured for the pool, and if memory is still insufficient when the queue timeout expires, the query is cancelled. If the user does not set mem_limit, the pool's default_pool_mem_limit value is used by default. If default_pool_mem_limit is not set either, Impala estimates the memory consumption itself, based on statistics and the execution plan.

Of course, if the memory used by a query exceeds mem_limit at runtime, Impala terminates the query.

Suppose the per-node limit is set to 10 GB and the cluster has 10 nodes. Before running the query, Impala checks whether the pool has at least 100 GB (10 GB x 10 nodes) of memory remaining.

Impala has several pool-level settings:
default_pool_max_queued: the maximum number of queries allowed in the waiting queue.
default_pool_max_requests: the maximum number of queries executing at the same time.
default_pool_mem_limit: the default memory limit applied to queries; it is usually turned off with the disable_pool_mem_limits parameter.
queue_wait_timeout_ms: the maximum time a query may wait in the queue. The default is 60000 milliseconds (1 minute); a query that waits longer than this is rejected.


The necessity of manually setting the MEM_LIMIT parameter
1. It avoids the adverse consequences of Impala's over-estimation.
Before executing a query, Impala predicts how much memory the SQL will consume, mainly from the table statistics. In many cases the estimate is very rough: even when the table statistics are up to date, Impala can over-estimate memory consumption. Over-estimation has two consequences: (1) if the estimate is greater than the remaining memory in the Impala pool, Impala rejects the query; (2) concurrency is reduced.
2. It may improve execution speed.
With MEM_LIMIT set manually, Impala skips the memory estimation step, which may speed up execution.

How to set the MEM_LIMIT parameter?
Run the SQL once first, then check the memory_per_node_peak value in the profile, which is the actual memory consumption. The same figure is shown in the Peak Mem field on the Summary tab of the query detail page in the Impala web UI.
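The measure-then-limit approach described here might look like the following in impala-shell. This is only a sketch: the table name sales, its columns, and the memory figures are hypothetical, and how much headroom to leave above the observed peak is a judgment call.

```sql
-- Sketch of measuring peak memory and then setting MEM_LIMIT; run in impala-shell.
-- The table (sales), its columns, and all numbers are hypothetical.

-- 1. Run the query once to observe its real memory use.
SELECT store_id, COUNT(*) AS order_count
FROM sales
GROUP BY store_id;

-- 2. Inspect the query that just ran.
SUMMARY;   -- impala-shell command: per-operator summary, including a Peak Mem column
PROFILE;   -- impala-shell command: full profile of the most recent query

-- 3. Suppose the observed per-node peak was about 1.5 GB.
--    Set a per-node limit with some headroom and rerun.
SET MEM_LIMIT=2gb;
SELECT store_id, COUNT(*) AS order_count
FROM sales
GROUP BY store_id;
```

If the limit is set too low, the query is terminated at runtime, so it is safer to start with generous headroom and tighten it gradually.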


===========================
Soft isolation of impala resources
===========================
Excerpted from the description of Gridsum: https://blog.csdn.net/qq_18882219/article/details/78447558

Every Impalad node can accept queries. Information such as how many queries are currently running in each Pool, how much memory they occupy, and how many queries are queued is updated by each Impalad and broadcast to the other Impalads through Statestored, so it may be inconsistent from node to node. When an Impalad receives a query it has to make decisions, such as whether to reject it or whether to queue it, and the local information those decisions rely on may be stale. Impala's Pool-based resource isolation is therefore a kind of soft isolation: for any given Pool, the memory in use may exceed the Pool's maximum memory, and the number of running queries may exceed the Pool's maximum number of queries. We have confirmed this in actual use. Soft isolation brings two risks:
1. The memory requested on a single node may at some moment exceed the memory allocated to the Impalad process, causing the Impalad to exit with an OOM error.
2. A Pool may use far more resources than it was assigned, which undermines the use of Pools to isolate resources between different businesses.
We also discussed this issue with developers in the Impala community, and the final solution is to designate a single Coordinator for each Pool and send all of that Pool's queries to the same Impalad. That Coordinator then always has the latest information about the resource pool, which turns soft isolation into hard isolation.


===========================
Several other important session variables
===========================
In addition to MEM_LIMIT, the following session variables are commonly used:
EXPLAIN_LEVEL : sets the level of detail of explain and profile output.
DISABLE_UNSAFE_SPILLS : whether to disable spilling to disk. Disk spill still has many restrictions; for example, a non-equi join cannot use it.
REQUEST_POOL : sets the resource pool (queue) the query runs in.


The EXPLAIN_LEVEL parameter controls the output of the explain statement as well as the output of the profile command. Note that explain can be executed from a SQL client, whereas the profile command can only be executed in impala-shell and only shows the profile of the most recently executed SQL statement.
set EXPLAIN_LEVEL = 0; -- 0 or MINIMAL: less output, so key information such as the join order is easier to find
set EXPLAIN_LEVEL = 1; -- 1 or STANDARD: the standard output
set EXPLAIN_LEVEL = 2; -- 2 or EXTENDED: detailed output
set EXPLAIN_LEVEL = 3; -- 3 or VERBOSE: the most detailed output
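As a small illustration of how the level changes what you read, the same query can be explained at two levels. The tables t1 and t2 and their columns are hypothetical.

```sql
-- Hypothetical tables t1 and t2; compare the two plans side by side.

SET EXPLAIN_LEVEL=0;   -- MINIMAL: compact plan, convenient for reading the join order
EXPLAIN
SELECT t1.name, t2.amount
FROM t1
JOIN t2 ON t1.id = t2.id;

SET EXPLAIN_LEVEL=2;   -- EXTENDED: the same plan with more detail per operator
EXPLAIN
SELECT t1.name, t2.amount
FROM t1
JOIN t2 ON t1.id = t2.id;
```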

If a query is slow, you can check whether statistics are missing as follows.
set explain_level=3; -- VERBOSE level, which contains more detailed information
If the explain output contains any of the following lines, the corresponding table has no statistics:
cardinality: unavailable
table stats: unavailable
column stats: unavailable
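Besides reading the explain output, the statistics can be inspected directly. A small sketch, assuming a hypothetical table t1; SHOW TABLE STATS and SHOW COLUMN STATS are standard Impala statements, and a value of -1 in their output marks a figure that has not been collected.

```sql
-- t1 is a hypothetical table name.
SHOW TABLE STATS t1;    -- a #Rows value of -1 means the row count has not been collected
SHOW COLUMN STATS t1;   -- a #Distinct Values of -1 means column statistics have not been collected
```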

===========================
Join reordering
===========================
For multi-table join queries, older versions of Impala always processed the tables in the order in which they appear in the SQL. Newer versions of the Impala execution engine can reorder joins automatically based on table and column statistics. The rule is: large tables (those with a larger table size and more distinct values) are queried first and small tables later. Tables without statistics (Impala assumes their size is 0) are queried last.


===========================
STRAIGHT_JOIN hint
===========================
When statistics are out of date or missing, or when a table has a very unusual data distribution, the table order in Impala's execution plan may not be optimal. In that case it is best to add the STRAIGHT_JOIN hint, which forces Impala to join the tables in the order in which they appear in the SQL. The table order in the SQL should then, of course, be chosen carefully.

STRAIGHT_JOIN applies only to the current SELECT statement. If subqueries and views also need to follow the written table order strictly, add the STRAIGHT_JOIN hint to them explicitly. Also note that STRAIGHT_JOIN is an Impala keyword and cannot be placed inside /*+ */.

select STRAIGHT_JOIN t1.* from t1
join t2 on t1.id=t2.id
;

-- with DISTINCT, STRAIGHT_JOIN comes right after the DISTINCT keyword
select distinct STRAIGHT_JOIN t2.id,t1.name from t1
join t2 on t1.id=t2.id
;
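Because STRAIGHT_JOIN covers only the SELECT it appears in, a query with an inline view needs the keyword at both levels. The sketch below assumes three hypothetical tables t1, t2 and t3.

```sql
-- Hypothetical tables t1, t2, t3.
-- The outer STRAIGHT_JOIN fixes only the outer join order (t1, then the inline view v);
-- the inline view needs its own STRAIGHT_JOIN to keep t2 ahead of t3.
SELECT STRAIGHT_JOIN t1.name, v.cnt
FROM t1
JOIN (
  SELECT STRAIGHT_JOIN t2.id, COUNT(*) AS cnt
  FROM t2
  JOIN t3 ON t2.id = t3.id
  GROUP BY t2.id
) v ON t1.id = v.id;
```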


===========================
Join Algorithm
===========================
Impala supports two join algorithms, nested loop join and hash join. Equi-joins use hash join, which comes in two forms, broadcast and shuffle; non-equi joins use nested loop join. If the ON clause applies an operation to a field but the condition is still an equality, the join is still a hash join.
Broadcast join: well suited to the case where the right table is small. Impala first copies the right table to every node and then joins it with the left table.
Shuffle join: also known as partitioned join, suited to joining two large tables. Note that a partitioned join is not directly related to how the right table is partitioned. Impala splits the right table into N parts, sends them to the nodes that hold the left table, and then performs the join.
Nested loop join: used for non-equi joins. In this case the SHUFFLE/BROADCAST hints cannot be set and spilling to disk is not available. Impala's non-equi joins are relatively inefficient; Vertica handles them very efficiently, and Hive does not support them directly.


SELECT STRAIGHT_JOIN select_list FROM
join_left_hand_table
JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
join_right_hand_table
remainder_of_query;

/* +SHUFFLE */ : use a partitioned (shuffle) join, as described above.
/* +BROADCAST */ : use a broadcast join, as described above.
EXCHANGE : the step in which intermediate results are transmitted back to the coordinator node; it appears in the plan as the EXCHANGE node.
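Filled in with concrete names, the hint syntax could be used as follows. All table and column names here (big_fact, other_big, small_dim, price_bands and their columns) are hypothetical, and whether a hint actually helps depends on the real table sizes.

```sql
-- Hypothetical tables: big_fact and other_big are large; small_dim and price_bands are small.

-- Broadcast join: the small right-hand table is copied to every node.
SELECT STRAIGHT_JOIN f.id, d.name
FROM big_fact f
JOIN /* +BROADCAST */ small_dim d ON f.dim_id = d.id;

-- Shuffle (partitioned) join: intended for joining two large tables.
SELECT STRAIGHT_JOIN f.id, o.amount
FROM big_fact f
JOIN /* +SHUFFLE */ other_big o ON f.id = o.id;

-- Non-equi join: no equality condition, so a nested loop join is used
-- and no SHUFFLE/BROADCAST hint is given.
SELECT f.id, b.band_name
FROM big_fact f
JOIN price_bands b ON f.amount BETWEEN b.low_price AND b.high_price;
```

Running EXPLAIN on each statement shows which join operator and which EXCHANGE nodes the planner actually chose.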

 

===========================
Best Practices
===========================
The Impala execution engine is not that smart yet, so SQL with multi-table joins should follow the recommendations below.
1. Put the largest table leftmost in the table list.
2. In a query with multiple joins, put the most selective join first.
3. Collect table statistics regularly, or collect them proactively after a large volume of DML operations.
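For the last recommendation, statistics collection after a bulk load might look like this. The table names (sales, staging_sales, sales_by_day) and the partition column dt are hypothetical; COMPUTE STATS and COMPUTE INCREMENTAL STATS are standard Impala statements.

```sql
-- Hypothetical tables; sales_by_day is assumed to be partitioned by a string column dt.

-- After a large load into an unpartitioned table, refresh its statistics.
INSERT INTO sales SELECT * FROM staging_sales;
COMPUTE STATS sales;

-- For a partitioned table, only the newly loaded partition needs to be scanned.
COMPUTE INCREMENTAL STATS sales_by_day PARTITION (dt='2024-01-01');
```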

 
