HUAWEI CLOUD GaussDB (for Influx) Explained, Issue 5: Best Practices for Subqueries

This article is shared from the HUAWEI CLOUD Community post "Huawei Cloud GaussDB (for Influx) Revealing the Fifth Issue: Sub-Query of Best Practices", author: GaussDB database.

"Alert! Alert!"

"What's the alarm?" Xiao Wang, still groggy with sleep, had been jolted awake by a phone call from an operations and maintenance colleague, and he looked startled.

"Slow query! The customer has reported a fault! Hurry up and deal with it!"

Xiao Wang quickly opened his laptop, connected remotely to the environment to locate the problem, and found that the slow query was a subquery. "That's odd. The same statement didn't trigger a slow-query alarm yesterday?"

But Xiao Wang quickly figured out why. The inner query of the subquery could have aggregated the data before passing it to the outer query. Because it did not aggregate, the query became very slow once the amount of data grew large.

Having found the crux, Xiao Wang immediately passed the optimized SQL statement to the customer through his operations and maintenance colleagues, and the alarm was finally cleared.

"Looks like it's time to sort out subqueries!" While his thinking was still clear, Xiao Wang began to write things down...


0 1  What is a subquery?

A subquery is a query nested within another query. In InfluxQL syntax it is generally placed in the FROM clause, which makes statements more flexible. Subqueries mainly fall into the following categories:

Scalar subquery: the result is a single value (1 row and 1 column)

Row subquery: the result set is 1 row and N columns

Column subquery: the result set is N rows and 1 column

Table subquery: the result set is N rows and N columns

For example in the query statement:

select first(sum_f1)/first(sum_f2) from (select sum(f1) as sum_f1 from mst), (select sum(f2) as sum_f2 from mst)

Two subqueries are used here. They compute the sums of the columns f1 and f2 of the table mst, respectively, and their results, sum_f1 and sum_f2, serve as the data source of the outer query.

The general syntax for a GaussDB (for Influx) subquery is SELECT_clause FROM (SELECT_statement) [...]. The logic for processing subqueries is shown in the figure below.

The system first processes the subquery statement and caches its result as the data source of the outer query. It then processes the outer query and returns the result to the client.
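The two-stage processing described above can be mimicked in a few lines of Python. This is only a minimal sketch under our own assumptions: the list mst, its sample values, and the helper names inner_sum and outer_ratio are hypothetical stand-ins, not GaussDB (for Influx) internals.

```python
# Hypothetical in-memory stand-in for the measurement "mst".
# Field names f1 and f2 follow the example statement; the values are made up.
mst = [
    {"time": 1, "f1": 2.0, "f2": 4.0},
    {"time": 2, "f1": 3.0, "f2": 6.0},
    {"time": 3, "f1": 5.0, "f2": 10.0},
]


def inner_sum(rows, field, alias):
    """Inner query: SELECT sum(field) AS alias FROM mst, producing one row."""
    return [{alias: sum(r[field] for r in rows)}]


def outer_ratio(src_f1, src_f2):
    """Outer query: first(sum_f1) / first(sum_f2) over the cached results."""
    return src_f1[0]["sum_f1"] / src_f2[0]["sum_f2"]


# Step 1: run both subqueries and keep ("cache") their results.
cached_f1 = inner_sum(mst, "f1", "sum_f1")
cached_f2 = inner_sum(mst, "f2", "sum_f2")

# Step 2: the outer query reads only the cached results, never the raw points.
ratio = outer_ratio(cached_f1, cached_f2)
print(ratio)
```

The point of the sketch is that the outer step sees only what the inner step hands over, so the size of the inner result determines how much work the outer query has to do.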

0 2  Subquery usage scenarios

Subqueries are used when a single simple query cannot express what is needed, or when the result of one query needs further processing. For example, to find the three largest values among the minimum values of each group:

SELECT top(v, 3)
FROM (
SELECT min(value) AS v
FROM mst
GROUP BY tag1
)
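To make the two levels of this statement concrete, here is a minimal Python sketch of the same logic. The sample points and the helper names min_per_group and top_n are hypothetical and exist only for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical sample points; "tag1" and "value" follow the statement above.
points = [
    {"tag1": "a", "value": 5}, {"tag1": "a", "value": 3},
    {"tag1": "b", "value": 9}, {"tag1": "b", "value": 7},
    {"tag1": "c", "value": 1},
    {"tag1": "d", "value": 8}, {"tag1": "d", "value": 6},
]


def min_per_group(rows):
    """Inner query: SELECT min(value) AS v FROM mst GROUP BY tag1."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["tag1"]].append(r["value"])
    return {tag: min(vals) for tag, vals in groups.items()}


def top_n(values, n):
    """Outer query: SELECT top(v, n) over the inner result."""
    return heapq.nlargest(n, values)


mins = min_per_group(points)    # one minimum per tag1 group
top3 = top_n(mins.values(), 3)  # three largest of those minimums
print(mins, top3)
```

Neither step alone answers the question: the grouping step must finish before the top-N step can start, which is exactly the kind of secondary processing a subquery is meant for.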

Subqueries give us a lot of flexibility, but in principle they are not recommended when a plain query will do. The reason is simple: compared with ordinary queries, subqueries involve deeper function calls and larger intermediate data volumes, so they consume more resources and increase latency.

0 3  Case Analysis

In the process of developing with GaussDB (for Influx), we often face some difficulties in sub-queries, such as:

1. When to use subqueries?

2. Faced with a complex scenario, how to decompose it into sub-queries to solve it?

3. Is the written subquery optimal? Can it be optimized again?

Next, we use a specific case to briefly show how to use subqueries efficiently and how to approach the analysis.

A HUAWEI CLOUD user writes about 540 million points to GaussDB (for Influx) every day, across more than 1 million time series. The business workload includes spatiotemporal queries, request success rate queries, and top-N queries.

The following anonymized data is used as sample data for the case analysis and practice:

  • Case 1 When to use subqueries?

The user uses subqueries for spatiotemporal grouping and as the source of the outer query, which aggregates the results of the spatiotemporal grouping. The query statement is:

SELECT SUM(req_nums)
FROM (
SELECT requestNum AS req_nums
FROM req_table
WHERE statement='SUCCESS' AND time>=1629129600000000000
AND time<=1629129611000000000)
WHERE time>=1629129600000000000 AND time<=1629129611000000000
AND req_nums < 50
GROUP BY time(1s), group
ORDER BY time ASC

The resulting problem:

In this usage scenario, the subquery only performs conditional filtering and a column rename, so the inner query is equivalent to SELECT requestNum AS req_nums plus filtering. Because the inner query does not aggregate, a large amount of raw data has to be retrieved and handed to the outer query, which makes the query slow and fails to meet the user's requirements.

Solutions:

Analyzing the statement shows that the user's goal is to take the data that meets the conditions (statement='SUCCESS' AND requestNum < 50) and aggregate it by time and tag (GROUP BY time(1s), group). Once the query target is clear, a more efficient statement can be written: put all the filter conditions together and perform the spatiotemporal aggregation directly.

Improved statement:

SELECT SUM(requestNum)
FROM req_table
WHERE statement='SUCCESS' AND requestNum < 50
AND time>=1629129600000000000 AND time<=1629129611000000000
GROUP BY time(1s), group
ORDER BY time ASC
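The equivalence of the nested and the flattened statement can be checked on a toy data set. The Python sketch below is only an illustration under our own assumptions: the sample points and the function names nested and flat are hypothetical, and the time column already stands for the 1-second bucket.

```python
from collections import defaultdict

# Hypothetical sample points; column names follow the statements above.
points = [
    {"time": 0, "group": "g1", "statement": "SUCCESS", "requestNum": 10},
    {"time": 0, "group": "g1", "statement": "SUCCESS", "requestNum": 70},
    {"time": 0, "group": "g2", "statement": "FAIL", "requestNum": 20},
    {"time": 1, "group": "g1", "statement": "SUCCESS", "requestNum": 30},
    {"time": 1, "group": "g2", "statement": "SUCCESS", "requestNum": 40},
]


def nested(rows):
    """Subquery form: filter and rename first, then filter again and aggregate."""
    inner = [
        {"time": r["time"], "group": r["group"], "req_nums": r["requestNum"]}
        for r in rows
        if r["statement"] == "SUCCESS"
    ]  # intermediate rows that must be materialized for the outer query
    sums = defaultdict(int)
    for r in inner:
        if r["req_nums"] < 50:
            sums[(r["time"], r["group"])] += r["req_nums"]
    return dict(sums), len(inner)


def flat(rows):
    """Flattened form: all filters applied together, then aggregated in one pass."""
    sums = defaultdict(int)
    for r in rows:
        if r["statement"] == "SUCCESS" and r["requestNum"] < 50:
            sums[(r["time"], r["group"])] += r["requestNum"]
    return dict(sums)


nested_result, intermediate_rows = nested(points)
flat_result = flat(points)
print(nested_result == flat_result, intermediate_rows)
```

Both forms return the same grouped sums, but only the nested form builds an intermediate row set, and that row set grows with the raw data.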

  • Case 2 Using subqueries to solve complex problems

In the user's business scenario, the request success rate needs to be calculated. That is, one column of data is filtered and counted under different filter conditions, and then the ratio of the results is computed. GaussDB (for Influx) does not support the CASE WHEN statement, so it is difficult to pick out different data from the same column according to different conditions. Many developers are at a loss when they run into this problem.

Solutions:

Step 1: Use subqueries together with multiple FROM sources to split the same column of data into two columns according to the filter conditions:

SELECT * FROM
(SELECT requestNum AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT requestNum AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)

Step 2: Count the queried data:

SELECT SUM(success_requestNum) AS total_success_reqNum, SUM(total_requestNum) AS total_requestNum
FROM
(SELECT requestNum AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT requestNum AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)
ORDER BY time ASC

Step 3: Write a query statement for the final success rate:

SELECT SUM(success_requestNum)/SUM(total_requestNum) AS success_ratio
FROM
(SELECT requestNum AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT requestNum AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)
ORDER BY time ASC

  • Case 3 How to optimize the subquery statement?

Based on case 2, we got the method to find the success rate. The query statement is as follows:

SELECT SUM(success_requestNum)/SUM(total_requestNum) AS success_ratio
FROM
(SELECT requestNum AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT requestNum AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)
ORDER BY time ASC

The resulting problem:

This statement takes more than 120 s to run, which does not meet the business requirements, so it needs further optimization.

Improved statement:

Following the subquery principles and solutions described above, the aggregation should be moved inside the subqueries to reduce the amount of data passed to the outer query and speed up the query. The optimized statement is as follows:

SELECT SUM(success_requestNum)/SUM(total_requestNum) AS success_ratio
FROM
(SELECT SUM(requestNum) AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT SUM(requestNum) AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)
ORDER BY time ASC

The query results are the same as before the optimization.

Optimization effect:

The unoptimized query takes 126 s and the optimized query takes 2.7 s, a performance improvement of about 47 times.
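Why moving SUM into the subqueries helps can be seen by counting how many rows each variant hands to the outer query. The Python sketch below is a toy model under our own assumptions: the 1000 synthetic points and the function names unoptimized and optimized are hypothetical, and the row counts say nothing about the real timings quoted above.

```python
# Hypothetical synthetic points: every fourth request fails.
rows = [
    {"statement": "SUCCESS" if i % 4 else "FAIL", "requestNum": 1 + i % 5}
    for i in range(1000)
]


def unoptimized(points):
    """Inner queries return raw rows; the outer query does all the summing."""
    success = [p["requestNum"] for p in points if p["statement"] == "SUCCESS"]
    total = [p["requestNum"] for p in points]
    rows_passed = len(success) + len(total)
    return sum(success) / sum(total), rows_passed


def optimized(points):
    """Inner queries aggregate first; the outer query receives one row each."""
    success = [sum(p["requestNum"] for p in points if p["statement"] == "SUCCESS")]
    total = [sum(p["requestNum"] for p in points)]
    rows_passed = len(success) + len(total)
    return sum(success) / sum(total), rows_passed


ratio_before, passed_before = unoptimized(rows)
ratio_after, passed_after = optimized(rows)
print(ratio_before == ratio_after, passed_before, passed_after)
```

In this toy model the ratio is identical, while the number of rows crossing the subquery boundary drops from 1750 to 2. With hundreds of millions of points per day, that intermediate volume is what dominates the cost.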

* Note

The purpose of using SUM(success_requestNum) and SUM(total_requestNum) in the outer query is to align the data. Using SELECT success_requestNum / total_requestNum directly gives incorrect results, because the data from the two subqueries at the same timestamp cannot be aligned:

SELECT *
FROM
(SELECT SUM(requestNum) AS success_requestNum FROM req_table WHERE statement='SUCCESS' AND time>=1629129600000000000 AND time<=1629129611000000000),
(SELECT SUM(requestNum) AS total_requestNum FROM req_table WHERE time>=1629129600000000000 AND time<=1629129611000000000)
ORDER BY time ASC
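The alignment problem can be pictured with a small Python sketch. It rests on an assumption of ours, not on a statement from the product documentation: we assume that rows coming from the two subqueries are returned as separate rows, each with the other subquery's column empty, rather than being joined by timestamp. The sample values and the helper names row_wise_ratio and column_sum are hypothetical.

```python
# Assumed (hypothetical) shape of the SELECT * result: one row per subquery,
# with the column of the other subquery left empty.
merged = [
    {"time": 0, "success_requestNum": 30, "total_requestNum": None},
    {"time": 0, "success_requestNum": None, "total_requestNum": 40},
]


def row_wise_ratio(rows):
    """Direct division per row: fails whenever one of the columns is empty."""
    out = []
    for r in rows:
        s, t = r["success_requestNum"], r["total_requestNum"]
        out.append(s / t if s is not None and t is not None else None)
    return out


def column_sum(rows, field):
    """SUM over one column, skipping empty values."""
    return sum(r[field] for r in rows if r[field] is not None)


row_ratios = row_wise_ratio(merged)
aligned_ratio = (
    column_sum(merged, "success_requestNum") / column_sum(merged, "total_requestNum")
)
print(row_ratios, aligned_ratio)
```

Under this assumption no single row carries both values, so a row-wise division has nothing to divide, whereas summing each column first collapses both columns into one value each and makes the ratio well defined.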

Query time grows with the total volume of data queried: the more data a query touches, the slower it runs. Therefore, whether or not a statement uses subqueries, the first principle is to reduce the volume of data in the query as much as possible. This means that aggregate queries (typically queries that reduce the amount of data) should be placed inside subqueries whenever possible.

0 4  Flexible subqueries and high performance

GaussDB (for Influx) not only provides flexible subquery capabilities, but also uses vectorization, memory reuse, and other techniques to keep improving query efficiency, meeting users' query performance requirements in massive-data scenarios.

Vectorized query: GaussDB (for Influx) uses SIMD instruction sets to increase the parallelism of data processing. In addition, with a vectorized data model, one iteration processes a whole batch of points, which greatly reduces the number of iterations and speeds up computation.

Memory reuse: During a query, memory allocation and reclamation by the GC are kept to a minimum, and the memory that has been requested is managed separately. This avoids memory bloat during queries, which would otherwise trigger frequent GC and slow queries down.
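The effect of batching on the number of iterations can be sketched in Python. This is only an illustration of the iteration count, not of SIMD itself or of the engine's actual implementation. The batch size of 1024 and the function names sum_pointwise and sum_batched are our own hypothetical choices.

```python
def sum_pointwise(values):
    """Row-at-a-time model: one iteration per point."""
    total, iterations = 0, 0
    for v in values:
        total += v
        iterations += 1
    return total, iterations


def sum_batched(values, batch_size):
    """Batch-at-a-time model: one iteration per batch of points."""
    total, iterations = 0, 0
    for start in range(0, len(values), batch_size):
        total += sum(values[start:start + batch_size])
        iterations += 1
    return total, iterations


values = list(range(10_000))
point_total, point_iters = sum_pointwise(values)
batch_total, batch_iters = sum_batched(values, 1024)
print(point_total == batch_total, point_iters, batch_iters)
```

Both models compute the same sum, but the batched one needs 10 iterations instead of 10000 for this input, which is the kind of reduction in per-iteration overhead that a vectorized data model aims for.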

0 5  Summary

GaussDB (for Influx) supports subqueries, which give us great flexibility in solving problems but also place high demands on users. Poorly written subqueries often lead to problems such as high query latency and high resource consumption, so pay attention to the following points when using GaussDB (for Influx) subqueries:

1. Understand which business logic subqueries are suited to. Subqueries fit scenarios where the queried data needs to be processed a second time (or several times);
2. Avoid subqueries wherever the same result can be obtained without them;
3. Where a subquery must be used, put the operations that reduce the amount of data inside the subquery as far as possible, to reduce the overall volume of data queried and thus speed up the query.

0 6  end

The author of this article: HUAWEI CLOUD Database Innovation Lab & HUAWEI CLOUD Spatiotemporal Database Team
Welcome to join us!
Cloud Database Innovation Lab (Chengdu, Beijing) Resume Delivery Email: [email protected]
HUAWEI CLOUD Spatiotemporal Database Team (Xi'an, Shenzhen) Resume Delivery Email: [email protected]

Origin blog.csdn.net/devcloud/article/details/124091727