AI + Data-Driven Index Recommendation for Slow Queries

At present, the daily average number of slow queries within Meituan has exceeded hundreds of millions. Analyzing these slow queries and building appropriate indexes for them is a challenge faced by the Meituan Database R&D Center. To improve recommendation quality, the R&D team of the Meituan database platform launched a scientific research cooperation with East China Normal University, using AI + data-driven index recommendation to recommend indexes for slow queries in parallel with the cost-based method.

1 Background

With the continuous growth of Meituan's business volume, the number of slow queries keeps increasing, and the daily average has exceeded hundreds of millions. Relying on DBAs and developers to manually analyze these slow queries and build appropriate indexes is clearly unrealistic. To solve this problem, Meituan's internal DAS (Database Autonomous Service) platform has integrated cost-based slow-query optimization suggestions that automatically recommend indexes for slow queries. However, some problems remain:

  • Cost-based slow-query optimization suggestions use the optimizer's cost estimates to recommend the index that improves the query cost the most, but those cost estimates are not entirely accurate [1], so indexes may be missed or wrongly selected.
  • Cost-based slow-query optimization suggestions need to calculate the improvement in query cost under different candidate indexes, which requires a large number of index creations and deletions. Because really creating and dropping indexes is very expensive, fake-index (hypothetical index) technology [2] is needed: instead of creating a real physical index file, it estimates the benefit of an index for a query by simulating the query plan as if the index existed. Most of Meituan's business runs on MySQL instances, and unlike the commercial database SQL Server or the open-source database PostgreSQL, MySQL does not integrate fake-index technology, so a storage engine supporting fake indexes has to be built, which carries a higher development cost. This is the solution currently adopted by the DAS platform for cost-based slow-query optimization suggestions.

To solve the above two problems, the Meituan Database R&D Center and the School of Data Science and Engineering of East China Normal University launched a scientific research cooperation on "data-driven index recommendation". The two parties integrated AI + data-driven index recommendation into the DAS platform, recommending indexes for slow queries in parallel with the cost-based method to improve the recommendation effect.

  • First, the cost-based method recommends indexes for slow queries every day and evaluates on the sample database whether the recommended indexes really improve query execution time, which accumulates a large amount of credible training data for the AI method. The AI model can, to a certain extent, make up for the indexes that the cost-based method misses or selects wrongly.
  • Second, the AI-based method treats index recommendation for slow queries as a binary classification problem, directly judging through a classification model whether building an index on one or more columns can improve the query's execution performance, without relying on a query optimizer or fake-index technology. This makes the AI method more general and cheaper to develop.

2 Introduction to index recommendation

Index recommendation can be divided into two levels, the workload level and the query level:

  • At the workload level, index recommendation means recommending an optimal set of indexes that minimizes the cost of the entire workload under a limited index storage space or index count.
  • Index recommendation at the query level can be regarded as a simplified version of workload-level index recommendation: it recommends the missing indexes for a single slow query to improve its performance.

2.1 Cost-based index recommendation

Cost-based index recommendation [3] mostly focuses on workload-level index recommendation. Each column or combination of columns appearing in a query can be regarded as a candidate index that may reduce the workload cost, and all candidate indexes together constitute a huge search space (the candidate index set).

The goal of cost-based index recommendation is to search the candidate index set for an optimal set of indexes that reduces the workload cost the most. If the number of candidate indexes is N and at most M indexes may be recommended, then the size of the search space for the optimal index set is:

$$C_{N}^{M}=\frac{N(N-1)\cdots(N-M+1)}{M!}$$

This search problem is NP-hard [4]. At present, most cost-based index recommendation methods use a greedy strategy to simplify the search, which may leave the final recommended indexes suboptimal [5].
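The greedy strategy mentioned above can be sketched as follows. This is a minimal illustration, not the DAS implementation: the cost model here is a toy dictionary of assumed per-index savings, and the function names are hypothetical.

```python
def greedy_index_selection(candidates, workload_cost, max_indexes):
    """Greedy strategy: repeatedly add the single candidate index that
    reduces the (estimated) workload cost the most, instead of searching
    all C(N, M) subsets of the candidate set."""
    chosen = set()
    current = workload_cost(chosen)
    for _ in range(max_indexes):
        best, best_cost = None, current
        for idx in candidates:
            if idx in chosen:
                continue
            cost = workload_cost(chosen | {idx})
            if cost < best_cost:
                best, best_cost = idx, cost
        if best is None:  # no remaining candidate improves the cost
            break
        chosen.add(best)
        current = best_cost
    return chosen

# Toy cost model: each index independently saves a fixed amount of cost.
savings = {"idx_a": 50, "idx_b": 30, "idx_c": 5}
cost_fn = lambda s: 100 - sum(savings[i] for i in s)
print(greedy_index_selection(savings.keys(), cost_fn, max_indexes=2))
```

Because each step only looks one index ahead, interactions between indexes (e.g. one index subsuming another) can lead the greedy search to a suboptimal final set, which is exactly the weakness noted in [5].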

2.2 AI + data-driven index recommendation

AI + data-driven index recommendation focuses on query-level index recommendation. The starting point is that a slow query caused by missing indexes in one database may have similar index-creation cases in other databases: the query statements are similar, so building indexes on columns in similar positions may yield similar benefits. For example, in the figure below, queries q_s and q_t are very similar in statement structure and column types, so we can recommend the missing index for query q_t by learning the index-creation pattern of query q_s.

For index recommendations with different numbers of columns, we train separate binary classification models based on XGBoost. For example, we currently support index recommendations of up to three columns, so we train a single-column, a two-column, and a three-column index recommendation model. Given a single-column candidate index and its corresponding slow query, we use the single-column model to judge whether that candidate index can improve the query's performance.

Similarly, given a two-column (or three-column) candidate index and its corresponding slow query, we use the two-column (or three-column) model to judge whether the candidate index can improve the query's performance. If a slow query contains N candidate indexes, then N model predictions are required to complete the index recommendation for that query.
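The binary classification step can be sketched as below. The article uses XGBoost; to keep the sketch self-contained we stand in scikit-learn's `GradientBoostingClassifier`, and the feature vectors and labeling rule are synthetic assumptions, not Meituan's real training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy training data: each row is the feature vector of one candidate index
# (statement features + statistical features); the label says whether
# building that index improved the query.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic "index helps" rule

# Stand-in for the XGBoost binary classifier described in the article.
model = GradientBoostingClassifier().fit(X, y)

# Predict whether one candidate index would improve its slow query.
candidate = np.array([[0.9, 0.8, 0.1, 0.2, 0.3, 0.4]])
print(model.predict(candidate)[0])
```

In the real system there is one such model per index width (single-, two-, and three-column), and each candidate index of a slow query is scored by the model matching its width.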

3 Overall architecture

The overall architecture of AI+data-driven index recommendation is shown in the figure below, which is mainly divided into two parts: model training and model deployment.

3.1 Model training

As mentioned above, we collect the DAS platform's daily index recommendation data from cost-based slow-query optimization suggestions (including slow queries and recommended indexes verified to be effective) as training data. We generate single-column, two-column, and three-column candidate indexes for each query, construct a feature vector for each candidate index through feature engineering, and label the feature vectors using the index data. The single-column, two-column, and three-column feature vectors are then used to train the single-column, two-column, and three-column index recommendation models, respectively.

3.2 Model Deployment

For slow queries that need recommended indexes, we likewise generate candidate indexes and construct feature vectors. Next, we use the classification models to predict the labels of the feature vectors, i.e., to predict which candidate indexes are effective. Then we create the effective indexes predicted by the model on the sample database and observe, by actually executing the query, whether its performance improves after the indexes are built. Only when query performance really improves do we recommend the index to the user.

4 Modeling process

4.1 Generate candidate indexes

We extract the columns that appear in aggregate functions and in WHERE, JOIN, ORDER BY, and GROUP BY clauses as single-column candidate indexes, and permute and combine them to generate two-column and three-column candidate indexes. At the same time, we retrieve the existing indexes of the tables involved in the query and delete them from the candidate set. This step follows the leftmost-prefix principle of indexes: if the index Idx(col1, col2) already exists, both candidates (col1) and (col1, col2) are removed from the candidate set.
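The candidate generation and leftmost-prefix pruning described above can be sketched as follows; the function name and inputs are illustrative assumptions (in the real system the columns come from parsing the SQL statement).

```python
from itertools import permutations

def generate_candidates(query_columns, existing_indexes, max_width=3):
    """Build 1- to 3-column candidate indexes from the columns referenced in
    the query, then drop any candidate that is a leftmost prefix of an
    existing index (it would be redundant)."""
    candidates = []
    for width in range(1, max_width + 1):
        candidates.extend(permutations(query_columns, width))
    # Every leftmost prefix of an existing index is already covered by it.
    prefixes = {idx[:n] for idx in existing_indexes for n in range(1, len(idx) + 1)}
    return [c for c in candidates if c not in prefixes]

cands = generate_candidates(["col1", "col2"], existing_indexes=[("col1", "col2")])
print(cands)  # ('col1',) and ('col1', 'col2') are pruned
```

With the existing index Idx(col1, col2), only (col2) and (col2, col1) survive as candidates, matching the example in the text.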

4.2 Feature Engineering

The feature vector of a candidate index consists of two parts: statement features and statistical features. Statement features describe the positions of the candidate index columns in the query (using one-hot encoding), and statistical features describe statistical information about the candidate index columns, such as the table's row count, cardinality, and selectivity. These are important indicators for judging whether to build an index on the candidate columns.

The following table takes a single-column candidate index (col1) as an example to show some of its important features and their meanings:

The features of a two-column candidate index (col1, col2) are formed by concatenating the features of the single-column candidates (col1) and (col2). In addition, we compute the joint cardinality of col1 and col2 as a feature of the two-column candidate index (col1, col2) to describe its statistics more fully. The features of a three-column candidate index (col1, col2, col3) are built the same way. After generating a query's feature vectors, we label them using the indexes actually used by the query.
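The feature construction above can be sketched as follows. The exact feature layout (which positions are one-hot encoded, which statistics are used) is an assumption for illustration; only the splicing scheme comes from the text.

```python
def single_col_features(position_onehot, stats):
    """position_onehot: where the column appears (e.g. WHERE / JOIN /
    ORDER BY / GROUP BY); stats: e.g. [row_count, cardinality, selectivity]."""
    return position_onehot + stats

def two_col_features(feat1, feat2, joint_cardinality):
    # Features of (col1, col2) = concatenation of the two single-column
    # feature vectors, plus the joint cardinality of the column pair.
    return feat1 + feat2 + [joint_cardinality]

f1 = single_col_features([1, 0, 0, 0], [10000, 120, 0.012])
f2 = single_col_features([0, 1, 0, 0], [10000, 8, 0.0008])
print(len(two_col_features(f1, f2, joint_cardinality=900)))  # 15
```

Three-column features extend the same pattern: three single-column vectors concatenated, plus the joint cardinality of all three columns.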

4.3 Modeling example

The figure below takes query q_1 as an example to show how feature vectors are generated and labeled for a query in the training set. Query q_1 involves two tables, the customer table and the warehouse table. Four columns of the customer table (c_w_id, c_id, c_d_id, and c_last) participate in the query, so four single-column feature vectors are generated; only the w_id column of the warehouse table participates in the query, so a single feature vector is generated for it. The single-column index used by query q_1 is Idx(w_id), so the feature vector of the candidate (w_id) is labeled as a positive sample, and the remaining feature vectors are labeled as negative samples.

Next, we permute and combine the single-column candidate indexes to generate multi-column candidate indexes and their feature vectors. Since the only multi-column index used by query q_1 is the three-column index Idx(c_d_id, c_id, c_last), we skip generating two-column candidate indexes and only generate three-column ones. This is because we label feature vectors based on the indexes the query actually used: if the query uses no two-column index, all generated two-column feature vectors would be negative samples, which could unbalance the positive and negative samples in the training set.

Finally, based on the three-column index used by the query, we label the feature vector of the three-column candidate (c_d_id, c_id, c_last) as a positive sample. This is the whole process of generating and labeling feature vectors for query q_1: it contributes five samples (one positive, four negative) to the training set of the single-column model, and six samples (one positive, five negative) to the training set of the three-column model.
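The labeling rule in this example can be sketched as a one-liner; candidates and used indexes are represented as column tuples, an assumed encoding for illustration.

```python
def label_samples(candidates, used_indexes):
    """A candidate's feature vector is a positive sample iff the query
    actually used exactly that index; all other candidates are negatives."""
    return {cand: int(cand in used_indexes) for cand in candidates}

# Single-column candidates of query q_1; only Idx(w_id) was actually used.
labels = label_samples(
    [("c_w_id",), ("c_id",), ("c_d_id",), ("c_last",), ("w_id",)],
    used_indexes={("w_id",)},
)
print(labels)  # one positive sample, four negative samples
```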

4.4 Model Prediction and Index Evaluation

When recommending an index for a slow query, we generate all single-column, two-column, and three-column candidate indexes for the query and construct feature vectors through the feature engineering described above. Then we feed the feature vectors to the corresponding classification models and, from the prediction results of the three models, select the candidate index with the highest predicted probability (a single-column, two-column, or three-column index) as the index recommended by the model.
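The selection step across the three models reduces to an argmax over pooled probabilities; the scores below are made-up numbers for illustration.

```python
def pick_best_candidate(scored_candidates):
    """scored_candidates: {candidate_index: predicted probability}, pooled
    from the single-, two- and three-column models. Keep the candidate the
    models are most confident about."""
    return max(scored_candidates, key=scored_candidates.get)

# Hypothetical probabilities from the three width-specific models.
scores = {("a",): 0.41, ("a", "b"): 0.87, ("a", "b", "c"): 0.63}
print(pick_best_candidate(scores))  # ('a', 'b')
```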

Although recommending more indexes makes it more likely that a slow query's performance improves, some indexes recommended by the model may be ineffective, and the storage overhead and index-maintenance overhead of such ineffective indexes cannot be ignored. It is therefore unreasonable to recommend every model-suggested index directly to users. For this reason, before recommending an index to the user, we first create the indexes suggested by the three classification models on the sampling database for verification. The sampling database is a miniature version of the online database that contains a subset of its data. On it, we observe whether the query's execution time improves after the recommended indexes are built; only if it does do we recommend the index or indexes the query actually used to the user as index suggestions.
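The verification loop can be sketched as below. The callable interfaces and the speedup threshold are hypothetical; the text only says the index is recommended when execution time really improves on the sampling database.

```python
def verify_on_sample_db(run_query, create_index, drop_index, index, min_speedup=1.2):
    """Recommend the index only if actually executing the query on the
    sampling database shows a real improvement. min_speedup is an assumed
    threshold, not a value from the article."""
    before = run_query()          # execution time without the index
    create_index(index)
    try:
        after = run_query()       # execution time with the index
    finally:
        drop_index(index)         # leave the sampling database unchanged
    return before / after >= min_speedup

# Stubbed timings: 300 ms before the index, 90 ms after.
timings = iter([300.0, 90.0])
ok = verify_on_sample_db(lambda: next(timings), lambda i: None, lambda i: None, ("col1",))
print(ok)  # True
```

Dropping the index in a `finally` block keeps the sampling database clean even if the verification run fails partway through.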

5 Project operation status

As mentioned above, the Meituan DAS platform currently uses the cost-based method and the AI model in parallel to recommend indexes for slow queries. In certain scenarios, the AI model makes up for indexes that the cost-based method missed or selected wrongly. In March alone, on top of the indexes recommended by the cost-based method, an additional 12.16% of the indexes recommended by the AI model were adopted by users.

The improvement these additional indexes brought to queries is shown in the figure above: the upper part shows the number of optimized query executions, and the lower part shows the execution time after using the recommended indexes and the execution time saved. In total, these indexes optimized about 5.2 billion query executions and reduced execution time by 4,632 hours.

6 Future plans

At present, large-model technology (such as GPT-4) has gained more and more recognition and can be applied to tasks in almost every field. We plan to try fine-tuning an open-source large language model (such as Google's open-source T5 model) to solve the index recommendation problem: input a slow query and let the model generate index suggestions for it.

When the recommended indexes cannot improve a slow query, we can also provide textual suggestions to help users optimize the SQL, such as avoiding the return of unnecessary columns and using JOINs instead of subqueries.

7 Authors

Peng Gan, an engineer of Meituan's basic R&D platform, is mainly responsible for the SQL optimization suggestion of Meituan's database autonomous service DAS.

8 Special thanks

Special thanks to Professor Cai Peng of the School of Data Science and Engineering, East China Normal University, who has published many papers at important international conferences such as VLDB, ICDE, SIGIR, and ACL. His current research directions are in-memory transaction processing and adaptive data management systems based on machine learning. This article is a concrete outcome of the scientific research cooperation between the Meituan Database R&D Center and Professor Cai Peng.

Meituan's scientific research cooperation is committed to building a bridge and platform between Meituan's technical teams and universities, research institutions, and think tanks. Relying on Meituan's rich business scenarios, data resources, and real industrial problems, it pursues open innovation, gathers talent, and focuses on fields such as robotics, artificial intelligence, big data, the Internet of Things, autonomous driving, and operations optimization, jointly exploring cutting-edge technologies and major industry questions, promoting industry-university-research cooperation, exchange, and the transformation of results, and fostering the cultivation of outstanding talent. Looking to the future, we hope to cooperate with more teachers and students from universities and research institutes. Teachers and students are welcome to email: [email protected].

9 References

  • [1] Leis V, Gubichev A, Mirchev A, et al. 2015. How good are query optimizers, really? Proc. VLDB Endow. 9, 3 (2015), 204-215.
  • [2] https://github.com/HypoPG/hypopg
  • [3] Kossmann J, Halfpap S, Jankrift M, et al. 2020. Magic mirror in my hand, which is the best in the land? an experimental evaluation of index selection algorithms. Proc. VLDB Endow. 13, 12 (2020), 2382-2395.
  • [4] Piatetsky-Shapiro G. 1983. The optimal selection of secondary indices is NP-complete. SIGMOD Record. 13, 2 (1983), 72-75.
  • [5] Zhou X, Liu L, Li W, et al. 2022. Autoindex: An incremental index management system for dynamic workloads. In ICDE. 2196-2208.

Origin blog.csdn.net/qq_34626094/article/details/130091702