Understanding the Database Statistics Model in One Article

1. Introduction

"Statistics" in a database are a collection of data that describes the tables and columns in the database. The optimizer's cost model relies on the statistics of the tables, columns, predicates, and other objects involved in a query when choosing among candidate plans, so statistics are the key to whether the cost model can select the optimal execution plan.

Statistics come in two kinds: table-level statistics (Table Level Statistics) and column-level statistics (Column Level Statistics). Not every database collects all of the statistics listed below; each collects what suits its actual needs.

1. Table-level statistics

  • Basic information about the table

  • The statistics level of the table (GLOBAL, PARTITION, or SUBPARTITION)

  • The number of rows in the table

  • The number of macroblocks occupied by the table

  • The number of microblocks occupied by the table

  • The average row length of the table

  • The time when the table's statistics were collected

  • Whether the statistics of the table are locked

2. Column-level statistics

  • Basic information of the column (including tenant_id, table_id, partition_id, column_id)

  • The statistics level of the column (GLOBAL, PARTITION, or SUBPARTITION)

  • The number of distinct values in the column (NDV, Number of Distinct Values)

  • The number of NULL values in the column

  • The maximum and minimum values of the column

  • The sampled data size of the column

  • The density of the column's histogram

  • Number of histogram buckets for the column

  • Histogram type (frequency histogram / height-balanced histogram / Top-k histogram / hybrid histogram)

With ordinary statistics, the CBO assumes that the data in the target column is evenly distributed between its minimum and maximum values, estimates predicate selectivity and result-set cardinality on that assumption, and then chooses an execution plan. In practice, of course, some data is not evenly distributed, and so-called "data skew" occurs. The execution plan generated in that case is likely to be suboptimal, and may even be outright bad.

Histograms were introduced to solve the data-skew problem. A histogram is a special type of column statistic that describes the distribution of data in a column in detail. Once a histogram has been collected, the CBO no longer assumes the column is evenly distributed; it can estimate predicate selectivity and result-set cardinality from the actual distribution and choose the correct execution plan.
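To make the uniformity assumption concrete, here is a minimal sketch of how a cost model might estimate selectivity when no histogram is available (the numbers and function names are illustrative, not tied to any particular database): equality predicates are estimated as 1/NDV, and range predicates as the fraction of the [min, max] interval they cover.

```python
# Selectivity estimation under the uniform-distribution assumption
# (a simplified sketch; real optimizers also handle NULLs, bound
# clamping, out-of-range constants, and more).

def eq_selectivity(ndv: int) -> float:
    """col = const: every distinct value is assumed equally likely."""
    return 1.0 / ndv if ndv > 0 else 0.0

def range_selectivity(lo: float, hi: float, const: float) -> float:
    """col < const: values are assumed spread evenly over [lo, hi]."""
    if hi == lo:
        return 1.0
    return min(max((const - lo) / (hi - lo), 0.0), 1.0)

rows = 1_000_000  # table size; column uniform over [0, 10000], NDV = 10000
print(rows * eq_selectivity(10_000))              # col = 42  -> ~100 rows
print(rows * range_selectivity(0, 10_000, 500))   # col < 500 -> ~50,000 rows
# If 90% of the rows were actually 42 (data skew), the true answer for
# col = 42 would be 900,000 rows -- off by four orders of magnitude.
```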

2. Common histogram models

1. Frequency histogram

A frequency histogram stores one record in the data dictionary for each distinct value in the column. The endpoint value records the distinct value, and the endpoint number records the total number of rows up to and including that value, so the endpoint number is cumulative. Subtracting the previous record's endpoint number from a record's endpoint number yields the number of rows for that record's endpoint value. Frequency histograms are generally suitable for tables with relatively few distinct values; they require that the number of histogram buckets >= NDV (Number of Distinct Values), i.e. the bucket count is at least the number of distinct values in the column.
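The cumulative bookkeeping is easy to see in code. Below is a minimal sketch (illustrative Python, not any database's actual implementation) that builds the (endpoint value, endpoint number) pairs and recovers a per-value row count by differencing adjacent endpoint numbers.

```python
from collections import Counter

def build_frequency_histogram(values):
    """One (endpoint_value, endpoint_number) pair per distinct value;
    endpoint_number is the cumulative row count described above."""
    hist, cumulative = [], 0
    for v, cnt in sorted(Counter(values).items()):
        cumulative += cnt
        hist.append((v, cumulative))
    return hist

def rows_for_value(hist, v):
    """Exact count = this endpoint_number minus the previous one."""
    prev = 0
    for endpoint_value, endpoint_number in hist:
        if endpoint_value == v:
            return endpoint_number - prev
        prev = endpoint_number
    return 0

data = [1, 1, 1, 2, 3, 3, 5, 5, 5, 5]
h = build_frequency_histogram(data)   # [(1, 3), (2, 4), (3, 6), (5, 10)]
print(rows_for_value(h, 5))           # 4
```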

2. Height-balanced histogram

For this type of histogram, Oracle first sorts all rows of the target table by the target column in ascending order, then divides the total row count by the number of buckets to be used, which determines how many sorted rows each bucket describes. The endpoint number records the bucket number, running from 0 to N. The endpoint value of bucket 0 records the minimum value of the target column, while the endpoint value of every other bucket records the maximum value of the target column up to that bucket.

To save space, Oracle merges adjacent records that differ only in endpoint number but share the same endpoint value before storing them in the data dictionary. If the buckets with endpoint number = 2 and endpoint number = 3 both have endpoint value P, only the record with endpoint number = 3 and endpoint value = P is stored. As a result, the endpoint numbers in a height-balanced histogram's data dictionary entries may be discontinuous.

A value whose records are merged this way is what Oracle calls a popular value. Clearly, the larger the gap between a popular value's endpoint number and the endpoint number of the preceding record, the larger the share of the target column that the popular value occupies, and the larger its estimated cardinality; that is precisely why it is called a popular value.
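The bucketing, merging, and popular-value detection can be sketched as follows (a simplified model of the scheme just described; the exact boundary arithmetic varies by implementation):

```python
def build_height_balanced(values, num_buckets):
    """Bucket 0 records the minimum; bucket b records the max value seen
    through bucket b; adjacent entries with the same endpoint_value are
    merged, keeping only the highest endpoint_number."""
    s = sorted(values)
    n = len(s)
    entries = [(0, s[0])]                    # (endpoint_number, endpoint_value)
    for b in range(1, num_buckets + 1):
        endpoint = s[b * n // num_buckets - 1]
        if entries[-1][1] == endpoint:
            entries[-1] = (b, endpoint)      # merge duplicate endpoint values
        else:
            entries.append((b, endpoint))
    return entries

def popular_values(entries):
    """A value is popular when its endpoint_number jumps by more than 1;
    the jump size over the bucket count approximates its share of rows."""
    out, prev = [], 0
    for num, val in entries:
        if num - prev > 1:
            out.append((val, num - prev))    # (value, buckets it spans)
        prev = num
    return out

data = [1] * 50 + [2] * 5 + [3] * 5 + [4] * 40
hist = build_height_balanced(data, 10)
print(hist)                   # [(5, 1), (6, 3), (10, 4)] -- gaps at 1 and 4
print(popular_values(hist))   # [(1, 5), (4, 4)] -> ~50% and ~40% of rows
```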

3. Top-k histogram

A variant of the frequency histogram, targeting the situation where the k most frequent values cover more than a certain threshold of the data. Distinct values that occur rarely are left out of the histogram entirely.
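Below is a minimal construction sketch in the same spirit (the coverage threshold and function name are illustrative; real systems derive the threshold from the requested bucket count):

```python
from collections import Counter

def build_topk_histogram(values, k, coverage=0.95):
    """Frequency histogram restricted to the k most frequent values.
    Returns None when those values cover too little of the data and a
    different histogram type should be chosen instead."""
    top = Counter(values).most_common(k)
    if sum(c for _, c in top) / len(values) < coverage:
        return None
    hist, cumulative = [], 0
    for v, c in sorted(top):                 # rare values are simply dropped
        cumulative += c
        hist.append((v, cumulative))
    return hist

data = [7] * 900 + [8] * 80 + list(range(20))   # 20 rare straggler values
print(build_topk_histogram(data, k=2))          # [(7, 900), (8, 980)]
```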

4. Hybrid histogram

The hybrid histogram combines characteristics of the height-balanced histogram and the frequency histogram. What distinguishes it from the frequency and Top-k histograms is that one bucket may contain multiple distinct values: the collected data is split into segments according to the number of buckets, and all the data in each segment goes into the corresponding bucket, so fewer buckets can describe the distribution of a larger volume of data. The maximum value in each bucket is used as its endpoint_value, and an additional endpoint_repeat_cnt records how many times the endpoint_value occurs. This lets the optimizer obtain better selectivity estimates in certain situations.
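A sketch of this bucketing (simplified; real implementations additionally stretch bucket boundaries so that a value is the endpoint of at most one bucket, which is mimicked here):

```python
import bisect

def build_hybrid_histogram(values, num_buckets):
    """Roughly equal-height buckets; each boundary is stretched to the end
    of the run of its endpoint value, and endpoint_repeat_cnt records how
    often that endpoint occurs."""
    s = sorted(values)
    n = len(s)
    hist, start = [], 0  # (endpoint_value, cumulative_rows, endpoint_repeat_cnt)
    for b in range(1, num_buckets + 1):
        boundary = max(start, b * n // num_buckets - 1)
        v = s[boundary]
        end = bisect.bisect_right(s, v)      # stretch past duplicates of v
        hist.append((v, end, end - bisect.bisect_left(s, v)))
        start = end
        if start >= n:                       # may emit fewer buckets than asked
            break
    return hist

data = [1] * 5 + [2] + [3] * 4
print(build_hybrid_histogram(data, 2))       # [(1, 5, 5), (3, 10, 4)]
```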

Plain equal-height histograms are now rarely used in databases. Many refinements build on the equal-height histogram: adding holes (intervals where non-contiguous values exceed a certain threshold are treated as holes), and representing the highest- and second-highest-frequency values with an approximately equal-height histogram that guarantees each value appears in only one interval.

The logic for choosing among the frequency histogram, Top-k histogram, and hybrid histogram in different scenarios is as follows (see the selection-rule sketch below):

[Figure: selection logic for frequency, Top-k, and hybrid histograms]
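The commonly documented (Oracle-style) rule can be summarized in a few lines; the exact thresholds vary between systems, so treat this as a sketch:

```python
def choose_histogram_type(num_rows, value_counts, n_buckets):
    """value_counts maps each distinct value to its row count. Frequency
    when every distinct value gets its own bucket; Top-k when the
    n_buckets most frequent values cover at least (1 - 1/n_buckets) of
    the rows; hybrid otherwise."""
    if len(value_counts) <= n_buckets:
        return "frequency"
    top = sorted(value_counts.values(), reverse=True)[:n_buckets]
    if sum(top) / num_rows >= 1 - 1 / n_buckets:
        return "top-k"
    return "hybrid"

print(choose_histogram_type(1000, {i: 4 for i in range(250)}, 254))  # frequency
print(choose_histogram_type(1000, {0: 990, **{i: 1 for i in range(1, 11)}}, 8))  # top-k
```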

When the data distribution of a column with a histogram is skewed, the histogram yields fairly accurate size predictions for single-table result sets, but it cannot accurately judge the size of the result set of a join query.

For join queries there is another way to estimate result-set size: the Count-Min Sketch. It is a data structure that can handle equality queries, join size estimation, and similar tasks while providing strong accuracy guarantees. Since its introduction in 2003 in "An improved data stream summary: The count-min sketch and its applications", it has been widely adopted because it is simple to build and to use.

A Count-Min Sketch maintains a d*w array of counters. For each value, d independent hash functions each map it to one column of its row, and the counters at those d positions are updated accordingly.

Likewise, to query how many times a value appears, the same d hash functions locate the mapped position in each row, and the minimum of those d counters is taken as the estimate.
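A compact sketch of the structure (the hash construction here is illustrative):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters, one hash function per row. query() returns
    the minimum counter, which can over-estimate the true count (hash
    collisions only add) but never under-estimate it."""
    def __init__(self, d=4, w=1024):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._index(r, item)] += count

    def query(self, item):
        return min(self.table[r][self._index(r, item)] for r in range(self.d))

cms = CountMinSketch()
for v in ["a"] * 100 + ["b"] * 5:
    cms.update(v)
print(cms.query("a"))   # >= 100, usually exactly 100
```

For join size estimation, two columns' sketches built with the same hash functions can be combined: the minimum over rows of the inner product of corresponding rows gives an estimate of the join cardinality.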

The column-level statistics provided by KaiwuDB include the row count, the number of distinct values, the number of NULL values, and so on. Histogram statistics can also be provided when the data in a column is unevenly distributed. KaiwuDB's histogram is likewise an approximately equal-height histogram. When a column has 200 or fewer distinct values, it provides an effect similar to a frequency histogram, and it can also represent high-frequency values; however, the current HyperLogLog-based algorithm is less accurate at estimating the number of distinct values within an interval.

[Figure: histogram with 200 or fewer buckets]

[Figure: histogram with more than 200 buckets]

Insert 100,000 random values between 1 and 10,000, raise the duplicate counts of 201, 202, 203, 204, 205, and 206 to above 1,000 each, delete the values between 400 and 2,000, and then create statistics and histograms.

3. Summary

The cost-based optimization model (CBO) is currently the mainstream optimization model for relational databases, and statistics are the foundation of the CBO. To a large extent, the accuracy of the statistics determines whether the CBO can produce the optimal execution plan. Statistics collection centers on table-level and column-level statistics. When the data on a column is unevenly distributed, the accuracy of the histogram information becomes especially important; the main approaches today are height-balanced histograms, frequency histograms, and their variants.

In addition, although accurate statistics are the key to generating the optimal execution plan, computing statistics also consumes system resources. Generally speaking, under the same strategy, the more data that participates in the computation, the more accurately the statistics describe the true distribution of the data in the table, at the cost of more resources and longer computation time.

Usually, statistics are collected by sampling the data in the table, or collected incrementally, to make collection more efficient. Once the data participating in the computation is determined, a rough estimate decides which strategy to use for collection, so that the data distribution can be described more effectively.

For tables whose data changes frequently, the validity of statistics is the central concern: how to effectively judge whether statistics are still valid, and how to correct them as the data changes, are the hard problems in statistics research. Moreover, for some special tables in converged multi-model databases, new statistics strategies can be tried based on the characteristics of the tables themselves. In time-series tables, for example, the data follows certain regular patterns, and some tables even have downsampling rules attached; new strategies can generate or replace statistics based on those downsampling rules.

In short, for traditional relational databases, statistics collection is an essential component, and the quality of the statistics directly affects the quality of execution plans. For the new kinds of tables in converged databases, the data distribution of a table can be estimated with new strategies to achieve the effect of statistics.

 
