[MySQL] How to quickly look up email addresses and ID card numbers?

Foreword

Whether in company projects or personal ones, most of us have worked with email features, and some scenarios require looking up a specific email address: sending mail to it, for example, or finding the user account bound to it. Email addresses follow no pattern, though, so a direct query basically has to scan the whole table. For a database with tens of millions of users, a full table scan is out of the question, so we need a way to solve this problem.

Ordinary index and prefix index

The quickest optimization that comes to mind is to create an index on the email field.
There are two options: an ordinary index on the whole column, or a prefix index.
An ordinary index uses the full value of the field as the index key, while a prefix index uses only the first x characters of the value, which saves some space.
For example, suppose an email address begins with 12345, say 12345678@example.com (an illustrative address). With an ordinary index, the index entry is:
index segment: 12345678@example.com data segment: primary key id
With a prefix index of length 5, the index entry is:
index segment: 12345 data segment: primary key id
The two indexes differ not only in length: the former stores the complete value, while the latter stores only a fuzzy prefix.
With the ordinary index, every index entry that matches the condition is a real match, so the query can add it to the result set and move on to the next entry until one no longer matches.
With the prefix index, a matching entry is only a candidate: the query has to go back to the table, read the row, and check whether the full email address matches. If it does not, the row is discarded and the search continues with the next entry.
And because only the prefix is searched, many rows may share the same prefix while very few of them actually satisfy the query, which increases the number of rows scanned.
In short, a prefix index takes up less space but may add extra scans.
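
As a minimal sketch of the two options, the statements below create each kind of index on the sys_user table used later in this article. The index names idx_email and idx_email_prefix are made up for illustration, and in practice you would normally keep only one of the two.

```sql
-- Ordinary index: each index entry stores the full email value.
ALTER TABLE sys_user ADD INDEX idx_email (email);

-- Prefix index: each index entry stores only the first 5 characters of email.
-- (Index names here are hypothetical; 5 is just the length from the example above.)
ALTER TABLE sys_user ADD INDEX idx_email_prefix (email(5));
```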

Choosing a reasonable prefix length is therefore very important: it keeps space usage reasonable while keeping the number of extra scans low.
When creating an index, what we care about is how selective the prefix is: the higher the selectivity, the fewer duplicate key values. So we can judge how long the prefix should be by counting how many distinct values the index would contain.

How to formulate the prefix length?

So how to judge whether the prefix length is reasonable?
I use the following SQL statement to count how many distinct values the email field has:

select count(distinct email) as L from sys_user;

Then try prefixes of different lengths and compare the results against this value. For example, to evaluate prefix lengths of 4 to 7 characters, we can use this statement:

select count(distinct left(email,4)) as L4,
       count(distinct left(email,5)) as L5,
       count(distinct left(email,6)) as L6,
       count(distinct left(email,7)) as L7
from sys_user;

Of course, a prefix index will probably lose some selectivity, so decide in advance on an acceptable loss ratio, such as 5%. Then, among the returned L4 to L7, look for values that are not less than L * 95%. If both L6 and L7 qualify, choose the shorter prefix length, 6.
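
The comparison against L * 95% can also be done in a single query. The following is only a sketch: the r4 to r7 aliases and the index name idx_email_6 are made up, and the length 6 in the ALTER TABLE statement simply follows the assumption above that L6 is the shortest prefix that passes.

```sql
-- Ratio of distinct prefixes to distinct full values, per candidate length.
-- A ratio of at least 0.95 means the selectivity loss stays within 5%.
SELECT COUNT(DISTINCT LEFT(email, 4)) / COUNT(DISTINCT email) AS r4,
       COUNT(DISTINCT LEFT(email, 5)) / COUNT(DISTINCT email) AS r5,
       COUNT(DISTINCT LEFT(email, 6)) / COUNT(DISTINCT email) AS r6,
       COUNT(DISTINCT LEFT(email, 7)) / COUNT(DISTINCT email) AS r7
FROM sys_user;

-- Assuming r6 is the first ratio that reaches 0.95, index the first 6 characters.
ALTER TABLE sys_user ADD INDEX idx_email_6 (email(6));
```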

Impact on Covering Indexes

A prefix index may not only increase the number of rows scanned; it also prevents the index from being used as a covering index.
For example, if we create an ordinary index and the query is:
select id, email from user where email = 'xxxx';
then, since the index contains the complete email and its data segment is id, the query can be answered from the index alone. This is a covering index, and it saves the trip back to the table.
With a prefix index, the same query still has to go back to the table to check the full value.
Even if the prefix length is set to the full length of the email column, the query still goes back to the table, because the system cannot tell whether the prefix definition has truncated the complete value.
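
One way to check this on your own data is to look at the execution plan. The sketch below reuses the query from this section; it assumes the user table has an index on email, and the exact plan output depends on your MySQL version and schema.

```sql
-- With an ordinary index on email, the Extra column of the plan is expected
-- to contain "Using index", meaning the query is served from the index alone.
-- With a prefix index on email, "Using index" should not appear, because the
-- full email value has to be read from the table row.
EXPLAIN SELECT id, email FROM user WHERE email = 'xxxx';
```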

Other methods

For email addresses, the prefix index is our basic solution. But what about ID card numbers, whose prefixes are much less selective? People registered in the same province and district share the same leading digits; the differences are mostly in the birth date and the last 4 digits.
So for ID numbers we can use the reverse order + prefix method.
That is, when storing an ID number, first reverse it with the reverse function and store the reversed string, then build a prefix index on its first 4 characters.
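
The reverse order + prefix method might look like the sketch below. Everything named here is an assumption for illustration: the id_card_demo table, the id_card_rev column, the index name, and the ID number, which is a made-up placeholder.

```sql
-- Hypothetical table: the ID number is stored reversed, with a prefix
-- index on the first 4 characters of the reversed string.
CREATE TABLE id_card_demo (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id_card_rev VARCHAR(18)     NOT NULL,
    KEY idx_id_card_rev (id_card_rev(4))
);

-- Reverse the ID number when writing it.
INSERT INTO id_card_demo (id_card_rev)
VALUES (REVERSE('110101199001011234'));

-- Reverse the search value the same way when querying;
-- reversing the stored value again restores the original ID number.
SELECT id, REVERSE(id_card_rev) AS id_card
FROM id_card_demo
WHERE id_card_rev = REVERSE('110101199001011234');
```

Whether 4 characters is selective enough can be checked with the same count(distinct left(...)) comparison used for the email field.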

The second method is to use the crc32 function. Applying crc32 to the ID number gives a hash value that rarely repeats, and we store it in a separate indexed field. When querying, we look up this hash value through the index. Because two different ID numbers can still produce the same hash, the query must also add an AND condition that compares the full ID number exactly. Since the crc32 hash field is the index, the number of rows the query has to examine becomes much smaller.
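
A possible layout for the crc32 method is sketched below. The table, column, and index names are hypothetical, and the ID number is a made-up placeholder. CRC32() returns a 32-bit unsigned value, so an INT UNSIGNED column is large enough to hold it.

```sql
-- Hypothetical table: the full ID number plus its CRC32 hash,
-- with an ordinary index on the hash column only.
CREATE TABLE id_card_crc_demo (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id_card     VARCHAR(18)     NOT NULL,
    id_card_crc INT UNSIGNED    NOT NULL,
    KEY idx_id_card_crc (id_card_crc)
);

-- Compute the hash when writing the row.
INSERT INTO id_card_crc_demo (id_card, id_card_crc)
VALUES ('110101199001011234', CRC32('110101199001011234'));

-- The hash narrows the search through the index; the exact comparison on
-- id_card filters out rows whose hash happens to collide.
SELECT id, id_card
FROM id_card_crc_demo
WHERE id_card_crc = CRC32('110101199001011234')
  AND id_card = '110101199001011234';
```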

Origin blog.csdn.net/Zhangsama1/article/details/131824891