MySQL advanced articles - covering index, prefix index, index pushdown, SQL optimization, primary key design


Table of contents

8. Prioritize covering indexes

8.1 What is a covering index?

8.1.1 Concept

8.1.2 In the case of a covering index, the "not equal to" index takes effect

8.1.3 In the case of a covering index, the left fuzzy query index takes effect

8.2 Pros and Cons of Covering Indexes

9. Add an index to a string

9.1 Prefix index

9.2 Prefix index cannot use covering index

10. Index pushdown

10.1 Introduction

10.2 Conditions of use of ICP

10.3 ICP ON/OFF

10.4 ICP Use Cases

10.5 Performance comparison between enabling and disabling ICP

11. Ordinary index vs unique index

11.1 Approximate query performance

11.2 Ordinary index update performance is higher, change buffer

11.3 Use scenarios of change buffer

12. SQL optimization

12.1 Difference between EXISTS and IN

12.2 Recommend COUNT(*) or COUNT(1)

12.3 Suggest SELECT <field list> instead of SELECT *

12.4 Effect of LIMIT 1 on Optimization

12.5 Use COMMIT more often

13. Primary key design ideas

13.1 Disadvantages of auto-increment primary key

13.2 Try not to use business fields as primary keys

13.3 Primary key design of Taobao order number

13.4 Recommended primary key design

13.4.1 Core and non-core business primary key strategy selection

13.4.2 Characteristics of UUID

13.4.3 MySQL 8.0 Primary Key Scheme: Ordered UUIDs

13.4.4 Primary key scheme before MySQL 8.0: manual assignment

13.4.5 Snowflake Algorithm


8. Prioritize covering indexes

8.1 What is a covering index?

8.1.1 Concept

Covering index: an index that already contains all the data needed to satisfy the query results, so no extra operations such as going back to the table (回表) are required.

Indexes are one way to find rows efficiently, but a database can also use an index to retrieve a column's data directly, so it does not have to read the entire row. After all, an index's leaf nodes store the data they index; when the desired data can be obtained by reading the index alone, there is no need to read the row.

A covering index is a form of non-clustered (secondary) index that includes all columns used in the SELECT, JOIN and WHERE clauses of a query (that is, the indexed fields are exactly the fields involved in the query conditions). Simply put, the index columns plus the primary key contain all the columns selected between SELECT and FROM.

8.1.2 In the case of a covering index, the "not equal to" index takes effect

Without a covering index, a "not equal to" condition makes the index unusable:

Without a covering index, using "not equal to" causes the optimizer to abandon the index. If the index were used, every leaf node of the secondary (non-clustered) B+ tree would have to be traversed in turn (O(n)), and each matching record would still require a trip back to the table. That is less efficient than a full table scan, so the query optimizer chooses the full table scan.

CREATE INDEX idx_age_name ON student(age, NAME);
# Selecting all fields with "not equal to": the index is not used
EXPLAIN SELECT * FROM student WHERE age <> 20;

In the case of a covering index, the "not equal to" index takes effect:

With a covering index, the two selected fields are both covered by the composite index, so performance is higher. All leaf nodes of the secondary B+ tree still have to be traversed in turn (still O(n)), but no trip back to the table is needed, so overall it is more efficient than scanning without the index, and the query optimizer uses the index again.

CREATE INDEX idx_age_name ON student(age, NAME);
# The two selected fields are exactly covered by the composite index "idx_age_name": the index is used
EXPLAIN SELECT age,name FROM student WHERE age <> 20;

8.1.3 In the case of a covering index, the left fuzzy query index takes effect

Without a covering index, a left fuzzy query (leading wildcard) makes the index unusable:

# Without a covering index, a leading-wildcard LIKE makes the index unusable
CREATE INDEX idx_age_name ON student(age, NAME);
EXPLAIN SELECT * FROM student WHERE NAME LIKE '%abc';

In the case of a covering index, the left fuzzy query index takes effect

Again the main reason is that traversing the leaf nodes of the secondary B+ tree without returning to the table is cheaper than a full table scan, so the query optimizer chooses the more efficient plan.

# With a covering index, a leading-wildcard LIKE can still use the index
CREATE INDEX idx_age_name ON student(age, NAME);
EXPLAIN SELECT id,age,NAME FROM student WHERE NAME LIKE '%abc';

Both queries above use the declared index, but the following one does not: the select list adds classId, which is not covered by the index, so the index is not used:

CREATE INDEX idx_age_name ON student(age, NAME);
EXPLAIN SELECT id,age,NAME,classId FROM student WHERE NAME LIKE '%abc';

8.2 Pros and Cons of Covering Indexes

Benefits:

1. Avoids returning to the table (a second lookup through the primary key on InnoDB tables)

InnoDB stores data in clustered-index order. In InnoDB, a secondary index stores the row's primary key value in its leaf nodes. When querying through a secondary index, after locating the key value you still need a second lookup through the primary key index to fetch the data you actually want.

With a covering index, the required data is available in the secondary index entries themselves, avoiding the second lookup on the primary key, reducing I/O operations, and improving query efficiency.

2. Turns random I/O into sequential I/O, speeding up queries

Because an index is stored in key order, an I/O-intensive range query through a covering index avoids reading each row from a random position on disk: the random read I/O of fetching rows becomes the sequential I/O of an index scan.

Because a covering index reduces the number of tree searches and can significantly improve query performance, using one is a common optimization technique.

Disadvantages:

This calls for case-by-case analysis:

Indexed fields always carry a maintenance cost, so there is a trade-off in deciding how many indexes to create in support of covering queries. Making that call is the job of the business DBA or the business data architect.

9. Add an index to a string

9.1 Prefix index

There is a teacher table, the table definition is as follows:

create table teacher(
  ID bigint unsigned primary key,
  email varchar(64),
  ...
)engine=innodb;

The lecturer needs to log in with an email address, so a statement similar to this must appear in the business code:

mysql> select col1, col2 from teacher where email='xxx';

If there is no index on the email field, then this statement can only do a full table scan .

MySQL supports prefix indexes. By default, if you create an index without specifying a prefix length, the index will contain the entire string.

mysql> alter table teacher add index index1(email);
# or
mysql> alter table teacher add index index2(email(6));

What is the difference between these two different definitions in terms of data structure and storage? The figure below is a schematic diagram of these two indexes.


If index1 is used (the index contains the entire string), the order of execution is as follows:

  1. Find the record that satisfies the index value of '[email protected]' from the index tree of index1, and obtain the value of ID2;
  2. Go back to the table to find the row whose primary key value is ID2 on the primary key, judge that the value of email is correct, and add this row record to the result set;
  3. Take the next record at the position just found on the index tree of index1, and find that the condition of email='[email protected]' is no longer satisfied, and the loop ends.

In this process, it is only necessary to retrieve data once from the primary key index, so the system considers that only one row has been scanned.

If index2 is used (the index contains the string prefix email(6)), the execution sequence is as follows:

  1. Find the records on the index tree of index2 whose index value is 'zhangs'; the first one found is ID1;
  2. Go back to the table, find the row whose primary key is ID1 on the primary key index, see that its email value is not '[email protected]', and discard this row;
  3. Take the next record at the same position on index2, find that it is still 'zhangs', take out ID2, go back to the table on the ID index, fetch the whole row, find that this time the value matches, and add this row to the result set;
  4. Repeat the previous step until the value on index2 is no longer 'zhangs', and the loop ends.

In other words, a prefix index with a well-chosen length can save space without adding much extra query cost. Selectivity (degree of discrimination) was mentioned earlier: the higher the selectivity, the better, because higher selectivity means fewer duplicate key values.
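A common way to choose the prefix length is to compare the number of distinct prefixes against the number of distinct full values. A sketch against the teacher table above (the candidate lengths 4 to 7 are arbitrary picks):

```sql
-- Pick the smallest prefix length whose distinct count stays close
-- to the distinct count of the full column.
SELECT
  COUNT(DISTINCT email)          AS full_cnt,
  COUNT(DISTINCT LEFT(email,4))  AS len4,
  COUNT(DISTINCT LEFT(email,5))  AS len5,
  COUNT(DISTINCT LEFT(email,6))  AS len6,
  COUNT(DISTINCT LEFT(email,7))  AS len7
FROM teacher;
```

If, say, len6 already reaches about 95% of full_cnt, then email(6) is a reasonable prefix length.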

9.2 Prefix index cannot use covering index

Because what the secondary (non-clustered) index tree yields is only the prefix plus the id, and the prefix is not the complete value, MySQL must always return to the clustered index tree to verify the full column.

Therefore a prefix index cannot use the covering-index optimization for query performance, which is another factor to consider when deciding whether to use a prefix index.
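This can be observed with EXPLAIN (a sketch using the teacher table from section 9.1; with the full-column index1 the Extra column shows Using index, while with the prefix index2 it does not):

```sql
-- Even though only indexed columns are selected, the prefix index cannot
-- cover the query: the stored prefix is incomplete, so every match must
-- go back to the clustered index to verify the full email value.
EXPLAIN SELECT id, email FROM teacher WHERE email = 'xxx';
```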

10. Index pushdown

10.1 Introduction

Index Condition Pushdown (ICP) is a feature introduced in MySQL 5.6. It is an optimization that uses the index to filter data at the storage engine layer.

  • Without ICP: when one field of a composite index is matched with a fuzzy condition (not a leading wildcard), the fields after it cannot be checked directly in the index; their conditions can only be evaluated after going back to the table.
  • With ICP enabled: after the fuzzy-matched field (not a leading wildcard) is checked, the following index fields can be checked directly in the index as well. Only rows that survive this filtering trigger a trip back to the table, where conditions on fields not in the composite index are evaluated. The key optimization is filtering before the table lookup, which reduces the number of lookups. Typical application: a non-leading-wildcard fuzzy match leaves the subsequent index fields unordered, so normally they would have to be checked after going back to the table; with index pushdown, they are checked directly in the composite index tree instead.

Without ICP, the storage engine traverses the index to locate rows in the base table and returns them to the MySQL server, which then evaluates the remaining WHERE conditions.
With ICP enabled, if part of the WHERE condition can be evaluated using only columns in the index, the MySQL server pushes that part of the WHERE condition down to the storage engine. The storage engine then filters using the index entries and reads a row from the table only when the condition is satisfied.

Benefits: ICP can reduce the number of times the storage engine must access the base table and the number of times the MySQL server must access the storage engine. However, the acceleration effect of ICP depends on the proportion of data filtered by ICP in the storage engine. 

Example:

A composite-index query without index pushdown: given an index (name, age) and the query name like 'z%' and age=?, the fuzzy match leaves age unordered within the matching range. Scanning the composite index tree checks only name; age cannot be checked there and must be verified after going back to the table.

The same query with index pushdown: with index (name, age), while scanning the composite index tree MySQL checks not only name but also age, and only the filtered rows go back to the table (for example, to check a column such as address that is not in the index).

CREATE INDEX idx_name_age ON student(name,age);
# Index not used: without a covering index, a leading wildcard makes the index unusable
EXPLAIN SELECT * FROM student WHERE name like '%bc%' AND age=30;
# Index used: with index condition pushdown (introduced in MySQL 5.6), both name and age in the WHERE clause are in the composite index, so rows can be filtered in the index without going back to the table
EXPLAIN SELECT * FROM student WHERE `name` like 'bc%' AND age=30;
# Index used: name uses the index, age is filtered by index pushdown, and classid is not in the composite index, so a table lookup is still needed
EXPLAIN SELECT * FROM student WHERE `name` like 'bc%' AND age=30 AND classid=2;

Benefits: in some scenarios, ICP can greatly reduce the number of table lookups and improve performance.

10.2 Conditions of use of ICP

  • The access type of the table is range, ref, eq_ref or ref_or_null.
  • Storage engine: ICP can be used with the InnoDB and MyISAM storage engines.
  • A secondary index is required: for InnoDB tables, ICP applies only to secondary indexes. The goal of ICP is to reduce the number of full-row reads and thereby reduce I/O operations.
  • Must not be a covering index: when a query uses a covering index, ICP is not applied, because in that case it would not reduce I/O.
  • Conditions on correlated subqueries cannot use ICP.
  • Requires MySQL 5.6 or later: ICP was introduced in 5.6 and is enabled by default; earlier versions do not support index pushdown.
  • The WHERE field must be an index column: not every WHERE condition can be filtered by ICP. If a condition's field is not among the index columns, the whole row still has to be read to the server for WHERE filtering.

10.3 ICP ON/OFF

  • Index condition pushdown is enabled by default. It can be controlled through the index_condition_pushdown flag of the optimizer_switch system variable:
# enable index condition pushdown
SET optimizer_switch = 'index_condition_pushdown=on';

# disable index condition pushdown
SET optimizer_switch = 'index_condition_pushdown=off';
  • When index condition pushdown is used, the Extra column of the EXPLAIN output shows Using index condition.

10.4 ICP Use Cases

  • Primary key index (simplified diagram)

Secondary index zip_last_first (simplified diagram, data pages and other information are omitted here)

10.5 Performance comparison between enabling and disabling ICP
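The comparison can be sketched by toggling the optimizer_switch flag and timing the same query with the query profiler. The people table and its composite index here follow the MySQL reference manual's ICP illustration and are assumptions, not the original article's data:

```sql
-- Assumed table with index (zipcode, lastname, firstname); the
-- LIKE '%...%' pattern on lastname cannot be range-matched, so it
-- is a candidate for pushdown.
SET profiling = 1;

SET optimizer_switch = 'index_condition_pushdown=off';
SELECT * FROM people WHERE zipcode = '95054' AND lastname LIKE '%etrunia%';

SET optimizer_switch = 'index_condition_pushdown=on';
SELECT * FROM people WHERE zipcode = '95054' AND lastname LIKE '%etrunia%';

-- Compare the Duration column of the two SELECTs; on a large table the
-- ICP-on run should be faster because far fewer rows go back to the table.
SHOW PROFILES;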

11. Ordinary index vs unique index

From a performance point of view, do you choose a unique index or a normal index? What is the basis for the choice?

Suppose we have a table whose primary key is ID. The table has a field k with an index on it; assume the values of field k are all distinct.

The table creation statement for this table is:

mysql> create table test(
id int primary key,
k int not null,
name varchar(16),
index (k)
)engine=InnoDB;

The (ID,k) values of rows R1~R5 in the table are (100,1), (200,2), (300,3), (500,5) and (600,6) respectively.
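For reference, the sample rows can be created like this (using R1 through R5 as name values is an assumption; the source only gives the (ID,k) pairs):

```sql
INSERT INTO test (id, k, name) VALUES
  (100, 1, 'R1'),
  (200, 2, 'R2'),
  (300, 3, 'R3'),
  (500, 5, 'R4'),
  (600, 6, 'R5');
```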

11.1 Approximate query performance

Suppose, the statement to execute the query is select id from test where k=5.

  • For a normal index, after finding the first record (5,500) that satisfies the condition, it is necessary to search for the next record until the first record that does not meet the k=5 condition is encountered.
  • For a unique index, since the index defines uniqueness, after finding the first record that meets the condition, the search will stop.

So, what is the performance gap brought about by this difference? The answer is, very little .

11.2 Ordinary index update performance is higher, change buffer

Write cache (change buffer):

When a data page needs to be updated: if the page is in memory, it is updated directly; if not, InnoDB caches the update operation in the change buffer without compromising data consistency, so the page does not have to be read from disk first. When a later query accesses the page, the page is read into memory and the operations recorded for it in the change buffer are applied to it. This guarantees the correctness of the data logic.

merge: applying the operations in the change buffer to the original data page, producing the up-to-date page, is called merge. Besides being triggered by access to the page, merges are also performed periodically by a background thread, and during a normal database shutdown.

If an update can first be recorded in the change buffer, disk reads are reduced and the statement executes noticeably faster. Moreover, since reading a page into memory occupies buffer pool space, this approach also reduces memory usage and improves memory utilization.

Updates to a unique index cannot use the change buffer; in practice only ordinary (non-unique) secondary indexes can use it.

Make a distinction:

  • Reads use the buffer pool;
  • The redo log has its own redo log buffer: updated data in the buffer pool is written to the redo log buffer, and when the transaction commits, the redo log buffer is flushed to the redo log file or the page cache according to the flush strategy.
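Whether and how the change buffer is used can be inspected and tuned through server variables (variable names per the MySQL manual):

```sql
-- Which operations may use the change buffer: all, none, inserts,
-- deletes, changes or purges. The default is 'all'.
SHOW VARIABLES LIKE 'innodb_change_buffering';

-- Maximum size of the change buffer as a percentage of the
-- buffer pool (default 25).
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';
```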

11.3 Use scenarios of change buffer

  • How to choose ordinary index and unique index? In fact, there is no difference in query capabilities between these two types of indexes . The main consideration is the impact on update performance . Therefore, it is recommended that you try to choose a common index .

  • In actual use, it will be found that the combined use of ordinary indexes and change buffers is very obvious for updating and optimizing tables with large amounts of data .

  • Situations not suited to the change buffer: if every update is immediately followed by a query of the same record, you should disable the change buffer. In other cases, the change buffer improves update performance.

  • When the transaction is committed, the change buffer operation will also be recorded in the redo log , so when the crash is recovered, the change buffer can also be retrieved.

  • Since the unique index does not use the change buffer optimization mechanism, if the business is acceptable, it is recommended to give priority to non-unique indexes from a performance perspective. But if "the business may not be guaranteed", how to deal with it?

    • First, business correctness comes first. This section discusses performance on the premise that "the business code already guarantees no duplicate writes". If the business cannot guarantee that, or requires the database to enforce the constraint, then there is no choice but to create a unique index. In that case the value of this section is an extra troubleshooting idea: if bulk inserts are slow and the memory hit rate is low, this may be why.
    • Then, in some "archive library" scenarios, you can consider using unique indexes. For example, online data only needs to be kept for half a year, and then historical data is stored in the archive library. At this point, archiving data already ensures that there are no unique key conflicts. To improve archiving efficiency, you can consider changing the unique index in the table to a common index.

12. SQL optimization

12.1 Difference between EXISTS and IN

question:

I don't quite understand when to use EXISTS and when to use IN. Is the selection criterion whether the table's index can be used?

answer:
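A common rule of thumb (a sketch, not necessarily the original author's answer) is "let the small table drive the big table":

```sql
-- IN: the subquery result is produced first, then the outer table is probed.
-- Prefer IN when table B (the subquery side) is the smaller one:
SELECT * FROM A WHERE cc IN (SELECT cc FROM B);

-- EXISTS: the outer table is scanned and the subquery is probed per row.
-- Prefer EXISTS when table A (the outer side) is the smaller one and
-- B.cc is indexed, so each probe is cheap:
SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.cc = A.cc);
```

Either way, the goal is that the loop runs over the smaller row set while the larger side is accessed through an index.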

12.2 Recommend COUNT(*) or COUNT(1)

Prefer COUNT(1) or COUNT(*) for counting rows: for COUNT(1) and COUNT(*), the query optimizer automatically picks the indexed secondary index tree that occupies the least space to do the counting, and only falls back to the clustered index tree (which occupies far more space) when no secondary index is available. You could also write COUNT(smallest secondary index column) yourself, but that is more trouble than letting the optimizer choose automatically.

SELECT COUNT(*) FROM student;
SELECT COUNT(1) FROM student;

 Question: MySQL has three ways to count the number of rows in a table: SELECT COUNT(*), SELECT COUNT(1) and SELECT COUNT(specific field). How do these three compare in query efficiency?

Answer: counting the non-NULL rows of a particular field is a different question entirely; after all, comparing execution efficiency only makes sense when the results are the same.

COUNT(*) and COUNT(1): both count all result rows, and there is essentially no difference between them (their execution times may differ slightly, but you can regard their efficiency as equal). With a WHERE clause, they count the rows that satisfy the filter; without one, they count all rows in the table.

MyISAM counts in O(1): with the MyISAM storage engine, counting the rows of a table takes only O(1) complexity, because every MyISAM table keeps a row_count value in its metadata, whose consistency is guaranteed by table-level locks. InnoDB, which supports transactions and uses row-level locks and MVCC, cannot maintain a row_count variable the way MyISAM does, so it must scan the whole table (O(n) complexity), counting rows in a loop.

Suggestion: in InnoDB, when using COUNT(specific field) to count rows, try to use a secondary-index column. The primary key is the clustered index, whose leaf nodes contain entire records, so counting through it loads more data into memory and performs worse. COUNT(*) and COUNT(1) do not fetch specific rows, only count them, and the server automatically uses the secondary index that occupies the least space; if there are several secondary indexes, the one with the smallest key_len is scanned, and only when there is no secondary index at all is the primary key index used.
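The optimizer's choice can be observed with EXPLAIN (a sketch; which index is picked depends on the indexes your table actually has):

```sql
-- With a secondary index such as idx_age_name defined on student, the
-- optimizer typically counts through a secondary index rather than the
-- clustered index; the key column of the EXPLAIN output shows which
-- index was chosen.
EXPLAIN SELECT COUNT(*) FROM student;
```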

12.3 Suggest SELECT <field list> instead of SELECT *

In table queries, it is recommended to name the fields explicitly rather than using * as the select list, i.e. use SELECT <field list>. Reasons:

① During parsing, MySQL queries the data dictionary to expand "*" into all column names in order, which costs resources and time.

② SELECT * cannot use a covering index.

12.4 Effect of LIMIT 1 on Optimization

This applies to SQL statements that would scan the whole table. If you are sure the result set has only one row, adding LIMIT 1 stops the scan as soon as one match is found, speeding up the query.

If the table already has a unique index on the field, the query is resolved through the index without a full table scan, and LIMIT 1 is unnecessary.
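A sketch (assuming student has no unique index on name):

```sql
-- Without LIMIT 1, the whole table is scanned even after a match is found;
-- with LIMIT 1, the scan stops at the first matching row.
SELECT * FROM student WHERE `name` = 'abc' LIMIT 1;
```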

12.5 Use COMMIT more often

Wherever possible, use COMMIT in your programs as often as is appropriate: program performance improves, and contention drops, because of the resources that COMMIT releases.

Resources released by COMMIT:

  • Undo information on rollback segments used to restore data
  • Locks acquired by program statements
  • Space in the redo / undo log buffer
  • The internal overhead of managing the three resources above

13. Primary key design ideas

Let's discuss a practical question: how is the primary key of Taobao's database designed?

Some badly wrong answers still circulate on the Internet year after year, and have even become so-called MySQL "military rules". One of the most glaring mistakes among them concerns MySQL primary key design.

Most people answer with great confidence: use an 8-byte BIGINT as the primary key instead of INT. Wrong!

Such an answer stays at the database level without thinking about the primary key from the business perspective. Should the primary key be an auto-increment ID? Today, using auto-increment as the primary key may not even pass an architecture design review.

13.1 Disadvantages of auto-increment primary key

Using an auto-increment ID as the primary key is easy to understand, and almost all databases support auto-increment types (though the implementations differ). But apart from being simple, auto-increment IDs have drawbacks. Broadly speaking, the problems are the following:

  • low reliability

    There is an auto-increment ID backtracking problem, which was not fixed until MySQL 8.0.

    The backtracking problem: insert three rows with primary keys 1, 2 and 3 into a new table. SHOW CREATE TABLE now reports AUTO_INCREMENT=4; no problem.

    Delete the row with ID=3 and query again: AUTO_INCREMENT is still 4; no problem.

    But restart MySQL, and the value reverts to 3 instead of 4: the counter has backtracked.

  • low security

    The exposed interface makes it very easy to guess information. With an interface like /User/1/, the user ID and the total number of users are trivially guessable, and data can easily be crawled through the interface.

  • poor performance

    Auto-increment IDs perform relatively poorly because they must be generated on the database server side.

  • An extra function call is needed to learn the auto-increment value, which affects performance

    The business also has to call a function such as last_insert_id() to learn the value just inserted, which costs an extra network round trip. In a massively concurrent system, one more SQL statement means one more piece of performance overhead.

  • Not globally unique; auto-increment lock contention hurts performance under high concurrency

    Most importantly, an auto-increment ID is only locally unique, that is, unique within the current database instance, not globally unique across servers. For today's distributed systems, this is simply a nightmare.

  • Auto-increment no longer works after sharding databases and tables or migrating data.
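Two of the drawbacks above, the pre-8.0 backtracking problem and the extra round trip for last_insert_id(), can be reproduced with a sketch like this (the table name is illustrative):

```sql
CREATE TABLE t_demo (id INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t_demo VALUES (1), (2), (3);

SHOW CREATE TABLE t_demo;          -- reports AUTO_INCREMENT=4
DELETE FROM t_demo WHERE id = 3;
SHOW CREATE TABLE t_demo;          -- still AUTO_INCREMENT=4

-- After restarting a pre-8.0 server, the same command reports
-- AUTO_INCREMENT=3: the counter has backtracked. MySQL 8.0 persists
-- the counter in the redo log, so it stays at 4.

-- The extra round trip: learning the key that was just generated
-- requires another statement.
INSERT INTO t_demo VALUES (NULL);
SELECT LAST_INSERT_ID();
```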

13.2 Try not to use business fields as primary keys

To uniquely identify a member record, the member information table needs a primary key. So, how should we set the primary key of this table to achieve our goal? Here we first consider using a business field as the primary key.

The table data is as follows:

In this table, which field is more appropriate?

  • Select card number (cardno)

The membership card number (cardno) seems more appropriate, because the membership card number cannot be empty and is unique, which can be used to identify a membership record.

mysql> CREATE TABLE demo.membermaster
-> (
-> cardno CHAR(8) PRIMARY KEY, -- membership card number as the primary key
-> membername TEXT,
-> memberphone TEXT,
-> memberpid TEXT,
-> memberaddress TEXT,
-> sex TEXT,
-> birthday DATETIME
-> );
Query OK, 0 rows affected (0.06 sec)

Different membership card numbers correspond to different members, and the field "cardno" uniquely identifies a certain member. If this is the case, the membership card number corresponds to the member one by one, and the system can operate normally.

But in reality a membership card number may be reused. For example, Zhang San moves away from his original address because of a job change and stops shopping at the merchant's store (returning his membership card), so he is no longer a member of that store. Not wanting the card number to go to waste, the merchant issues the card with number "10000001" to Wang Wu.

From the system-design point of view, this change only modifies the member record whose card number is "10000001" in the member information table and does not break data consistency. That is, after the record for card "10000001" is modified, every module of the system obtains the updated member information; there is no "some modules see the pre-modification data while others see the post-modification data" inconsistency inside the system. So at the information-system level there is no problem.
But at the business level of using the system there are big problems, and they hurt the merchant.

For example, we have a sales flow table (trans), which records all sales flow details. On December 01, 2020, Zhang San bought a book at the store and spent 89 yuan. Then, there is a record of Zhang San buying books in the system, as shown below:

Next, let's check the membership sales records on December 1, 2020:

mysql> SELECT b.membername,c.goodsname,a.quantity,a.salesvalue,a.transdate
-> FROM demo.trans AS a
-> JOIN demo.membermaster AS b
-> JOIN demo.goodsmaster AS c
-> ON (a.cardno = b.cardno AND a.itemnumber=c.itemnumber);
+------------+-----------+----------+------------+---------------------+
| membername | goodsname | quantity | salesvalue | transdate |
+------------+-----------+----------+------------+---------------------+
|     张三   | 书         | 1.000    | 89.00      | 2020-12-01 00:00:00 |
+------------+-----------+----------+------------+---------------------+
1 row in set (0.00 sec)

If membership card "10000001" is then reissued to Wang Wu, we update the member information table, and the same query now returns:

mysql> SELECT b.membername,c.goodsname,a.quantity,a.salesvalue,a.transdate
-> FROM demo.trans AS a
-> JOIN demo.membermaster AS b
-> JOIN demo.goodsmaster AS c
-> ON (a.cardno = b.cardno AND a.itemnumber=c.itemnumber);
+------------+-----------+----------+------------+---------------------+
| membername | goodsname | quantity | salesvalue | transdate |
+------------+-----------+----------+------------+---------------------+
| 王五        | 书        | 1.000    | 89.00      | 2020-12-01 00:00:00 |
+------------+-----------+----------+------------+---------------------+
1 row in set (0.01 sec)

This time the result says: Wang Wu bought a book on December 1, 2020 and spent 89 yuan. Obviously wrong! Conclusion: do not use the membership card number as the primary key.

  • Select member phone number or ID number

Can the member's phone number be used as the primary key? No. In practice, phone numbers are also reclaimed by carriers and reissued to other people.

What about the ID number? It seems workable: ID numbers never repeat and correspond one-to-one to a person. The problem is that the ID number is personal private information, and customers may be unwilling to give it to you; forcing members to register their ID numbers would drive many customers away. The customer phone number actually has this problem too, which is why our member information table allows both the ID number and the phone number to be empty.

Therefore, it is recommended not to use business-related fields as primary keys. As project designers, none of us can predict which business field might be duplicated or reused, for business reasons, over the entire life cycle of the project.

Lesson: when first using MySQL, many people make the mistake of using business fields as primary keys, taking it for granted that they understand the business requirements. Reality often turns out otherwise, and the cost of changing a primary key later is very high.

13.3 Primary key design of Taobao order number

In Taobao's e-commerce business, the order service is a core business. So how does Taobao design the primary key of its order table? Is it an auto-increment ID?

Open Taobao and look at the order information:

As the screenshot shows, the order number is not an auto-increment ID! Let's look at these 4 order numbers in detail:

1550672064762308113
1481195847180308113
1431156171142308113
1431146631521308113

Each order number is 19 digits long, and the last 6 digits of all four numbers are identical: 308113. Meanwhile, the leading digits of the order numbers increase monotonically over time.

A bold guess: Taobao's order ID is probably designed as:

Order ID = time + dedup field + last 6 digits of the user ID

Such a design can be globally unique and is very friendly to queries in a distributed system.
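As a purely illustrative sketch of this guessed layout (the field widths and the placement of the dedup field are assumptions, not Taobao's actual scheme), such an ID could be assembled like this:

```python
import time

def make_order_id(user_id: int) -> str:
    """Hypothetical sketch of the guessed layout: a 13-digit
    millisecond timestamp (monotonically increasing) followed by
    the last 6 digits of the user id. Field widths are assumptions."""
    ms = int(time.time() * 1000)   # 13-digit millisecond timestamp
    tail = user_id % 1_000_000     # 6-digit user-id tail, e.g. 308113
    return f"{ms}{tail:06d}"

print(make_order_id(987308113))  # 19-digit id ending in 308113
```

In a real system, a dedup/sequence field would sit between the timestamp and the tail to distinguish orders created within the same millisecond.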

13.4 Recommended primary key design

13.4.1 Core and non-core business primary key strategy selection

Non-core business: use an auto-increment ID as the table's primary key, e.g. for alarms, logs, and monitoring data.

Core business: the primary key design should at least be globally unique and monotonically increasing. Global uniqueness guarantees no collisions across systems; monotonic increase ensures inserts do not hurt database performance. With MySQL 8.0, it is recommended to use the function uuid_to_bin(@uuid,true) to convert a UUID into an ordered UUID.

13.4.2  Characteristics of UUID

The simplest primary key design is recommended here: UUID.

It is globally unique, but it occupies 36 bytes, the values are unordered, and insert performance is poor.

Recognize UUIDs:

  • Why are UUIDs globally unique?
  • Why UUID takes 36 bytes?
  • Why are UUIDs unordered?

The UUID composition of the MySQL database is as follows:

UUID = time + UUID version (16 hex digits) - clock sequence (4 hex digits) - MAC address (12 hex digits)

Let's take the UUID value e0ea12d4-6473-11eb-943c-00155dbaa39d as an example:

Why are UUIDs globally unique? 

The time part of a UUID occupies 60 bits and stores a timestamp similar to TIMESTAMP, but it counts 100 ns intervals since 1582-10-15 00:00:00.00. The time precision of a UUID is therefore higher than that of TIMESTAMP, and the chance of a duplicate in the time dimension drops to one tick per 100 ns.

The clock sequence avoids duplicate timestamps when the clock is set back. The MAC address provides global uniqueness.
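To see this 60-bit counter concretely, Python's standard uuid module exposes it directly; here is a small sketch decoding the article's example UUID (field meanings follow RFC 4122):

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUIDv1 timestamps count 100 ns ticks from the Gregorian reform date
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

u = uuid.UUID("e0ea12d4-6473-11eb-943c-00155dbaa39d")
print(u.version)        # 1  (time-based UUID)

# u.time is the 60-bit count of 100 ns intervals; convert to a datetime
created = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)
print(created.date())   # the day this example UUID was generated (early 2021)
```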

Why UUID takes 36 bytes?

A UUID is stored as a string: 32 hexadecimal characters plus 4 meaningless "-" separators, for a total of 36 bytes.

Why are UUIDs random and unordered?

Because the UUID layout places the low bits of the time first, and this part changes constantly, the generated values are out of order.

13.4.3 MySQL 8.0 Primary Key Scheme: Ordered UUIDs

Making it ordered: if the high and low parts of the time are swapped, the value increases monotonically with time. MySQL 8.0 can swap the time-low and time-high fields when storing, turning the UUID into an ordered UUID.

Optimizing space: MySQL 8.0 also solves the UUID's space problem by removing the meaningless "-" characters from the string and saving the value in a binary type, reducing storage to 16 bytes.

Both are done with the uuid_to_bin function provided by MySQL 8.0; MySQL also provides the bin_to_uuid function for the reverse conversion:

SET @uuid = UUID();
SELECT @uuid,uuid_to_bin(@uuid),uuid_to_bin(@uuid,TRUE);

The function uuid_to_bin(@uuid,true) converts the UUID into an ordered UUID. Globally unique + monotonically increasing: isn't that exactly the primary key we want?
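For intuition, the byte shuffle that uuid_to_bin(@uuid, TRUE) performs (move the time-high field to the front, then time-mid, then time-low) can be mimicked in Python:

```python
def uuid_to_bin_swapped(u: str) -> bytes:
    """Mimic MySQL's uuid_to_bin(u, 1): strip the dashes and move
    the time-high and time-mid fields in front of time-low, so the
    16-byte value sorts by creation time."""
    h = u.replace("-", "")
    # layout of h: time_low[0:8] time_mid[8:12] time_hi[12:16] rest[16:32]
    return bytes.fromhex(h[12:16] + h[8:12] + h[0:8] + h[16:32])

b = uuid_to_bin_swapped("e0ea12d4-6473-11eb-943c-00155dbaa39d")
print(b.hex())  # 11eb6473e0ea12d4943c00155dbaa39d
```

Note how the result starts with "11eb": the slow-moving high bits of the time now lead, so consecutively generated UUIDs become byte-wise increasing.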

Ordered UUID performance test:

How does a 16-byte ordered UUID compare with the previous 8-byte auto-increment ID in performance and storage space?

Let's run a test: insert 100 million rows, each row occupying 500 bytes and carrying 3 secondary indexes. The final result is as follows:

The figure shows that the ordered UUID is the fastest at inserting the 100 million rows. In real business use, the ordered UUID can also be generated on the application side, which can further reduce the number of SQL round trips.

In addition, although an ordered UUID is 8 bytes longer than an auto-increment ID, it only adds about 3 GB of storage, which is acceptable.

In today's Internet environment, a database design that uses an auto-increment ID as the primary key is no longer recommended; a globally unique, monotonically increasing scheme like the ordered UUID is preferable.

Moreover, in a real business system, business and system attributes can be folded into the primary key, such as the user's tail number or machine-room information. Such primary key designs test an architect's skill even more.

13.4.4 Primary key scheme before MySQL8.0: manual assignment

Manually assign the field as the primary key!

For example, consider the primary key of each branch store's membership table: if the data generated independently on each store's machine ever needs to be merged, duplicate primary keys can occur.

You can keep a management information table in the headquarters MySQL database and add a field to it that records the current maximum member number.

When a store adds a member, it first obtains the current maximum from the headquarters MySQL database, adds 1 to it, uses the result as the new member's id, and then writes the new maximum back to the headquarters management information table.

This way, every store that adds a member operates on the same field of the same table in the headquarters MySQL database, which eliminates member number conflicts between stores.
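The central-counter logic above can be sketched as follows (a simplified in-process stand-in; in production the counter would be a single row in the headquarters MySQL database updated atomically):

```python
import threading

class HQCounter:
    """Stand-in for the headquarters table that stores the current
    maximum member number; the lock plays the role of the row lock
    MySQL would take during the UPDATE."""
    def __init__(self, current_max: int = 0):
        self._max = current_max
        self._lock = threading.Lock()

    def next_member_id(self) -> int:
        with self._lock:
            self._max += 1     # read the max, add 1, write it back
            return self._max   # this value becomes the new member's id

hq = HQCounter(100)
print(hq.next_member_id())  # 101
print(hq.next_member_id())  # 102
```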

13.3.5 Snowflake Algorithm

Snowflake generates ordered ids.

Each id is a 64-bit integer (a Java Long) composed of: a 1-bit sign, a 41-bit timestamp, a 10-bit worker machine id, and a 12-bit sequence number.
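A minimal single-machine sketch of that bit layout (the epoch value and the overflow handling are illustrative assumptions, not a production implementation):

```python
import time
import threading

EPOCH_MS = 1288834974657  # commonly cited Twitter epoch; an assumption here

class Snowflake:
    """1 sign bit + 41-bit timestamp + 10-bit machine id + 12-bit sequence."""
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < (1 << 10)
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:          # 4096 ids used up this ms
                    while now <= self.last_ms:  # busy-wait for the next ms
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = Snowflake(machine_id=1)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # True: ids increase with time on one machine
```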

Advantages:

  • Ordered: generated ids increase with time.
  • Distributed and duplicate-free: no duplicate ids are produced anywhere in the distributed system.

Disadvantages:

  • Depends on the machine clock: if a machine's clock is set back, duplicate ids can be generated.
  • Unsynchronized clocks break global ordering: ids increase on a single machine, but in a distributed environment the machines' clocks may drift apart, so ids are not guaranteed to increase globally.
  • Precision loss: the 64-bit integer becomes up to 19 decimal digits, but front-end JavaScript can only represent integers exactly up to about 16 digits, so the front end rounds off the last few digits and precision is lost.


Origin blog.csdn.net/qq_40991313/article/details/130804019