MYSQL06 Advanced_Why use indexes, advantages and disadvantages, index design, solutions, clustered indexes, joint indexes, precautions

①. Why use indexes

  • ①. An index is a data structure that storage engines use to find data records quickly. It works like the catalog used to find a book in a library, or the directory used to look up a word in the Xinhua Dictionary: it tells us quickly where the data is located.

  • ②. The same is true in MySQL. When searching for data, MySQL first checks whether the query conditions can use an index. If they can, the matching data is located through the index. If they cannot, a full table scan is required: the records are examined one by one until all records that meet the conditions have been found.

  • ③. If the data is instead organized in a search structure such as a binary tree, a lookup only has to visit a small part of the records. The purpose of building an index is to reduce the number of disk I/O operations and speed up queries.
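The difference between a full scan and a lookup over sorted keys can be sketched by counting comparisons. This is only an in-memory illustration: the function names are made up for this example, and a plain sorted list stands in for the on-disk index.

```python
def full_scan(rows, target):
    """Check rows one by one; return (position, comparisons made)."""
    comparisons = 0
    for pos, key in enumerate(rows):
        comparisons += 1
        if key == target:
            return pos, comparisons
    return -1, comparisons


def indexed_lookup(sorted_keys, target):
    """Binary search over sorted keys; return (position, comparisons made)."""
    lo, hi, comparisons = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return (lo if found else -1), comparisons


keys = list(range(1, 1_000_001))          # one million sorted keys
scan_pos, scan_cost = full_scan(keys, 900_000)
idx_pos, idx_cost = indexed_lookup(keys, 900_000)
print(scan_cost, idx_cost)                # the scan needs 900000 checks, the search about 20
```

On disk, each comparison may cost a page read, which is why reducing the number of steps matters so much.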

②. Index and its advantages and disadvantages

  • ①. MySQL's official definition: an index is a data structure that helps MySQL obtain data efficiently.

  • ②. Essence: an index is a data structure. It can be understood as a sorted data structure that supports fast lookup.

  • ③. Indexes are implemented in storage engines, so the indexes of each storage engine are not necessarily exactly the same, and each storage engine does not necessarily support all index types.

  • ④. Advantages

  1. Improve data retrieval efficiency and reduce database IO costs
  2. By creating a unique index, you can ensure the uniqueness of each row of data in the database table
  3. Indexes speed up joins between tables, which also helps when enforcing referential integrity. In other words, queries that join a dependent child table with its parent table become more efficient.
  4. When using grouping and sorting clauses for data query, the time for grouping and sorting in the query can be significantly reduced, and the CPU consumption is reduced.
  • ⑤. Disadvantages
  1. Creating and maintaining indexes takes time, and as the amount of data increases, the time spent will also increase.
  2. Indexes need to occupy disk space. In addition to the data space occupied by the data table, each index also occupies a certain amount of physical space and is stored on the disk. If there are a large number of indexes, the index file may reach the maximum file size faster than the data file.
  3. Although the index greatly improves the query speed, it also reduces the speed of updating the table. When adding, deleting, and modifying data in the table, the index must also be dynamically maintained, which reduces the data maintenance speed.

③. InnoDB - index design

  • ①. First, create a table:
    The new index_demo table has two INT columns and one CHAR column, and declares c1 as the primary key. The table stores its records in the Compact row format. Row formats are covered in detail later; for now a simplified view of the row format is enough.
CREATE TABLE index_demo(
	c1 INT,
	c2 INT,
	c3 CHAR(1),
	PRIMARY KEY (c1)
) ROW_FORMAT = Compact;
  • ②. The simplified view of a record shows only these parts:
  1. record_type: an attribute in the record header that indicates the type of the record. 0 means an ordinary user record, 1 means a directory entry record, 2 means the minimum record, and 3 means the maximum record
  2. next_record: an attribute in the record header that gives the address offset of the next record relative to this record. We use arrows to show which record comes next, so the records can be understood as a linked list
  3. Values of each column: only the three columns of the index_demo table are recorded here, namely c1, c2 and c3
  4. Other information: everything except the three kinds of information above, including the values of hidden columns and the other additional information of the record


  • ③. The primary key value of the user record in the next data page must be greater than the primary key value of the user record in the previous page.
  1. Assumption: Each data page can store up to three records. In fact, a data page is very large and can store many records. Insert 3 records into the table and fill the data page
INSERT INTO index_demo VALUES(1,4,'u'),(3,9,'d'),(5,3,'y');
  2. These records are then linked into a one-way linked list in order of primary key value
  3. Now we insert a record with primary key 4. The data page has no room left, so a new data page has to be allocated. Because 4 < 5, the new record belongs in page 10, and the record with primary key 5 is moved to the next page. This process is called page splitting.
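The page split described above can be sketched as follows. This is a simplified model under the text's assumption of three records per page; `insert_with_split` is a hypothetical helper, and real InnoDB page splits involve considerably more bookkeeping.

```python
PAGE_CAPACITY = 3  # the text's simplifying assumption, not a real InnoDB limit


def insert_with_split(page, key, capacity=PAGE_CAPACITY):
    """Insert key into a sorted page; return (page, overflow_page_or_None)."""
    records = sorted(page + [key])
    if len(records) <= capacity:
        return records, None
    # Overflow: keep the smallest `capacity` keys and move the rest to a new
    # page, so every key in the new page is greater than every key in the old one.
    return records[:capacity], records[capacity:]


page_10 = [1, 3, 5]                      # primary keys from the INSERT above
page_10, new_page = insert_with_split(page_10, 4)
print(page_10, new_page)                 # [1, 3, 4] [5]
```

The record with primary key 5 ends up in the new page, which keeps the rule that keys in a later page are larger than keys in an earlier one.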
  • ④. Create a directory entry for every page. Because the page numbers of data pages may not be consecutive, after many inserts the records end up spread across pages whose numbers are not contiguous.
  • ⑤. Because these data pages are not contiguous in physical storage, quickly locating a record by primary key value among so many pages requires a directory. Each page gets one directory entry, and each directory entry has two parts: the smallest primary key value among the user records in the page, and the page number.
    For example, finding the record with primary key value 20 takes two steps:
  1. First, use binary search over the directory entries to determine that the record with primary key value 20 belongs to directory entry 3 (because 12 < 20 < 209), whose page is page 9
  2. Then locate the record inside page 9, using the in-page record search method described earlier
  • ⑥. The simple directory for the data pages is now complete. This directory has another name: the index.
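The two-step lookup through the directory can be sketched with Python's `bisect` module. Only the values 12, 209 and page 9 come from the text; the other directory entries and all page contents below are assumed for illustration.

```python
from bisect import bisect_left, bisect_right

# Directory entries: (smallest primary key in the page, page number), sorted by key.
# Assumed values, except that key 12 maps to page 9 and the next entry starts at 209.
directory = [(1, 10), (5, 28), (12, 9), (209, 20)]

# Assumed page contents: sorted primary keys stored in each page.
pages = {10: [1, 3, 4], 28: [5, 8, 10], 9: [12, 20, 100], 20: [209, 220]}


def find_page(pk):
    """Step 1: binary search the directory for the page that may hold pk."""
    min_keys = [min_key for min_key, _ in directory]
    slot = max(bisect_right(min_keys, pk) - 1, 0)
    return directory[slot][1]


def find_record(pk):
    """Step 2: binary search inside the chosen page."""
    page_no = find_page(pk)
    records = pages[page_no]
    i = bisect_left(records, pk)
    return page_no, i < len(records) and records[i] == pk


print(find_record(20))   # (9, True)
print(find_record(21))   # (9, False)
```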

④. Index scheme in InnoDB

  • ①. Iteration 1: a page of directory entry records. The directory entries used above are themselves stored in a data page.
  • ②. A new page, numbered 30, is allocated specifically to store the directory entry records. Once more, the differences between directory entry records and ordinary user records are:
  1. The record_type value of directory entry records is 1, while the record_type value of ordinary user records is 0
  2. Directory entry records only have two columns: primary key value and page number, while the columns of ordinary user records are defined by the user and may contain many columns. In addition, there are hidden columns added by InnoDB itself.
  3. Note: the record header also has an attribute called min_rec_mask. Within a page that stores directory entry records, only the directory entry record with the smallest primary key value has min_rec_mask set to 1; for all other records it is 0.
  • ③. Similarity: both kinds of records are stored in the same kind of data page, and a Page Directory is generated over the primary key values, so binary search can speed up lookups by primary key value.
    Taking the record with primary key 20 as an example, a lookup by primary key value takes roughly two steps:
  1. First go to the page that stores the directory entry records, page 30, and use binary search to locate the matching directory entry. Because 12 < 20 < 209, the record is in page 9.
  2. Then go to page 9, which stores user records, and use binary search to locate the user record with primary key value 20.
  • ④. Iteration 2: multiple pages of directory entry records
  • ⑤. Iteration 3: a directory page for the pages of directory entry records.
    A page 33 is created to store higher-level directory entries. Its two records represent page 30 and page 32. If the primary key value of a user record is in [1, 320), go to page 30 to find the more detailed directory entry records. If the primary key value is not less than 320, go to page 32.
  • ⑥. This data structure is called a B+ tree.
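The descent from the top-level directory page down to a page of user records can be sketched as a loop. Pages 33, 30, 32 and 9 and the boundary 320 come from the text; every other page number and key below is assumed for illustration.

```python
from bisect import bisect_right

# Pages of directory entry records: sorted (smallest key, child page number).
internal = {
    33: [(1, 30), (320, 32)],                      # top-level page from the text
    30: [(1, 10), (5, 28), (12, 9), (209, 20)],    # partly assumed
    32: [(320, 31), (400, 34)],                    # assumed
}
# Pages of user records: sorted primary keys (assumed contents).
leaves = {
    10: [1, 3, 4], 28: [5, 8, 10], 9: [12, 20, 100],
    20: [209, 220, 300], 31: [320, 350], 34: [400, 450],
}


def search(root, pk):
    """Walk from the root to a leaf page; return (pages visited, found?)."""
    path, page_no = [root], root
    while page_no in internal:
        entries = internal[page_no]
        min_keys = [k for k, _ in entries]
        slot = max(bisect_right(min_keys, pk) - 1, 0)
        page_no = entries[slot][1]
        path.append(page_no)
    return path, pk in leaves[page_no]


print(search(33, 20))    # ([33, 30, 9], True)
print(search(33, 350))   # ([33, 32, 31], True)
```

Each level repeats the same "find the last entry whose key is not greater than the target" step, which is what makes the structure a tree of directories.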

  • ⑦. The nodes of a B+ tree can be divided into layers. The bottom layer, where the user records are stored, is designated layer 0, and the layer number increases going up. Earlier we made a very extreme assumption: a page that stores user records holds at most 3 records, and a page that stores directory entry records holds at most 4 records. In a real environment a page holds far more records. Assume that every leaf-node data page can hold 100 user records and every internal-node data page can hold 1,000 directory entry records. Then:
  1. If the B+ tree has only 1 level, that is, a single node that stores user records, it can hold at most 100 records.
  2. If the B+ tree has 2 levels, it can hold at most 1000×100 = 100,000 records.
  3. If the B+ tree has 3 levels, it can hold at most 1000×1000×100 = 100,000,000 records.
  4. If the B+ tree has 4 levels, it can hold at most 1000×1000×1000×100 = 100,000,000,000 records, which is a great many.
  5. Can your table hold 100,000,000,000 records? So in practice the B+ trees we use rarely exceed 4 levels. Finding a record by primary key value then requires searching at most 4 pages (3 directory entry pages and 1 user record page), and because every page has a Page Directory, binary search can also be used to locate the record quickly within each page.
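The capacity arithmetic above follows one formula: fan-out raised to the number of internal levels, times the leaf capacity. A small sketch, using the text's assumed figures of 100 records per leaf page and 1,000 entries per internal page (`max_records` is a name chosen for this example):

```python
def max_records(levels, leaf_capacity=100, internal_fanout=1000):
    """Upper bound on user records in a B+ tree with the given number of levels."""
    return internal_fanout ** (levels - 1) * leaf_capacity


for levels in range(1, 5):
    print(levels, f"{max_records(levels):,}")
# 1 100
# 2 100,000
# 3 100,000,000
# 4 100,000,000,000
```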

⑤. Index-clustered index

  • ①. Use the size of the record's primary key value to sort records and pages, which has three meanings:
  1. The records in the page are arranged in a one-way linked list in order of primary key size.
  2. Each page that stores user records is also arranged in a doubly linked list based on the primary key size of the user records in the page.
  3. The pages that store directory entry records are divided into different levels. Pages in the same level are also arranged in a doubly linked list based on the primary key size of the directory entry records in the page.
  • ②. The leaf nodes of the B+ tree store complete user records. A complete user record means that the values of all columns, including hidden columns, are stored in the record.

  • ③. We call the B+ tree with these two characteristics a clustered index, and all complete user records are stored at the leaf nodes of this clustered index. This kind of clustered index does not require us to explicitly use the INDEX statement in the MySQL statement to create it. The InnoDB storage engine will automatically create a clustered index for us.

  • ④. Advantages of clustered index

  1. Data access is faster because a clustered index keeps the index and data in the same B+ tree, so getting data from a clustered index is faster than a non-clustered index
  2. Clustered indexes are very fast for primary key sort searches and range searches.
  3. Because rows are stored in clustered index order, a query for a range of data reads rows that sit close together, so the database does not have to fetch data from many separate data blocks, which saves a lot of I/O.
  • ⑤. Disadvantages of clustered index
  1. The insertion speed depends heavily on the insertion order. Inserting in the order of the primary key is the fastest way. Otherwise, page splits will occur, seriously affecting performance. Therefore, for InnoDB tables, we generally define an auto-incrementing ID column as the primary key.
  2. Updating the primary key is expensive because it causes the updated rows to move. Therefore, for InnoDB tables, we generally define the primary key as non-updatable.
  • ⑥. Restrictions
  1. In MySQL, currently only the InnoDB storage engine supports clustered indexes; MyISAM does not.
  2. Because data can be physically stored in only one sort order, each MySQL table can have only one clustered index. Usually it is the table's primary key.
  3. If no primary key is defined, InnoDB chooses a non-null unique index instead. If there is no such index either, InnoDB implicitly defines a primary key to serve as the clustered index.
  4. To make full use of the clustering property, try to use ordered, sequential IDs as the primary key column of an InnoDB table. Unordered IDs such as UUID, MD5 or HASH values, or string columns, are not recommended as primary keys, because they cannot guarantee that the data grows in order.
  • ⑦. A clustered index is not a separate index type but a way of storing data (all records are stored in the leaf nodes). This is the so-called "the index is the data, and the data is the index".
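Why insertion order matters for a clustered index can be sketched with a toy model that counts how many already-stored records have to move because of page splits. The model is an assumption made for illustration, not InnoDB's actual algorithm: a page holds four keys, a key appended at the end of a full page simply starts a new page, and any other overflow splits the page in half.

```python
import random
from bisect import bisect_right, insort


def count_moved_records(keys, capacity=4):
    """Insert keys into ordered leaf pages; count existing records moved by splits."""
    pages = [[]]          # leaf pages, kept in key order
    moved = 0
    for key in keys:
        # The first key of every page after the first marks a page boundary.
        first_keys = [page[0] for page in pages[1:]]
        i = bisect_right(first_keys, key)
        page = pages[i]
        insort(page, key)
        if len(page) > capacity:
            if page[-1] == key:
                # Appended at the end: the new key alone starts a new page.
                page.pop()
                pages.insert(i + 1, [key])
            else:
                # Inserted in the middle: the upper half moves to a new page.
                mid = len(page) // 2
                upper = page[mid:]
                del page[mid:]
                pages.insert(i + 1, upper)
                moved += sum(1 for k in upper if k != key)
    return moved


sequential = list(range(1, 1001))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)
print(count_moved_records(sequential), count_moved_records(shuffled))
```

In this model, ascending keys never move an existing record, while the shuffled keys do, which is the intuition behind preferring an auto-incrementing primary key.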

⑥. Index - secondary index

  • ①. A secondary index is also called an auxiliary index or a non-clustered index. The clustered index introduced above only helps when the search condition is the primary key, because the data in its B+ tree is sorted by primary key. In actual development we often search by other columns. If a column is used as a search condition very frequently, we can consider creating a secondary index on it to speed up the search.
  • ②. Use the size of the record c2 column to sort records and pages, which includes three aspects:
  1. The records in the page are arranged in a one-way linked list according to the size of column c2.
  2. Each page that stores user records is also arranged in a doubly linked list based on the size of the c2 column of the user records in the page.
  3. Pages storing directory entry records are divided into different levels. Pages in the same level are also arranged in a doubly linked list according to the size of the c2 column of the directory entry records in the page.
  • ③. The leaf nodes of the B+ tree do not store the complete user record, only the value of the c2 column plus the primary key.

  • ④. The directory entry record is no longer the combination of primary key + page number, but the combination of column c2 + page number.

  • ⑤. A search through the secondary index first finds the matching primary key id in the B+ tree of the secondary index and then searches the B+ tree of the clustered index with it; this second lookup is called going back to the table. You may ask: why not store complete user records in the B+ tree of the secondary index?
    Putting the complete user records in its leaf nodes would make going back to the table unnecessary, but it would take up too much space. It would amount to copying all user records every time an index is created, which is a huge waste of storage.
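The two lookups involved in a secondary index search can be sketched as follows. The rows are the three inserted into index_demo earlier; a dict and a sorted list stand in for the two B+ trees, and `find_by_c2` is a hypothetical helper.

```python
from bisect import bisect_left

# Clustered index: primary key c1 -> complete row (c1, c2, c3).
clustered = {1: (1, 4, 'u'), 3: (3, 9, 'd'), 5: (5, 3, 'y')}

# Secondary index on c2: sorted (c2, c1) pairs, with no other columns.
secondary = sorted((row[1], pk) for pk, row in clustered.items())


def find_by_c2(value):
    """Find primary keys in the secondary index, then go back to the table."""
    rows = []
    i = bisect_left(secondary, (value,))       # first entry with c2 >= value
    while i < len(secondary) and secondary[i][0] == value:
        pk = secondary[i][1]
        rows.append(clustered[pk])             # second lookup, in the clustered index
        i += 1
    return rows


print(find_by_c2(9))   # [(3, 9, 'd')]
```

The secondary structure holds only two values per row, which is the space saving the paragraph above refers to.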

  • ⑥. The principles of clustered index and non-clustered index are different, and there are also some differences in use.

  1. The leaf nodes of the clustered index store our data records, and the leaf nodes of the non-clustered index store the data location. Non-clustered indexes do not affect the physical storage data of the table
  2. A table can only have one clustered index, because there can only be one sorted storage method, but there can be multiple non-clustered indexes
  3. With a clustered index, queries are efficient, but inserts, deletes and updates are less efficient than with a non-clustered index. Modifying a non-clustered index only touches its own B+ tree, whereas modifying the clustered index touches its B+ tree and also requires the primary key ids stored in the leaf nodes of the non-clustered indexes to be updated.
  • ⑦. A joint index is also a kind of secondary index. If we create an index on multiple columns, for example sorting by c2 first and then by c3 when the c2 values are equal, the result is a joint index.
    Compared with a single-column secondary index, a joint index stores more data in each node. The details are not covered here.
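The ordering of a joint index on (c2, c3) can be sketched with tuple sorting. The first three rows are the ones inserted earlier; the last two are hypothetical rows added so that several records share the same c2 value.

```python
# Rows as (c1, c2, c3); the rows with c1 = 7 and c1 = 9 are assumed for illustration.
rows = [(1, 4, 'u'), (3, 9, 'd'), (5, 3, 'y'), (7, 4, 'a'), (9, 4, 'q')]

# Joint index entries: ordered by c2, then by c3, with the primary key attached.
joint = sorted((c2, c3, c1) for c1, c2, c3 in rows)
print(joint)
# [(3, 'y', 5), (4, 'a', 7), (4, 'q', 9), (4, 'u', 1), (9, 'd', 3)]
```

Among the entries with c2 = 4, the order is decided by c3, which is exactly the "sort by c2 first, then by c3" rule.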

⑦. Precautions for B+ tree index

Things to note about InnoDB’s B+ tree index

  • ①. The root page never moves.
    When the B+ tree index was introduced earlier, for ease of understanding, the leaf nodes that store user records were drawn first and the internal nodes that store directory entry records afterwards. In fact, a B+ tree is formed as follows:
  1. Whenever a B+ tree index is created for a table (the clustered index is not created manually; it exists by default), a root node page is created for that index. When the table has no data at first, the root node of each B+ tree index contains neither user records nor directory entry records
  2. When user records are later inserted into the table, they are first stored in this root node
  3. When the free space in the root node is used up and records continue to be inserted, all records in the root node are copied to a newly allocated page, say page a, and a page split is performed on that new page to obtain another new page, say page b. The newly inserted record is then placed in page a or page b according to its key value (the primary key value in a clustered index, or the indexed column value in a secondary index), and the root node is upgraded to a page that stores directory entry records
    Pay special attention to one point in this process: the root node of a B+ tree index does not move from the moment it is created. Once an index is created on a table, the page number of its root node is recorded somewhere, and whenever the InnoDB storage engine needs to use the index, it reads the root node's page number from that fixed place to access the index.
  • ②. The uniqueness of the directory entry records in the internal nodes.
    We know that a directory entry record in an internal node of a B+ tree index is the combination of index column + page number, but for a secondary index this combination is not quite rigorous. Take the index_demo table as an example. Suppose the data in this table looks like this:
c1	c2	c3
1	1	‘u’
3	1	‘d’
5	1	‘y’
7	1	‘a’

If the directory entry records in the secondary index contained only index column + page number, the B+ tree built for an index on column c2 would run into a problem.
Suppose we want to insert a new row whose c1, c2 and c3 values are 9, 1 and 'c'. When updating the B+ tree of the secondary index on c2, we hit a big problem: the directory entry records stored in page 3 consist of c2 value + page number, the c2 values of both directory entry records in page 3 are 1, and the c2 value of the new record is also 1. So should the new record go into page 4 or into page 5? There is no way to tell.
For a newly inserted record to find the page it belongs to, the directory entry records of the nodes within the same level of the B+ tree must be unique apart from the page number field. Therefore the directory entry records of a secondary index's internal nodes actually consist of three parts:

  1. Index column value
  2. Primary key value
  3. Page number
    That is, the primary key value is added to the directory entry records of the secondary index's internal nodes. This ensures that the directory entry records in each level of the B+ tree are unique apart from the page number field.
    When we then insert the record (9, 1, 'c'), the directory entry records in page 3 consist of c2 value + primary key + page number, so we first compare the new record's c2 value with the c2 value of each directory entry record in page 3. If the c2 values are equal, we go on to compare the primary key values. Because the c2 value + primary key value of different directory entry records in the same level of the B+ tree are guaranteed to differ, a unique directory entry record can always be located. In this case the new record is finally determined to belong in page 5.
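The comparison on (c2, primary key) can be sketched with tuple comparison. Page numbers 3, 4 and 5 and the new record (9, 1, 'c') come from the text; the primary key values 1 and 5 stored in the two directory entries are an assumption based on the table above.

```python
from bisect import bisect_right

# Directory entries in page 3: (c2 value, primary key, page number).
# The primary key boundaries 1 and 5 are assumed for illustration.
entries = [(1, 1, 4), (1, 5, 5)]


def candidate_pages_c2_only(c2):
    """With only c2 in the entry, every entry with the same c2 is a candidate."""
    return [page_no for key, _, page_no in entries if key == c2]


def target_page(c2, pk):
    """With (c2, primary key) in the entry, exactly one entry matches."""
    bounds = [(key, primary) for key, primary, _ in entries]
    slot = max(bisect_right(bounds, (c2, pk)) - 1, 0)
    return entries[slot][2]


print(candidate_pages_c2_only(1))   # [4, 5], ambiguous
print(target_page(1, 9))            # 5
```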
  • ③. A page stores at least 2 records.
    A B+ tree can store hundreds of millions of records with only a few levels, and lookups remain fast. This is because a B+ tree is essentially one large multi-level directory: each directory passed through filters out many irrelevant subdirectories, until the directory that holds the real data is reached. What would happen if a large directory held only one subdirectory? There would be a great many directory levels, and the last directory, the one holding real data, could store only a single record. All that effort for just one real user record? That is why an InnoDB data page holds at least two records.

⑧. Index scheme in MyISAM

  • ①. Storage engines and their B-tree index support:
  1. The default index type of InnoDB and MyISAM is the B-tree index, while the default index type of Memory is the hash index.
  2. The MyISAM engine uses a B+ tree as its index structure. The data field of a leaf node stores the address of the data record
  • ②. In a MyISAM index, each leaf node entry holds the key value and the address of the corresponding data record.
  • ③. A secondary index created on Col2 has the same structure: its leaf node entries also hold addresses of data records.
  • ④. Comparison between MyISAM and InnoDB
  1. In the InnoDB storage engine, a single lookup in the clustered index by primary key value finds the corresponding record, whereas in MyISAM a back-to-table operation is always required. This means that all indexes built in MyISAM are equivalent to secondary indexes
  2. InnoDB's data file is itself an index file, whereas MyISAM's index file and data file are separate: the index file stores only the addresses of data records
  3. The data field of an InnoDB non-clustered index stores the primary key value of the corresponding record, whereas a MyISAM index stores the address. In other words, all non-clustered indexes in InnoDB reference the primary key as their data field
  4. MyISAM's back-to-table operation is very fast because it uses the address offset to fetch the data directly from the file. InnoDB, by contrast, obtains the primary key and then searches the clustered index for the record. That is not slow, but it is still slower than accessing the record directly by address
  5. InnoDB requires every table to have a primary key (MyISAM tables can have none). If none is specified explicitly, MySQL automatically selects a non-null column that uniquely identifies the data records as the primary key. If no such column exists, MySQL automatically generates an implicit field as the primary key of the InnoDB table. This field is 6 bytes long and of long integer type.
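The contrast between the two engines can be sketched with plain Python containers. Dicts stand in for the B+ trees, a list stands in for the MyISAM data file, and a list position stands in for a record address; all names are hypothetical, and c2 is assumed unique to keep the sketch short.

```python
# MyISAM-style: rows live in a separate data file, in insertion order.
data_file = [(5, 3, 'y'), (1, 4, 'u'), (3, 9, 'd')]          # rows as (c1, c2, c3)

# Every MyISAM index maps a key to the address (here: list position) of the row.
primary_index = {row[0]: addr for addr, row in enumerate(data_file)}
col2_index = {row[1]: addr for addr, row in enumerate(data_file)}


def myisam_lookup(index, key):
    """Index gives an address; the row is then fetched from the data file."""
    return data_file[index[key]]


# InnoDB-style: the clustered index holds the rows themselves, keyed by c1,
# and a secondary index maps c2 to the primary key rather than to an address.
clustered = {row[0]: row for row in data_file}
secondary_c2 = {row[1]: row[0] for row in data_file}


def innodb_lookup_by_c2(c2):
    """Secondary index gives a primary key; the row comes from the clustered index."""
    return clustered[secondary_c2[c2]]


print(myisam_lookup(primary_index, 3))   # (3, 9, 'd')
print(myisam_lookup(col2_index, 9))      # (3, 9, 'd')
print(innodb_lookup_by_c2(9))            # (3, 9, 'd')
```

In the MyISAM-style model both indexes have the same shape, while in the InnoDB-style model a primary key lookup needs only the clustered structure.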


Origin blog.csdn.net/TZ845195485/article/details/132178038