MySQL advanced learning summary 6: the concept of indexes, a detailed walkthrough of how a B+ tree is built, and a comparison between MyISAM and InnoDB

1. Introduction to indexes

1.1 Why use indexes

We have learned a lot of SQL syntax so far. Now, given a SELECT query, how does MySQL actually find the data that matches it? And how does it find it quickly?

Here we can introduce the concept of an index: an index is a data structure used by the storage engine to quickly find data records.

Note the key point in that definition: an index is a data structure!

1.2 Find data records

Next, let's work through an example to slowly understand what it means for an index to be a data structure...

For example, suppose we have an employees table. After we issue the SELECT statement below, how does MySQL find this data? Let's analyze it.

select * from employees where employee_id=100;


First, create a simpler table, index_demo. It has only 3 fields: c1, c2, c3, and the c1 field is the primary key.

CREATE TABLE index_demo(
    c1 INT,
    c2 INT,
    c3 CHAR(1),
    PRIMARY KEY(c1)
) ROW_FORMAT = Compact;

The row format used here is Compact, which describes how each row of data is actually stored:

  1. record_type: an attribute in the record header indicating the type of record. 0 means a normal user record, 1 a directory entry record, 2 the minimum record, 3 the maximum record.
  2. next_record: an attribute in the record header giving the address offset of the next record relative to this record.
  3. The value of each column: here only the three columns of index_demo are recorded, namely c1, c2, c3.
  4. Other information: everything besides the above three, including the values of hidden columns and additional record metadata.
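As a rough illustration (a toy sketch, not InnoDB's real on-disk layout), the records on a page can be modeled as a singly linked list chained in primary-key order via next_record:

```python
# Toy model of Compact-format records on one page: each record's
# next_record points at the next record in primary-key (c1) order.

class Record:
    def __init__(self, c1, c2, c3, record_type=0):
        self.columns = (c1, c2, c3)    # the values of columns c1, c2, c3
        self.record_type = record_type # 0=user, 1=directory, 2=min, 3=max
        self.next_record = None        # next record in primary-key order

def link_by_primary_key(records):
    """Chain records into a singly linked list sorted by c1 (the PK)."""
    ordered = sorted(records, key=lambda r: r.columns[0])
    for cur, nxt in zip(ordered, ordered[1:]):
        cur.next_record = nxt
    return ordered[0]  # head of the list

head = link_by_primary_key([Record(3, 9, 'd'), Record(1, 4, 'u'), Record(5, 3, 'y')])
r, keys = head, []
while r:
    keys.append(r.columns[0])
    r = r.next_record
print(keys)  # [1, 3, 5]
```

Even though the rows were inserted out of order, traversing the list visits them in primary-key order.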


1.3 Insert data, multi-page search

After creating the table, we can insert data and start searching. First, insert 3 rows:

INSERT INTO index_demo VALUES 
(1, 4, 'u'),
(3, 9, 'd'),
(5, 3, 'y');

MySQL loads data records from disk in units of pages; the size of one page is 16KB, so each load brings in at most 16KB of data.

Although we only inserted 3 records, assume these 3 records already occupy 16KB, so that one page is full, as shown below:

It can be seen that within a page, the records are chained into a singly linked list ordered by the primary key (c1). To find a record, you can search sequentially within the page.

Since these three records have already filled one page, inserting another record requires allocating a new page to store it. Because records must stay sorted by primary key, we then check whether existing records need to move between pages, and if so move them according to their primary key values. This process is called a page split.

For example, suppose another record is inserted. The record with primary key value 5 must be moved to the newly allocated page 28, and then the record with primary key value 4 is inserted into page 10.

INSERT INTO index_demo VALUES 
(4, 4, 'a');
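The page split described above can be sketched as a toy model (the page capacity of 3 records and the page numbers are illustrative, not InnoDB's real behavior):

```python
# Toy sketch of the page split: page 10 is full with keys [1, 3, 5];
# inserting key 4 forces key 5 out to a newly allocated page 28.

PAGE_CAPACITY = 3  # pretend a 16KB page holds only 3 records

pages = {10: [1, 3, 5]}  # page_no -> sorted primary keys on that page
next_page_no = 28        # next free page number (illustrative)

def insert_key(key):
    global next_page_no
    page = pages[10]
    if len(page) >= PAGE_CAPACITY:
        # split: move the largest record to a newly allocated page
        pages[next_page_no] = [page.pop()]
        next_page_no += 1
    page.append(key)
    page.sort()  # keep the page ordered by primary key

insert_key(4)
print(pages)  # {10: [1, 3, 4], 28: [5]}
```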


As database records increase, more and more pages are newly allocated, roughly as shown in the figure below:
1) Since the pages may not be physically contiguous on disk, the pages are connected to one another with a doubly linked list.
2) Within each page, since searching a singly linked list is slow, an array of record addresses (a slot array) can be maintained; a record can then be found within the page quickly via binary search.

So, to find the record (20, 2, 'e'):
1) Start at page 10; a binary search fails to find it;
2) Follow the linked list to the next page, 28; binary search fails again;
3) Follow the linked list to page 9; this time the binary search finds it, and the data record is returned.
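The page-by-page lookup above can be sketched in Python (the page contents and page numbers follow the running example; the slot-array binary search is modeled with bisect):

```python
# Walk pages through the linked list; binary-search each page's sorted keys.
import bisect

# page_no -> (sorted primary keys on that page, next page_no or None)
pages = {
    10: ([1, 3, 4], 28),
    28: ([5, 8, 9], 9),
    9:  ([12, 17, 20], None),
}

def find(key, first_page=10):
    page_no, loads = first_page, 0
    while page_no is not None:
        keys, nxt = pages[page_no]
        loads += 1                       # one disk page load
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return page_no, loads
        page_no = nxt                    # follow the linked list
    return None, loads

print(find(20))  # (9, 3): found on page 9 after loading 3 pages
```

Note how the number of page loads grows linearly with the number of pages, which is exactly the problem the directory entries in 1.4 solve.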

1.4 Page lookup based on directory entry records

From 1.3, multi-page search has an obvious problem: as the amount of data grows, searching the pages sequentially one by one becomes far too slow.

Therefore, a directory item can be made for each page, and each directory item includes the following two parts:

  1. The minimum primary key value of the page, represented by key
  2. Page number, represented by page_no

Therefore, the multiple pages from 1.3 can now be represented by the figure below.
At this time, we again look for the record (20, 2, 'e').
1) In the directory page, binary search locates directory entry 3, because the search key 20 is greater than or equal to the minimum primary key 12 of directory entry 3, and less than the minimum primary key 209 of directory entry 4.
2) Go to page 9, which directory entry 3 points to, and find the record there via binary search.

At this point, you can see that in 1.3 the search went page by page, loading a page from disk each time, which is very time-consuming. With the directory added, only two disk page loads are needed to find the data.

It should be noted that loading a disk page costs far more than in-memory work; the two differ by at least an order of magnitude, and usually several. So there is no need to worry whether an in-memory algorithm is O(n) or O(n²): if the disk has to load pages many times, that time far exceeds the time spent executing the program in memory.

At this time, the data structure of the page based on the directory entry record is as shown in the figure below.
It can be seen that the first layer is the directory entry record, and the second layer is the data record.
The directory entry records only the minimum value of the primary key and the physical address of the corresponding page. The data record actually contains the data of this record. Their distinction is based on the record_type attribute introduced earlier:

  • 0: common user record
  • 1: directory entry record
  • 2: Minimum record
  • 3: Maximum record

It should be noted that since data records contain the actual data, the number of directory entry records that fit in one page (16KB) is far greater than the number of data records that fit.

For example, if a data record is 160B, one disk page can store 100 data records.
A directory record holds only the page's minimum primary key value and the page number (just 2 values), so assuming a directory record is 16B, one disk page can store 1000 directory records.
Therefore, a 2-level directory structure can cover 1000 × 100 = 100,000 records.

Therefore, among 100,000 records, we can quickly locate the right page via binary search over the directory entries, and then find the record within that page via another binary search. In other words, any of the 100,000 records can be found with about 2 disk page loads.
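A toy sketch of this two-step lookup, with hypothetical directory entries and page contents:

```python
# Binary-search the directory entries (min key per page) to pick the
# page, then binary-search within the chosen page.
import bisect

directory = [(1, 10), (5, 28), (12, 9), (209, 20)]  # (min key, page_no)
pages = {10: [1, 3, 4], 28: [5, 8, 9], 9: [12, 17, 20], 20: [209, 300]}

def find(key):
    min_keys = [k for k, _ in directory]
    # rightmost directory entry whose min key is <= the search key
    i = bisect.bisect_right(min_keys, key) - 1
    page_no = directory[i][1]
    j = bisect.bisect_left(pages[page_no], key)
    if j < len(pages[page_no]) and pages[page_no][j] == key:
        return page_no
    return None

print(find(20))  # 9
```

Only two "page loads" happen regardless of how many pages exist: one for the directory, one for the data page.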

1.5 A directory page over the directory entry pages

Following the above example, what if there are more than 100,000 records? A single directory entry page is certainly not enough. For example, with 100 million records, the arithmetic above requires 1000 directory entry pages:

At this time, to find a record you would have to scan the first-level directory entry pages one by one, loading a disk page each time, so the speed is again very slow.

Therefore, following the same idea, we can add another layer: a directory page over the directory entry pages, as shown below:

The search works as before, just with one more layer. Following the earlier arithmetic, the number of records that can now be stored is:
1000 (first-level directory page) × 1000 (second-level directory pages) × 100 = 100,000,000 records, that is, 100 million.

Of course, if the data volume is larger still, more layers can be added. One more layer can hold 100 billion records, which is already plenty for typical workloads, so an index generally does not exceed 4 layers.

1.6 B+ tree

The above has analyzed how to build a data structure for quickly finding data records in the database. This data structure looks roughly like the figure below.
The name of this data structure is the B+ tree.

Whether it is a data page storing user records or a data page storing directory item records, we store them in the data structure of the B+ tree, so we also call these data pages nodes. It can be seen from the figure that the actual user records are actually stored on the bottom nodes of the B+ tree. These nodes are also called leaf nodes, and the rest of the nodes used to store directory items are called non-leaf nodes or internal nodes. The node at the top of the B+ tree is also called the root node.

Under normal circumstances, the B+ tree we use will not exceed 4 layers!
Although the above example has been given, here is a summary.
Assume a data record is 160B, so a disk page (16KB) can store up to 100 records. Since a directory page only stores each child page's minimum primary key value and page number, one disk page can hold far more directory entries than data records; assume it holds 1000.

  1. If the B+ tree has only 1 layer: a disk page (16KB) can store up to 100 records.
  2. If the B+ tree has 2 layers: up to 1000 × 100 = 100,000 records (100 thousand)
  3. If the B+ tree has 3 layers: up to 1000 × 1000 × 100 = 100,000,000 records (100 million)
  4. If the B+ tree has 4 layers: up to 1000 × 1000 × 1000 × 100 = 100,000,000,000 records (100 billion)

Therefore, even for 100 billion records, at most 4 disk pages (3 directory entry pages and 1 user data page) need to be loaded to find the data by primary key value. Moreover, each page contains a Page Directory, so a record can be located within a page quickly by binary search, without traversing the linked list record by record.
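The per-layer arithmetic above can be checked with a few lines of Python (100 rows per leaf page and a fan-out of 1000 per directory page are the assumed figures from the text):

```python
# Maximum rows a B+ tree can hold: each non-leaf layer multiplies the
# page count by the fan-out; the leaves hold the actual rows.

def max_rows(levels, rows_per_leaf=100, fan_out=1000):
    return fan_out ** (levels - 1) * rows_per_leaf

for levels in (1, 2, 3, 4):
    print(levels, max_rows(levels))
# 1 100
# 2 100000         (100 thousand)
# 3 100000000      (100 million)
# 4 100000000000   (100 billion)
```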

2. Index overview

The first section has analyzed the entire process by which MySQL uses a B+ tree to retrieve data records. With that background, the concept of indexes and their advantages and disadvantages should be easier to understand; reading a pile of text descriptions up front tends to be very confusing.

So, why should we build an index at all?
From the above, the purpose of an index is to reduce the number of disk I/Os and speed up queries.

2.1 Index overview

An index is a data structure that helps MySQL obtain data efficiently; so, once again, an index is a data structure.

Indexes are implemented in the storage engine, so indexes are not necessarily identical across storage engines, and not every storage engine supports every index type.

At the same time, the storage engine can define the maximum number of indexes and the maximum index length for each table. All storage engines support at least 16 indexes per table, with a total index length of at least 256 bytes. Some storage engines support more indexes and larger index lengths.

2.2 Advantages of indexes

  1. Similar to the bibliographic index built by a university library, the main reason for creating an index is to improve the efficiency of data retrieval and reduce the IO cost of the database.
  2. By creating a unique index, the uniqueness of each row of data in the database table can be guaranteed.
  3. In terms of achieving referential integrity of data, it can speed up the joins between tables. In other words, the query speed can be improved when the dependent child table and the parent table are jointly queried.
  4. When using grouping and sorting clauses in a query, the time spent on grouping and sorting can be significantly reduced, lowering CPU consumption.

2.3 Disadvantages of indexes

  1. Index creation and index maintenance take time, and as the amount of data increases, the time consumed increases.
  2. Indexes need to occupy disk space. In addition to the data space occupied by the data table, each index also occupies a certain amount of physical space and is stored on the disk. If there are a large number of indexes, index files may reach their maximum file size sooner than data files.
  3. Although the index greatly improves the query speed, it will slow down the speed of updating the table. When adding, deleting and modifying the data in the table, the index should also be maintained dynamically, which reduces the speed of data maintenance.

3. Common concepts of indexing

According to the physical implementation of the index, it can be divided into two types: clustered index and non-clustered index. Non-clustered indexes are also called secondary indexes or auxiliary indexes.

3.1 Clustered Index

A clustered index is not a separate index type but a way of storing data (all user records are stored in the leaf nodes): the so-called "the index is the data, and the data is the index".

A clustered index does not require us to create it explicitly with an INDEX statement; the InnoDB storage engine automatically creates one for us.

Advantages :

  • Data access is faster . Because a clustered index stores the index and data in the same B+ tree, fetching data from a clustered index is faster than a non-clustered index.
  • Clustered indexes are very fast for primary key sort lookups and range lookups .
  • When a query returns a range of data in clustered-index order, the data is stored contiguously, so the database does not need to fetch data from many separate blocks, saving a lot of I/O operations.

Disadvantages :

  • The insertion speed depends heavily on the insertion order . Inserting in the order of the primary key is the fastest way, otherwise page splits will occur, seriously affecting performance. Therefore, for InnoDB tables, an auto-incrementing ID column is generally defined as the primary key .
  • Updating a primary key is expensive, because the updated row must be moved. Therefore, for InnoDB tables, the primary key is generally treated as non-updatable.
  • Secondary index access requires 2 index lookups . The primary key value is found for the first time, and the row data is found for the second time based on the primary key value.

Restrictions :

  • For the mysql database, only the innodb data engine currently supports clustered indexes, while myisam does not support clustered indexes.
  • Since there can only be one physical storage method for data, each mysql table can only have one clustered index . Usually it is the primary key of the table.
  • If no primary key is defined, InnoDB will choose a non-empty unique index instead. If there is no such index, InnoDB will implicitly define a primary key as a clustered index .
  • To make full use of the clustering property, the primary key column of an InnoDB table should use an ordered, sequential id whenever possible. Unordered values such as UUID, MD5, or hash values are not recommended, and string primary keys cannot guarantee monotonically increasing order.

3.2 Non-clustered index (secondary index, auxiliary index)

The clustered index introduced above can only work when the search condition is the primary key, because the data in the B+ tree is sorted according to the primary key. So what if we want to use other columns as search criteria?

You can build several more B+ trees .
The data in different B+ trees adopts different sorting rules. For example, the size of column c2 in the above example can be used as the data page to build another B+ tree.
The concept of going back to the table:
In the B+ tree sorted by column c2, we can only determine the primary key value of the record we are looking for. To get the complete user record, we still have to look it up once more in the clustered index. This process is called going back to the table.

Because a B+ tree built on a non-primary-key column requires this back-to-table operation to locate the complete user record, such a B+ tree is called a secondary index, or auxiliary index.
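A toy sketch of the back-to-table lookup, using the sample rows from earlier (the two dictionaries/lists stand in for the two B+ trees):

```python
# Secondary index on c2 stores only (c2 value, primary key); the full
# row is then fetched from the clustered index by primary key.

clustered = {1: (1, 4, 'u'), 3: (3, 9, 'd'), 5: (5, 3, 'y')}  # pk -> full row
secondary_c2 = [(3, 5), (4, 1), (9, 3)]                        # (c2, pk), sorted by c2

def find_by_c2(c2):
    rows = []
    for key, pk in secondary_c2:
        if key == c2:
            rows.append(clustered[pk])  # back to the table: second lookup by pk
    return rows

print(find_by_c2(9))  # [(3, 9, 'd')]
```

The second dictionary access is the "back to the table" step: the secondary index alone cannot produce the full row.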

The presence of a non-clustered index does not affect how the data is organized in the clustered index, so a table can have multiple non-clustered indexes.

summary:

  1. The leaf nodes of the clustered index store the complete user data records, while the leaf nodes of a non-clustered index store only what is needed to locate the data (in InnoDB, the primary key value). Non-clustered indexes do not affect the physical storage order of the table.
  2. A table can only have one clustered index , because there can only be one way of sorting and storing, but there can be multiple non-clustered indexes , that is, multiple index directories provide data retrieval.
  3. When using a clustered index, the data query efficiency is high , but if the data is inserted, deleted, updated, etc., the efficiency will be lower than that of the non-clustered index.

3.3 Joint Index

A joint index can be understood as a type of non-clustered index, except that it indexes multiple columns at the same time.

For example, use the c2 and c3 columns introduced above to create an index:
insert image description here
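A minimal sketch of the ordering a joint index on (c2, c3) maintains (the sample rows are illustrative):

```python
# A joint-index entry is (c2, c3, primary key); entries are compared
# column by column, left to right: first by c2, then by c3 within
# equal c2 values.

rows = [(1, 4, 'u'), (3, 9, 'd'), (5, 3, 'y'), (7, 4, 'a')]  # (c1, c2, c3)

entries = sorted((c2, c3, c1) for c1, c2, c3 in rows)
print(entries)  # [(3, 'y', 5), (4, 'a', 7), (4, 'u', 1), (9, 'd', 3)]
```

Note the two entries with c2 = 4: they are further ordered by c3 ('a' before 'u'), which is why a joint index can serve queries that filter on c2 alone or on c2 and c3 together.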

4. Precautions for InnoDB's B+ tree index

4.1 The location of the root page remains unchanged for ten thousand years

The root node of a B+ tree index never moves after it is created. That is, whenever a B+ tree index is created for a table, a root page is allocated. At first it stores user records; once the page fills up and a page split occurs, the user records move down to the second layer, and the root page becomes a directory entry page.

A more intuitive way to put it: the B+ tree introduced above grows downward from a fixed root.

4.2 Uniqueness of directory entry records in internal nodes

If the directory entry records in the non-leaf (internal) nodes were completely identical, as shown in the figure below, then for a new record such as (0, 1, 'c') there would be no way to tell which page it should be inserted into.

Therefore, we must ensure that the directory entry records of nodes in the same layer of the B+ tree are unique (apart from the page number field). This is done by adding the primary key value to the directory entry, so each internal-node directory entry consists of:

  • the value of the indexed column
  • primary key value
  • page number
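A small sketch of why appending the primary key restores uniqueness (the entries mirror the figure, where every page's minimum c2 value is 1; the page numbers are illustrative):

```python
# Directory entries compared as (indexed column value, primary key):
# even with duplicate c2 values, each (c2, pk) pair picks one page.
import bisect

directory = [(1, 1, 4), (1, 4, 5), (1, 10, 6)]  # (c2 value, pk, page_no)

def page_for(c2, pk):
    keys = [(v, k) for v, k, _ in directory]
    # rightmost entry whose (c2, pk) is <= the new record's (c2, pk)
    i = bisect.bisect_right(keys, (c2, pk)) - 1
    return directory[max(i, 0)][2]

print(page_for(1, 0))   # 4: (1, 0) sorts before every entry
print(page_for(1, 5))   # 5: between (1, 4) and (1, 10)
```

Without the primary key, all three entries would compare equal on c2 = 1 and the target page would be ambiguous.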


4.3 A page stores at least 2 records

A data page of InnoDB stores at least 2 records, otherwise the B+ tree structure scheme introduced above would be meaningless.

5. Index scheme in MyISAM

5.1 The principle of MyISAM index

The MyISAM engine uses a B+ tree as an index structure, but the data field of its leaf nodes stores the addresses of data records .

In InnoDB, the index is the data (.ibd): the leaf nodes of the clustered index's B+ tree contain the complete user data records.
Although MyISAM also uses a tree structure, it stores indexes and data separately .

  1. MyISAM stores the records of a table in a separate file, in insertion order, called the data file (.MYD). Since the data is not kept sorted by primary key at insert time, binary search cannot be used directly on the data file.
  2. MyISAM stores index information in a file called the index file (.MYI). MyISAM creates an index for the table's primary key, but the leaf nodes of that index store not the complete user record, but the primary key value + the address of the data record.
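The difference can be sketched with two toy lookups (record addresses are modeled as list offsets into the .MYD file; the sample rows come from the earlier example):

```python
# MyISAM: index leaves hold an address into the data file.
# InnoDB: clustered-index leaves hold the row itself.

# MyISAM: data file in insertion order; index maps pk -> record address
myisam_data = [(3, 9, 'd'), (1, 4, 'u'), (5, 3, 'y')]  # .MYD, insertion order
myisam_index = {1: 1, 3: 0, 5: 2}                      # .MYI: pk -> address

# InnoDB: clustered index leaves contain the complete rows (.ibd)
innodb_clustered = {1: (1, 4, 'u'), 3: (3, 9, 'd'), 5: (5, 3, 'y')}

pk = 5
print(myisam_data[myisam_index[pk]])  # extra hop through the address
print(innodb_clustered[pk])           # the row lives in the index itself
```

Both return the same row, but MyISAM always takes the extra address-dereferencing hop, while InnoDB's primary-key lookup ends at the leaf.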

The following figure shows the storage format of an index file with col1 as the primary key.
The following figure is a secondary index built with col2.

5.2 Comparison between MyISAM and InnoDB

The indexes of MyISAM are all non-clustered; InnoDB has a clustered index in addition to non-clustered ones.

  1. InnoDB's data file is itself an index file (.ibd). MyISAM's index file (.MYI) and data file (.MYD) are separate, and the index file stores only the addresses of data records.
  2. If InnoDB searches the clustered index based on the primary key value, it only needs to find the user data record once. However, since the MyISAM index file stores the address of the user data record, there must be a table return operation .
  3. InnoDB's non-clustered index stores the primary key value of the data record, and then needs to go back to the table to find the data record through the primary key value. The MyISAM index records the address of the user record, so the return table operation of MyISAM is definitely faster than that of InnoDB .
  4. InnoDB requires every table to have a primary key. If none is explicitly specified, it automatically selects a non-null column that uniquely identifies records; if none is found, it generates an implicit 6-byte long-integer field as the primary key. MyISAM has no such requirement.

Summary:
Understanding the index implementation methods of different storage engines is very helpful for the correct use and optimization of indexes.
Example 1: After knowing the index implementation of InnoDB, it is easy to understand why it is not recommended to use too long fields as primary keys . Because all secondary indexes refer to the primary key index, a long primary key can make the secondary index too large.
Example 2: In InnoDB, it is not a good idea to use non-monotonic fields as primary keys. The non-monotonic primary key will cause the data file to be frequently split and adjusted to maintain the characteristics of the B+ tree when inserting a new record, which is very inefficient. Using an auto-increment field as the primary key is a good choice .


Origin blog.csdn.net/xueping_wu/article/details/125351669