[MySQL] MySQL index - the difference between clustered index and non-clustered index


1. The concept of clustered index and non-clustered index

Database table indexes can be divided into clustered and non-clustered indexes according to how the data is stored. "Clustered" means that data rows are stored physically close together in key order. Of MySQL's two best-known storage engines, InnoDB stores table data in a clustered index, while MyISAM uses non-clustered indexes.

A clustered index is not a separate index type but a way of storing data. When a table has a clustered index, its data rows are stored in the leaf pages of the index tree. Rows cannot be stored in two places at once, so a table can have only one clustered index. InnoDB's clustered index stores the index and the data in the same B-Tree. InnoDB clusters data by the primary key. If no primary key is defined, InnoDB uses the first unique index whose columns are all NOT NULL instead. If there is no such index, InnoDB implicitly creates a hidden row ID and clusters on that.
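The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not InnoDB's actual code; `IndexDef`, `choose_clustering_key`, and the `"<hidden row id>"` label are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class IndexDef:
    """Hypothetical description of one index on a table."""
    name: str
    columns: tuple
    unique: bool = False
    primary: bool = False


def choose_clustering_key(indexes, nullable_columns):
    """Return the name of the index the table would be clustered on."""
    # 1. An explicit primary key always wins.
    for idx in indexes:
        if idx.primary:
            return idx.name
    # 2. Otherwise, the first unique index whose columns are all NOT NULL.
    for idx in indexes:
        if idx.unique and not any(c in nullable_columns for c in idx.columns):
            return idx.name
    # 3. Otherwise, fall back to a hidden row ID.
    return "<hidden row id>"


indexes = [
    IndexDef("uk_email", ("email",), unique=True),   # email is nullable
    IndexDef("uk_code", ("code",), unique=True),     # code is NOT NULL
    IndexDef("idx_name", ("name",)),
]
chosen = choose_clustering_key(indexes, nullable_columns={"email", "name"})
```

Here `chosen` is `"uk_code"`: `uk_email` is skipped because its column is nullable, and `idx_name` is not unique.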

A non-clustered index in InnoDB is also known as a secondary index. The leaf nodes of a secondary index store not a physical pointer to the row but the row's primary key value. To look up a row through a secondary index, the storage engine first finds the matching leaf node in the secondary index to obtain the primary key value, then uses that value to find the row in the clustered index. This takes two B-Tree lookups.
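As a rough illustration of the two lookups, the sketch below uses plain Python dicts as stand-ins for the two B-Trees. The table contents and the names `clustered`, `name_index`, and `find_by_name` are made up for this example, and names are assumed to be unique to keep it short.

```python
# Clustered index: primary key -> full row (the leaf holds the row itself).
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 35},
}

# Secondary index on name: indexed value -> primary key (not a row pointer).
name_index = {row["name"]: pk for pk, row in clustered.items()}


def find_by_name(name):
    """Look a row up through the secondary index."""
    pk = name_index.get(name)   # lookup 1: secondary index gives the primary key
    if pk is None:
        return None
    return clustered[pk]        # lookup 2: clustered index gives the row


row = find_by_name("bob")
```

A query that needs only `id` and `name` could stop after the first lookup, which is why covering index scans avoid the second step.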

2. The two in detail

2.1 Clustered Index

Because clustered and non-clustered indexes are essentially data storage methods, they have to be explained in terms of a concrete engine: InnoDB for clustered indexes and MyISAM for non-clustered indexes. The diagrams below are all taken from "High Performance MySQL".

For the InnoDB engine, data is stored in the form of a clustered index:
[Figure: InnoDB clustered index layout]

Each leaf node of the clustered index contains the primary key value, the transaction ID, the rollback pointer (used for transactions and MVCC), and the remaining columns. The physical files show the same thing: an InnoDB table has only a table structure file (.frm) and a data file (.ibd), and the .ibd file stores the data and the index information together.

InnoDB's secondary indexes are very different from its primary key index. A secondary index stores the primary key value instead of a row pointer, which reduces the cost of maintaining secondary indexes when rows move or pages split, because there are no row pointers in the index to update.
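The maintenance argument can be made concrete with a toy model. The sketch below keeps two hypothetical secondary indexes over the same rows, one storing physical (page, slot) positions and one storing primary keys, and then moves a row to another page. All names and the page layout are assumptions made for illustration.

```python
# Physical storage: page number -> rows stored in that page.
pages = {
    0: [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    1: [],
}

# Pointer-style secondary index: name -> (page, slot).
pointer_index = {"alice": (0, 0), "bob": (0, 1)}
# Primary-key-style secondary index (the InnoDB approach): name -> primary key.
pk_index = {"alice": 1, "bob": 2}


def locate_by_pk(pk):
    """Stand-in for a clustered index search: find the row wherever it lives."""
    for rows in pages.values():
        for row in rows:
            if row["id"] == pk:
                return row
    return None


def move_row(pk, to_page):
    """Move a row to another page, as a page split or row relocation might."""
    for rows in pages.values():
        for slot, row in enumerate(rows):
            if row["id"] == pk:
                pages[to_page].append(rows.pop(slot))
                return


move_row(2, to_page=1)

# The stored position no longer holds bob, so this index would need an update.
page_no, slot = pointer_index["bob"]
pointer_is_stale = (
    slot >= len(pages[page_no]) or pages[page_no][slot]["name"] != "bob"
)

# The primary key entry is untouched and still resolves to the right row.
row_via_pk = locate_by_pk(pk_index["bob"])
```

The price of this stability is the second lookup through the clustered index on every secondary index read.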

[Figure: InnoDB secondary index layout]

  • Features of the clustered index:

    • Advantages:

      1. Related data is kept together. For example, in a mailbox application, clustering by user ID means that only a few data pages have to be read from disk to fetch all of one user's emails. Without a clustered index, each email might require its own disk I/O.
      2. Data access is faster. The clustered index stores the index and the data in the same B-Tree, so fetching a row from a clustered index is usually faster than the equivalent lookup in a non-clustered index.
      3. Queries that use covering index scans can use the primary key values stored in the leaf nodes directly.
    • Disadvantages:

      1. Clustering gives the biggest gain to I/O-bound workloads. If the data fits entirely in memory, the order of access matters little and clustering offers no advantage.
      2. Insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. If the data was not loaded in primary key order, it is best to reorganize the table with OPTIMIZE TABLE after loading.
      3. Updating clustered index columns is expensive because it forces InnoDB to move each updated row to a new location.
      4. Inserting a new row, or updating a primary key so that the row has to move, can cause page splits. When a row's primary key value requires it to be placed in a page that is already full, the storage engine splits that page in two to make room. Page splits cause the table to occupy more disk space.
      5. Clustered indexes can make full table scans slower, especially if rows are sparsely packed or stored non-contiguously because of page splits.
      6. Secondary indexes can be larger than expected, because their leaf nodes contain the primary key columns of the rows they reference.
      7. Secondary index access requires two index lookups instead of one.
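The effect of insertion order on page splits can be seen in a small simulation. The sketch below models only the leaf level of a clustered index as a list of sorted pages. The page capacity, the split-in-half policy, and the "append a fresh page when inserting past the end" shortcut are simplifying assumptions for illustration, not a description of InnoDB internals.

```python
import bisect
import random

PAGE_CAPACITY = 10  # hypothetical number of rows that fit in one page


def load(keys):
    """Insert distinct keys one by one; return (page_count, split_count)."""
    pages = []   # leaf pages in key order, each a sorted list of keys
    splits = 0
    for key in keys:
        if not pages:
            pages.append([key])
            continue
        # Find the page whose key range should hold this key.
        first_keys = [page[0] for page in pages]
        i = max(bisect.bisect_right(first_keys, key) - 1, 0)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Inserting past the end of the last page: start a new page.
            pages.append([key])
        else:
            # The target page is full: split it in half, then insert.
            mid = PAGE_CAPACITY // 2
            left, right = page[:mid], page[mid:]
            bisect.insort(left if key < right[0] else right, key)
            pages[i:i + 1] = [left, right]
            splits += 1
    return len(pages), splits


# Load 1000 keys in primary key order.
pages_sequential, splits_sequential = load(range(1000))

# Load the same keys in a fixed pseudo-random order.
shuffled = list(range(1000))
random.Random(42).shuffle(shuffled)
pages_random, splits_random = load(shuffled)
```

In this model the ordered load fills every page completely and never splits, while the shuffled load splits pages repeatedly and leaves them partly empty, so the same rows occupy more pages.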

2.2 Non-clustered index

For the MyISAM engine, data is stored in the form of a non-clustered index:

Raw data:
[Figure: raw data]

Storage method:

[Figure: MyISAM storage layout]

MyISAM organizes each index by column value and row number, and its leaf nodes store pointers to the data. The physical files show the same separation: the index file (.MYI) and the data file (.MYD) are stored separately and are relatively independent.

Example: how select * from user where id = 1 is executed:
1. Look in the .MYI index file of the user table for an index tree on id.
2. Search that index tree by the id value to find the matching leaf node. The leaf node stores the index value and a data address, which points to a specific row in the table's .MYD data file.
3. Use the data address to read the row from the .MYD file and return it.
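The steps above can be sketched with an in-memory byte buffer standing in for the .MYD file and a dict standing in for the id index in the .MYI file. The fixed-width record format and the names `myd`, `myi`, and `select_by_id` are assumptions made for this example and do not reflect MyISAM's real on-disk format.

```python
import struct

# Hypothetical fixed-width record: 4-byte id followed by a 16-byte name.
RECORD = struct.Struct("<i16s")

# Rows are appended to the data file in insertion order, not key order.
rows = [(3, b"carol"), (1, b"alice"), (2, b"bob")]

myd = bytearray()   # stand-in for the .MYD data file
myi = {}            # stand-in for the id index: id -> byte offset in myd
for row_id, name in rows:
    myi[row_id] = len(myd)
    myd += RECORD.pack(row_id, name)


def select_by_id(row_id):
    """Mimic: select * from user where id = row_id."""
    offset = myi.get(row_id)                       # step 2: index gives the data address
    if offset is None:
        return None
    rid, raw = RECORD.unpack_from(myd, offset)     # step 3: read the row at that address
    return rid, raw.rstrip(b"\x00").decode()


result = select_by_id(1)
```

Note that the index entry holds a position in the data file rather than the row itself, which is the defining trait of the non-clustered layout.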


3. The difference between the two

3.1 Data storage method

The most visible difference is in how the data is stored on disk. In versions before MySQL 8.0, which removed .frm files, InnoDB (clustered) and MyISAM (non-clustered) tables use the following files:

With InnoDB, the data directory contains two types of files: .frm and .ibd
(1) *.frm: the table structure file.
(2) *.ibd: the table data file, which also holds the indexes.

With MyISAM, the data directory contains three types of files: .frm, .MYI, and .MYD
(1) *.frm: the table definition, a file describing the table structure.
(2) *.MYD: the "D" (data) file, which holds the table's rows.
(3) *.MYI: the "I" (index) file, which holds the index trees for the table.

In the following illustration, test1 uses InnoDB and test2 uses MyISAM:

[Figure: data directory files for test1 and test2]

The difference between the storage methods of clustered index and non-clustered index:

  1. In MyISAM, indexes and data are stored in separate files, while in InnoDB they are stored together in a single .ibd file.
  2. For lookups by primary key, a clustered index is faster than a non-clustered one. With a non-clustered index, the engine first searches the index file to get the row's address and then fetches the row from the data file. With a clustered index, the leaf nodes of the index tree contain the data rows themselves.

3.2 Secondary index query

The primary key B+Tree of InnoDB (its clustered index), the primary key B+Tree of MyISAM, and the secondary index B+Tree of MyISAM all have this structure:

[Figure: B+Tree index structure]

InnoDB's secondary index B+Tree, however, looks like this:

[Figure: InnoDB secondary index B+Tree]

It can be concluded that
  when a query uses a secondary index, InnoDB first obtains the row's primary key value from the secondary index B+Tree and then looks up the row in the primary key index tree. Secondary index lookups are therefore comparatively more expensive in InnoDB.
  InnoDB does have an optimization for this. It is not controlled by the user per query but applied automatically by the engine: when certain index values are looked up frequently, InnoDB builds an adaptive hash index for them.

The figure from "High Performance MySQL" shows the difference more clearly:

[Figure: InnoDB and MyISAM index layouts compared]

The figure shows that the leaf nodes of an InnoDB secondary index store the key columns plus the primary key value, so a lookup first finds the primary key value in the secondary index and then uses it to find the row in the primary key index. MyISAM's secondary indexes, by contrast, store column values together with row numbers, and their leaf nodes point to the physical rows. The structure of MyISAM's primary index is therefore no different from that of its secondary indexes, except that the values in the primary key index must be unique and non-null. A MyISAM table does not need a primary key. An InnoDB table always needs a clustering key: if no primary key is defined, InnoDB falls back to a unique NOT NULL index or a hidden row ID, as described in section 1.

Origin blog.csdn.net/u011397981/article/details/130686424