[MySQL] clustered index and non-clustered index (secondary index)

The difference between the two

  • Clustered index: the data and the index are stored together. The leaf nodes of the index structure hold the row data, so finding the index entry means finding the data.

  • Non-clustered index: the data and the index are stored separately, and the leaf nodes of the index structure point to the location of the corresponding row. MyISAM caches index blocks in memory in the key buffer (key_buffer). When data is accessed through an index, the index is searched in memory first, and the corresponding row is then read from disk using the address found in the index. This is why index lookups are slow when the key buffer is missed.

In InnoDB, every index other than the clustered index is called a secondary (auxiliary) index; composite indexes, prefix indexes, and unique indexes are all secondary indexes. A leaf node of a secondary index does not store the physical location of the row but the primary key value, so reading a row through a secondary index requires a second lookup in the clustered index (unless the secondary index already contains every column the query needs).

When to use clustered and non-clustered indexes

[Image not reproduced: comparison of when to use clustered and non-clustered indexes]

Clustered index is unique

A table can have only one clustered index, because the clustered index stores the row data inside the index structure itself.
The rows of the table are kept in the same order as the entries of the clustered index. Define the clustered index (in InnoDB, the primary key) before creating any non-clustered index, because the clustered index determines the order in which rows are stored, and that order is then maintained automatically.

A misunderstanding: the primary key is automatically set to a clustered index

By default the clustered index is the primary key. If the table defines no primary key, InnoDB uses the first unique index whose columns are all NOT NULL instead. If there is no such index either, InnoDB implicitly creates a hidden row ID and clusters the table on it. InnoDB only clusters records within the same page; pages holding adjacent key values may be far apart on disk. Because InnoDB always clusters on the primary key, clustering on a different column means replacing the primary key: drop the existing primary key and add a new one on the columns you want to cluster by, which rebuilds the table.

Once the clustered index has been taken, every other index can only be non-clustered. This is the biggest pitfall: some tables use a meaningless auto-increment column as the primary key, and in that case the clustering benefit is spent on a column that queries rarely search or range over.

As mentioned earlier, the clustered index gives the best access performance and there can be only one per table, so it is precious and must be chosen carefully. In general, choose it according to the SQL queries most commonly run against the table; whether a single column or a combination of columns should be the clustering key depends on the actual workload.

Remember that the ultimate goal is to reduce logical I/O as much as possible for the same result set.

Take a closer look at the picture

[Images not reproduced: index structure diagrams for InnoDB (clustered) and MyISAM (non-clustered)]

  1. InnoDB uses a clustered index: the rows are organized by primary key into a B+ tree, and the row data is stored in the leaf nodes. For a primary key condition such as "where id = 14", the B+ tree search algorithm finds the corresponding leaf node, and the row data is read from it directly.
  2. A search on the Name column takes two steps. First, the Name value is looked up in the secondary index B+ tree, whose leaf node yields the corresponding primary key. Second, that primary key is used for another lookup in the primary (clustered) index B+ tree, which finally reaches the leaf node holding the entire row. (The key point is that a secondary index reaches the row through the primary key.)
  3. MyISAM uses non-clustered indexes. Its primary key index and its secondary indexes are B+ trees with exactly the same node structure; only the contents differ. The primary key index B+ tree stores primary keys, and a secondary index B+ tree stores secondary keys. The table data is stored in a separate place, and the leaf nodes of both trees use an address to point to the real row. As far as the table data is concerned, there is no difference between the two kinds of key. Because the index trees are independent, a search by secondary key does not need to go through the primary key index tree.
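The three lookup paths above can be sketched with a toy model. This is only an illustration under simplifying assumptions: Python dicts stand in for B+ trees, a list stands in for the MyISAM data file, and the table contents and function names are made up.

```python
# Toy model: Python dicts stand in for B+ trees. Table contents are made up.

# InnoDB style: the clustered index maps primary key -> the full row.
clustered = {
    14: {"id": 14, "name": "Alice"},
    27: {"id": 27, "name": "Bob"},
}
# A secondary index on name maps name -> primary key, not a row address.
secondary_name = {"Alice": 14, "Bob": 27}


def innodb_by_id(pk):
    """One tree traversal: the leaf of the clustered index is the row."""
    return clustered[pk], 1


def innodb_by_name(name):
    """Two tree traversals: secondary index first, then the clustered index."""
    pk = secondary_name[name]
    return clustered[pk], 2


# MyISAM style: rows live in a separate data file (a list here), and both
# indexes store the row's address (its position in the list).
data_file = [{"id": 27, "name": "Bob"}, {"id": 14, "name": "Alice"}]
myisam_primary = {27: 0, 14: 1}
myisam_name = {"Bob": 0, "Alice": 1}


def myisam_by_id(pk):
    """One tree traversal, then a read of the data file at the stored address."""
    return data_file[myisam_primary[pk]], 1


def myisam_by_name(name):
    """Same shape as the primary key lookup: the trees are independent."""
    return data_file[myisam_name[name]], 1
```

Each function returns the row together with the number of index trees it had to traverse, which is the difference the three points describe.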

Advantages of clustered index

At first glance the clustered index looks less efficient than the non-clustered design, because every search through a secondary index needs two B+ tree lookups. Isn't that extra work? What are the advantages of a clustered index?
1. Because the row data is stored in the leaf nodes, one page holds multiple rows. When different rows of the same data page are accessed, the page is already loaded into the buffer, so later accesses are served from memory without touching the disk. The primary key and the row data are loaded into memory together, and the row can be returned as soon as the leaf node is found. If data is accessed by primary key, it is therefore obtained faster.
2. A secondary index uses the primary key value as its "pointer" instead of a row address. This reduces the maintenance work on secondary indexes when a row moves or a data page splits. Storing the primary key value makes the secondary index occupy more space, but in exchange InnoDB does not need to update the "pointer" in the secondary index when it moves a row. The position of a row (located by a 16 KB page in the implementation) changes as data is modified (B+ tree node splits and page splits), and using the primary key as the pointer ensures that the secondary index trees are unaffected no matter how the nodes of the clustered B+ tree change.
3. A clustered index suits sorting on the clustering key; a non-clustered index is less suitable.
4. A clustered index is a good fit for retrieving a range of data.
5. The trade-off is that a secondary index needs two index lookups rather than one to reach the data: the storage engine first searches the secondary index to find the leaf node holding the primary key, and then uses that primary key to search the clustered index and find the row.
6. Related data can be stored together. For example, in a mailbox application the data can be clustered by user ID, so that only a few data pages need to be read from disk to fetch all the mail of one user. Without a clustered index, each message may cost one disk I/O.
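Point 2 can be made concrete with a small simulation. It is a minimal sketch under stated assumptions: a page is a Python list of rows, a row address is a hypothetical (page number, slot) pair, and the helper names are invented for this example. After a page split, the index that stores addresses has stale entries, while the index that stores primary keys needs no update.

```python
def split_page(pages, page_no):
    """Move the upper half of a page into a new page; return the moved ids."""
    rows = pages[page_no]
    mid = len(rows) // 2
    pages[page_no] = rows[:mid]
    pages.append(rows[mid:])
    return [row["id"] for row in rows[mid:]]


def addresses(pages):
    """Map each name to its current (page number, slot) address."""
    return {
        row["name"]: (page_no, slot)
        for page_no, page in enumerate(pages)
        for slot, row in enumerate(page)
    }


def find_by_pk(pages, pk):
    """Stand-in for a clustered index lookup by primary key."""
    for page in pages:
        for row in page:
            if row["id"] == pk:
                return row
    return None


def stale_entries_after_split():
    """Return (moved ids, stale address entries, stale primary key entries)."""
    pages = [[{"id": i, "name": f"user{i}"} for i in range(1, 5)]]
    sec_by_pk = {row["name"]: row["id"] for row in pages[0]}  # InnoDB style
    sec_by_addr = addresses(pages)                            # address style

    moved = split_page(pages, 0)
    current = addresses(pages)
    stale_addr = sorted(n for n, a in sec_by_addr.items() if current[n] != a)
    stale_pk = sorted(
        n for n, pk in sec_by_pk.items() if find_by_pk(pages, pk)["name"] != n
    )
    return moved, stale_addr, stale_pk


moved, stale_addr, stale_pk = stale_entries_after_split()
print("rows moved:", moved)
print("address entries to rewrite:", stale_addr)
print("primary key entries to rewrite:", stale_pk)
```

The cost of this design is the second lookup described in point 5: the primary key still has to be resolved through the clustered index.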

Disadvantages of clustered index

1. Maintaining the index is expensive, especially when inserting new rows or updating the primary key causes page splits. Because the rows that have to be moved may cause fragmentation, it is recommended to run OPTIMIZE TABLE during a period of low load after inserting a large number of new rows. Using a dedicated (file-per-table) tablespace can reduce the impact of fragmentation.
2. If the table uses a UUID (a random ID) as the primary key, rows are inserted at random positions and the data becomes sparsely stored, so access through the clustered index may end up slower than a full table scan. An auto-increment INT primary key is recommended instead.
With an auto-increment key the primary key values are sequential, so InnoDB stores each record right after the previous one. When the page reaches its maximum fill factor (by default InnoDB fills 15/16 of the page, leaving some space for later modifications), the next record is written to a new page. Once data has been loaded in this order, the primary key pages are filled almost completely with sequential records (secondary index pages may differ).
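The difference between in-order and random primary keys can be illustrated with a small simulation. It is a simplified sketch, not InnoDB's actual algorithm: the page capacity of 16 records is a made-up number, an in-order insert at the end of the table stops filling a page at 15/16 and starts a new one, and any other insert into a full page splits that page in half.

```python
import bisect
import random

PAGE_CAPACITY = 16    # records per page; a made-up number for illustration
SEQ_FILL_LIMIT = 15   # in-order inserts stop at 15/16 of the page


def load(keys):
    """Insert unique keys one by one; return (pages, number of page splits)."""
    pages, splits = [], 0
    for key in keys:
        if not pages:
            pages.append([key])
            continue
        # Find the page whose key range covers this key.
        firsts = [page[0] for page in pages]
        idx = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[idx]
        # In-order insert at the very end of the table: no rows are moved.
        if idx == len(pages) - 1 and key > page[-1]:
            if len(page) >= SEQ_FILL_LIMIT:
                pages.append([key])
            else:
                page.append(key)
            continue
        # Insert into the middle of a full page: split it in half first.
        if len(page) >= PAGE_CAPACITY:
            mid = len(page) // 2
            left, right = page[:mid], page[mid:]
            pages[idx:idx + 1] = [left, right]
            splits += 1
            page = left if key < right[0] else right
        bisect.insort(page, key)
    return pages, splits


def fill_ratio(pages):
    """Average fraction of each page that is occupied."""
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)


sequential = list(range(1, 1001))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # stands in for random UUID-like keys

for label, keys in [("sequential", sequential), ("random", shuffled)]:
    pages, splits = load(keys)
    print(label, "pages:", len(pages), "splits:", splits,
          "fill:", round(fill_ratio(pages), 2))
```

In this model the sequential load never splits a page and leaves pages almost full, while the random load splits repeatedly and leaves pages partly empty, which is the fragmentation described above.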

If the primary key is relatively large, every secondary index becomes larger too, because the leaves of a secondary index store the primary key value. An overly long primary key also makes the non-leaf nodes occupy more physical space.
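A rough calculation shows the effect. The figures below are assumptions for illustration only: one million rows, a 20-byte indexed column, and a UUID stored as CHAR(36) in a single-byte character set. Record headers and page overhead are ignored, so the results are lower bounds on the leaf payload, not real index sizes.

```python
def secondary_leaf_bytes(rows, indexed_column_bytes, primary_key_bytes):
    """Leaf payload of a secondary index: each entry holds key + primary key."""
    return rows * (indexed_column_bytes + primary_key_bytes)


ROWS = 1_000_000       # hypothetical table size
NAME_BYTES = 20        # hypothetical width of the indexed column

# INT is 4 bytes and BIGINT is 8 bytes in MySQL; the CHAR(36) figure assumes
# a single-byte character set.
for label, pk_bytes in [("INT", 4), ("BIGINT", 8), ("CHAR(36) UUID", 36)]:
    size = secondary_leaf_bytes(ROWS, NAME_BYTES, pk_bytes)
    print(f"{label}: {size / 1_000_000:.0f} MB of leaf payload")
```

The penalty is paid once per secondary index, so a wide primary key costs more the more secondary indexes a table has.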

Why an auto-increment id is usually recommended as the primary key

With a clustered index, the storage order of the data follows the index order: rows whose keys are adjacent are stored next to each other within a page. If the primary key is not an auto-increment id, InnoDB has to keep moving rows and splitting pages to make room for new keys. There are measures that reduce these operations, but they cannot be avoided completely. With an auto-increment key it is simple: rows are written page after page, the index structure stays compact, fragmentation is low, and efficiency is high.

Because the primary index of MyISAM is not a clustered index, the physical addresses of its rows are scattered. After these addresses have been collected from the index, the rows are read with whatever I/O scheduling applies, so the disk keeps seeking and rotating. With a clustered index, rows with adjacent keys can often be read with a single I/O. (A stark contrast.)

However, for operations such as sorting, full table scans, and counting over large amounts of data, MyISAM still has an advantage: its indexes take up less space, so these operations can be completed in memory.

Setting of clustered index in MySQL

By default the clustered index is the primary key. If the table defines no primary key, InnoDB uses the first unique index whose columns are all NOT NULL instead. If there is no such index either, InnoDB implicitly creates a hidden row ID and clusters the table on it. InnoDB only clusters records within the same page; pages holding adjacent key values may be far apart on disk.
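The selection rule can be written as a short function. This is a sketch of the rule as described in the MySQL documentation, not InnoDB code: the function name and the dictionary layout used to describe indexes are hypothetical, while "PRIMARY" and "GEN_CLUST_INDEX" are the names MySQL uses for the primary key index and the hidden clustered index.

```python
def choose_clustered_index(has_primary_key, indexes):
    """Return the name of the index InnoDB would cluster the table on.

    `indexes` is a list in definition order; each entry is a dict such as
    {"name": "uk_email", "unique": True, "columns": [("email", False)]},
    where the bool in each column tuple says whether the column is nullable.
    """
    if has_primary_key:
        return "PRIMARY"
    for index in indexes:
        all_not_null = all(not nullable for _, nullable in index["columns"])
        if index["unique"] and all_not_null:
            return index["name"]  # first UNIQUE index with only NOT NULL columns
    return "GEN_CLUST_INDEX"      # hidden index on a generated row ID


example_indexes = [
    {"name": "uk_nickname", "unique": True, "columns": [("nickname", True)]},
    {"name": "idx_age", "unique": False, "columns": [("age", False)]},
    {"name": "uk_email", "unique": True, "columns": [("email", False)]},
]
print(choose_clustered_index(False, example_indexes))
```

In the example, uk_nickname is skipped because its column is nullable and idx_age is skipped because it is not unique, so the table would be clustered on uk_email.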

About the difference between the two commonly used MySQL engines: The difference between MyISAM and InnoDB
Original address: https://www.jianshu.com/p/fa8192853184

Origin blog.csdn.net/dl962454/article/details/114382231