Difference between clustered index and non-clustered index

Excerpted and compiled from other sources; not original.

Official definition:

In the book "Database Principles", the difference between a clustered index and a non-clustered index is explained as follows:
The leaf nodes of the clustered index are the data nodes.
The leaf nodes of a non-clustered index are still index nodes, but have pointers to the corresponding data blocks.

 

Clustered index: The table data is stored in index order, that is, the order of the index entries matches the physical order of the records in the table. In a clustered index the leaf nodes store the actual data rows, so there are no separate data pages. A table can have at most one clustered index, because the data can have only one physical order.

"Clustered" means that the actual data rows and their associated key values are stored together.

Secondary index on a table with a clustered index: The leaf node does not store the physical location of the row it references; it stores the row's primary key value.
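To make "table data is stored in the order of the index" concrete, here is a minimal sketch. It assumes a toy table held as Python lists kept sorted by primary key; the names ClusteredTable, insert, and range_scan are hypothetical, the sample rows are made up, and a real engine uses B+ tree pages rather than one flat list.

```python
import bisect


class ClusteredTable:
    """Toy clustered table: rows live in one list, sorted by primary key."""

    def __init__(self):
        self.keys = []  # primary keys, always kept sorted
        self.rows = []  # rows[i] is the row whose key is keys[i]

    def insert(self, key, row):
        # Find the slot that keeps keys sorted; the row goes into the same
        # slot, so storage order and index order are the same thing.
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            raise KeyError(f"duplicate primary key {key}")
        self.keys.insert(pos, key)
        self.rows.insert(pos, row)

    def range_scan(self, low, high):
        # Rows with keys in [low, high] form one contiguous slice.
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.rows[start:end]


table = ClusteredTable()
for key, name in [(15, "Alice"), (3, "Bob"), (9, "Carol"), (12, "Dave")]:
    table.insert(key, {"id": key, "name": name})

print(table.keys)               # [3, 9, 12, 15]
print(table.range_scan(5, 12))  # the rows with id 9 and id 12
```

Because the rows themselves are ordered by key, a range query on the primary key reads neighbouring rows instead of jumping around, which is also why only one such ordering can exist per table.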

 

Note: Because the physical storage order of the data follows the index order, rows whose index keys are adjacent are also stored next to each other. If the primary key is not an auto-increment ID, new rows arrive at arbitrary positions in that order, so the engine has to keep moving rows and splitting pages. There are measures that reduce this work, but it cannot be avoided completely. With an auto-increment key the engine simply writes page after page, so the index structure stays compact, fragmentation is low, and inserts are efficient.
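The cost difference can be sketched with a small simulation. This is a toy model under stated assumptions, not InnoDB's actual page management: PAGE_CAPACITY, count_page_splits, and the 4-row page size are hypothetical, and pages are plain sorted lists.

```python
import random

PAGE_CAPACITY = 4  # hypothetical rows per page, chosen small for illustration


def count_page_splits(keys):
    """Insert unique keys into sorted, fixed-capacity pages; count mid-page splits."""
    pages = [[]]
    splits = 0
    for key in keys:
        # Target page: the first page whose largest key is >= key,
        # otherwise the last page (the key is the largest seen so far).
        idx = len(pages) - 1
        for i, page in enumerate(pages):
            if page and key <= page[-1]:
                idx = i
                break
        page = pages[idx]
        page.append(key)
        page.sort()
        if len(page) > PAGE_CAPACITY:
            if idx == len(pages) - 1 and page[-1] == key:
                # Overflow at the very end: open a new page, move no old rows.
                pages.append([page.pop()])
            else:
                # Overflow in the middle: split the page, moving half its rows.
                mid = len(page) // 2
                pages[idx:idx + 1] = [page[:mid], page[mid:]]
                splits += 1
    return splits


sequential_keys = list(range(1, 501))   # auto-increment style
random_keys = sequential_keys[:]
random.Random(42).shuffle(random_keys)  # same keys, arbitrary arrival order

print(count_page_splits(sequential_keys))  # 0
print(count_page_splits(random_keys))      # a positive count
```

In this model, keys that arrive in increasing order only ever open a new page at the end, while the same keys in arbitrary order repeatedly overflow pages in the middle and force existing rows to move.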

 

Non-clustered index: The order in which the table data is stored has nothing to do with the index order. In a non-clustered index, each leaf entry contains the indexed column value and a logical pointer to the data row on its data page, and the number of leaf entries equals the number of rows in the table.

The leaf nodes of MyISAM's B+Tree do not hold the data itself; they hold the address where the data is stored. There is no structural difference between the primary index and a secondary index, except that the keys in the primary index must be unique.

 

A clustered index reorganizes the actual data on disk so that it is sorted by the values of one or more specified columns. Its defining property is that the storage order of the data matches the index order. In general, the primary key gets a clustered index by default, and a table can have only one clustered index.

 

This also explains why the storage engines in MySQL support clustered indexes differently. Below we look at the index structures of the MyISAM and InnoDB engines in MySQL.

If the original data is:




 

The data storage method of the MyISAM engine is shown in the figure:



 

MyISAM organizes its indexes by column value and row number: what a leaf node actually stores is a pointer to the physical location of the row. The physical files reflect this, because the index file (.MYI) and the data file (.MYD) of a MyISAM table are separate from each other.

InnoDB stores data as a clustered index, so its data layout is very different. It stores data roughly as follows:



 

Note: Each leaf node in the clustered index contains the primary key value, transaction ID, rollback pointer (for transactions and MVCC) and the remaining columns (such as col2).

Secondary indexes in InnoDB are very different from the primary key index. The leaves of an InnoDB secondary index contain primary key values rather than row pointers. This reduces the cost of maintaining secondary indexes when rows move or data pages split, because InnoDB does not need to update any row pointers in the index. The structure is roughly as follows:



 

Comparison of the primary key indexes and secondary indexes of InnoDB and MyISAM:



 

The leaf nodes of an InnoDB secondary index store the indexed column value plus the primary key value. A query through a secondary index therefore first finds the primary key value, and InnoDB then uses that value to find the row through the primary key index. The leaf nodes of a MyISAM secondary index store the column value together with the row number, that is, the physical address of the row. So in MyISAM there is no structural difference between the primary key index and a secondary index: the primary key index is just a unique, non-null index named PRIMARY, and a MyISAM table does not need a primary key at all.

 

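To illustrate the difference between the two kinds of index more concretely, imagine a table that stores 4 rows of data, as in the figure below, with Id as the primary index and Name as the secondary index. The figure shows the difference between a clustered index and a non-clustered index clearly.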

 


 

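With clustered storage, the row data is stored together with the primary key B+ tree, while a secondary key B+ tree stores only the secondary key and the primary key, so the primary key and secondary key B+ trees are almost two different kinds of tree. With non-clustered storage, the leaf nodes of the primary key B+ tree store a pointer to the actual data row, and that pointer is an address rather than a primary key value.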

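InnoDB uses a clustered index: the primary key is organized into a B+ tree and the row data is stored in the leaf nodes. A primary key lookup with a condition such as "where id = 14" follows the B+ tree search algorithm to the matching leaf node and reads the row data there. A search on the Name column takes two steps. First, search the secondary index B+ tree for Name and read the corresponding primary key from its leaf node. Second, use that primary key to run another B+ tree search in the primary index, which ends at the leaf node holding the full row.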
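The two lookup paths described above can be sketched as follows. This is a minimal illustration, not InnoDB code: plain dicts stand in for the two B+ trees, the rows are hypothetical sample data (the original figure is not available), and the names clustered_index, name_index, find_by_id, and find_by_name are made up for the example.

```python
# Clustered (primary) index: primary key -> full row, like an InnoDB leaf node.
clustered_index = {
    5: {"id": 5, "name": "Alice"},
    14: {"id": 14, "name": "Bob"},
    22: {"id": 22, "name": "Carol"},
    31: {"id": 31, "name": "Dave"},
}

# Secondary index on name: name -> primary key value (not a row address).
name_index = {row["name"]: pk for pk, row in clustered_index.items()}


def find_by_id(pk):
    # One search: the leaf of the primary key tree already holds the row.
    return clustered_index.get(pk)


def find_by_name(name):
    # Step 1: search the secondary index to obtain the primary key.
    pk = name_index.get(name)
    if pk is None:
        return None
    # Step 2: search the clustered index again with that primary key.
    return clustered_index[pk]


print(find_by_id(14))       # {'id': 14, 'name': 'Bob'}
print(find_by_name("Bob"))  # {'id': 14, 'name': 'Bob'}
```

Both calls return the same row, but the lookup by name needs two index searches where the lookup by id needs one.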

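MyISAM uses non-clustered indexes. Its two B+ trees look no different from each other: the node structure is identical and only the stored content differs, with the primary key index B+ tree storing primary keys and the secondary key index B+ tree storing secondary keys. The table data is stored separately, and the leaf nodes of both B+ trees use an address to point to the actual table data, so from the table data's point of view the two keys are no different. Because the index trees are independent, a lookup by secondary key does not need to visit the primary key index tree.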
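For comparison, a non-clustered layout of this kind can be sketched like this. It is a toy model with hypothetical sample rows: a list stands in for the separate data file, list positions stand in for row addresses, and the names myisam_data_file, myisam_primary_index, myisam_name_index, and fetch are made up for the example.

```python
# Data "file": rows kept in insertion order, addressed by position.
myisam_data_file = [
    {"id": 15, "name": "Alice"},
    {"id": 3, "name": "Bob"},
    {"id": 9, "name": "Carol"},
    {"id": 12, "name": "Dave"},
]

# Both indexes have the same shape: key -> row address.
myisam_primary_index = {row["id"]: addr for addr, row in enumerate(myisam_data_file)}
myisam_name_index = {row["name"]: addr for addr, row in enumerate(myisam_data_file)}


def fetch(index, key):
    # Any index resolves a key to an address, then reads the row directly;
    # a secondary lookup never touches the primary index.
    addr = index.get(key)
    return None if addr is None else myisam_data_file[addr]


print(myisam_name_index["Carol"])         # 2
print(fetch(myisam_primary_index, 9))     # {'id': 9, 'name': 'Carol'}
print(fetch(myisam_name_index, "Carol"))  # {'id': 9, 'name': 'Carol'}
```

The data file is not sorted by id, and the same fetch function serves both indexes, which mirrors the point that the two trees differ only in the keys they store.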


 



 

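Let us focus on the clustered index. It looks clearly less efficient than a non-clustered index, because every lookup through a secondary index requires two B+ tree searches. Isn't that redundant? What are the advantages of a clustered index?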

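1 Because the row data is stored in the leaf nodes, the primary key and the row data are loaded into memory together, and the row can be returned as soon as the leaf node is found. When the data is organized by the primary key Id, fetching by primary key is faster.

2 Using the primary key as the "pointer" in secondary indexes, rather than an address value, reduces the maintenance work on secondary indexes when rows move or data pages split. Storing primary key values makes secondary indexes take more space, but in exchange InnoDB does not need to update this "pointer" in the secondary indexes when it moves a row. In other words, a row's position (located in the implementation through 16K pages, covered later) changes as the data in the database is modified (through the B+ tree node splits and page splits mentioned above), and with a clustered index the secondary index trees are unaffected no matter how the nodes of the primary key B+ tree change.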
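The maintenance argument in point 2 can be sketched by moving rows and counting which secondary entries have to be rewritten. This is a toy model under stated assumptions: the (page, slot) addresses, the sample rows, and the names row_location, by_name_addr, by_name_pk, and split_page are all hypothetical.

```python
# Where each row currently lives, keyed by primary key: pk -> (page, slot).
row_location = {3: (0, 0), 9: (0, 1), 12: (0, 2), 15: (0, 3)}
names = {3: "Bob", 9: "Carol", 12: "Dave", 15: "Alice"}

# Design A: the secondary index stores the row address.
by_name_addr = {names[pk]: loc for pk, loc in row_location.items()}
# Design B (the approach described above): it stores the primary key.
by_name_pk = {names[pk]: pk for pk in row_location}


def split_page(moved_pks, new_page):
    """Move rows to a new page; return how many design A entries were rewritten."""
    rewritten = 0
    for slot, pk in enumerate(moved_pks):
        row_location[pk] = (new_page, slot)         # the primary tree changes
        by_name_addr[names[pk]] = (new_page, slot)  # design A must follow the row
        rewritten += 1
        # Design B needs no work here: the primary key did not change.
    return rewritten


before = dict(by_name_pk)
rewritten = split_page([12, 15], new_page=1)

print(rewritten)                            # 2
print(by_name_pk == before)                 # True
print(row_location[by_name_pk["Dave"]])     # (1, 0)
```

After the split, the address-based index needed one rewrite per moved row, while the primary-key-based index is unchanged and still resolves to the row's new location through the primary tree.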

