The data structures behind MySQL indexes

Table of Contents

1. Index

1.1. Data structure

1.1.1. Binary Tree

1.1.2. Red-Black Tree

1.1.3. Hash Tree

1.1.4. BTree

1.1.5. B+Tree

1.2. Index Engine

1.2.1. MyISAM (data and index are stored separately, so the index is non-clustered)

1.2.2. InnoDB (index and data are stored together, so the index is clustered)

1.3. Joint Index


 

1. Index

An index is a sorted data structure that helps MySQL obtain data efficiently.

1.1. Data structure

1.1.1. Binary Tree

A binary tree is an important kind of tree structure. Many practical problems can be modeled as binary trees, any general tree can be converted into one, and its storage layout and algorithms are relatively simple. Its defining property is that each node has at most two subtrees, a left one and a right one. When a binary tree is used for searching, smaller keys go into the left subtree and larger keys into the right subtree.

Let's start from a concrete problem.

Without an index, the SQL query looks like this:

select *  from table where Col2 = 23

Without an index, this simple query has to read the rows one by one from the top, which costs 7 disk I/O operations before the matching row is found. That is clearly wasteful, so a binary tree is introduced to solve the problem.

When Col2 is stored in a binary tree, its values are arranged in search order. Looking up 23 then takes only 4 disk I/O operations to reach the corresponding data.

This solves the full table scan problem we started with, but the binary tree brings new problems of its own.
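The comparison above can be sketched in a few lines of Python. This is only an illustration: the Col2 values below are hypothetical (the original figure is not reproduced here), each node visit stands in for one disk I/O, and the names `scan_steps` and `tree_steps` are made up for this sketch.

```python
class Node:
    """One node of a plain (unbalanced) binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key below root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def scan_steps(rows, target):
    """Rows read by a top-to-bottom scan until target is found."""
    for steps, value in enumerate(rows, start=1):
        if value == target:
            return steps
    return len(rows)

def tree_steps(root, target):
    """Nodes visited by a binary search tree lookup for target."""
    steps = 0
    node = root
    while node is not None:
        steps += 1
        if target == node.key:
            return steps
        node = node.left if target < node.key else node.right
    return steps

# Hypothetical Col2 values, in table order.
col2 = [34, 77, 5, 91, 22, 89, 23]
root = None
for value in col2:
    root = insert(root, value)

print(scan_steps(col2, 23), tree_steps(root, 23))
```

With these assumed values the scan reads 7 rows while the tree lookup visits 4 nodes, the same counts quoted above.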

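1. By definition, each node of a binary tree has at most two subtrees, so the amount of data each level can hold is limited. With the large data volumes MySQL has to handle, this means too many leaf nodes and too many levels, which in turn produces a lot of I/O.
2. In special cases the binary tree breaks down (as shown in the figure below).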

 

When the keys are an ordered, auto-incrementing sequence, the binary tree takes the following shape.

If Col1 is used as the index, every new key is larger than the previous one, so each node only ever gets a right child and the tree degenerates into a linked list, as shown in the figure above. In that case a plain binary tree gives almost no benefit over a full scan.
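A minimal sketch of this degeneration, assuming a plain binary search tree without any rebalancing; `build` and `height` are hypothetical helper names used only here.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def build(keys):
    """Insert keys one by one into an unbalanced binary search tree."""
    root = None
    for key in keys:
        if root is None:
            root = Node(key)
            continue
        node = root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = Node(key)
                    break
                node = node.left
            else:
                if node.right is None:
                    node.right = Node(key)
                    break
                node = node.right
    return root

def height(root):
    """Number of levels in the tree, counted level by level."""
    level = 0
    frontier = [root] if root is not None else []
    while frontier:
        level += 1
        frontier = [child
                    for node in frontier
                    for child in (node.left, node.right)
                    if child is not None]
    return level

sequential = build(range(1, 8))          # auto-increment style keys 1..7
mixed = build([4, 2, 6, 1, 3, 5, 7])     # same keys, different insert order

print(height(sequential), height(mixed))
```

The same seven keys need 7 levels when inserted in increasing order but only 3 levels when the insert order happens to keep the tree balanced.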

1.1.2. Red-Black Tree

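A red-black tree is a self-balancing binary search tree. A typical use is implementing associative arrays.

Like the AVL tree (the strictly balanced binary tree), the red-black tree keeps a binary search tree balanced through specific operations on insertion and deletion, and so achieves better search performance. It is not a kind of AVL tree, though: its balance rules are looser.

The difference between the two is that the red-black tree is relatively stable when its structure changes.
The AVL tree keeps a stricter balance than the red-black tree, so AVL lookups are somewhat faster, but it needs more rotations to stay balanced, which raises the cost of inserts, updates, and deletes.
This is why the tree used inside HashMap is a red-black tree rather than an AVL tree.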

First, look at the index on Col2 built as a red-black tree.

Then look at how the index on Col1 is arranged.

The red-black tree does solve the auto-increment problem of the plain binary tree, because it rebalances itself through rotations. It still does not solve the problem of large data volumes: each node holds one key, so there are many nodes, many levels, and therefore many I/O operations.
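To get a feel for "many levels", here is a rough calculation. It assumes one key per node and a perfectly balanced binary tree, which is the best case; a red-black tree may be somewhat taller. The function name is hypothetical.

```python
def min_levels_binary(n):
    """Fewest levels a binary tree needs to hold n keys (one key per node).

    A tree with k full levels holds 2**k - 1 keys, so the answer is
    floor(log2(n)) + 1, which equals n.bit_length() for n >= 1.
    """
    return n.bit_length()

for n in (7, 1_000_000, 20_000_000):
    print(n, min_levels_binary(n))
```

Even in the best case, 20 million keys need 25 levels, and each level visited during a lookup may cost one disk read.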

1.1.3. Hash Tree

A hash tree (or hash trie) is a persistent data structure that can be used to implement sets and maps, intended as a replacement for hash tables in purely functional programming. In its basic form, the hash tree stores the hash value of each key (treated as a bit string) in a trie, and the actual key and (optional) value are stored in the trie's "final" node.

A hash-based structure meets the basic goal: given the hash value of a key, it jumps straight to the position of the data.

MySQL allows hash to be chosen as the index method, yet despite its fast lookups it is rarely used for real indexes. Why?

It does solve the slow lookup problem for equality conditions, but it is helpless for range conditions. Hash values have no order, so a range query can only fall back to a full table scan that compares rows one by one, which is exactly what we wanted to avoid.
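The trade-off can be illustrated with a Python dict standing in for a hash index. This is a sketch, not how MySQL implements it; the rows and the helper names `find_equal` and `find_range` are hypothetical.

```python
# Hypothetical (Col1, Col2) rows.
rows = [(1, 34), (2, 77), (3, 5), (4, 91), (5, 22), (6, 89), (7, 23)]

# Hash index on Col2: value -> positions of the matching rows.
hash_index = {}
for position, (_, col2) in enumerate(rows):
    hash_index.setdefault(col2, []).append(position)

def find_equal(value):
    """Equality lookup: one hash probe, no scanning."""
    return [rows[p] for p in hash_index.get(value, [])]

def find_range(low, high):
    """Range lookup: the hash gives no ordering, so every row is checked."""
    return [row for row in rows if low <= row[1] <= high]

print(find_equal(23))
print(find_range(20, 40))
```

`find_equal` touches only the matching bucket, while `find_range` has to walk all of `rows` because neighbouring values end up in unrelated buckets.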

1.1.4. BTree

Before defining the BTree, look at its schematic diagram:

Features of BTree

All leaf nodes have the same depth, and the pointers of leaf nodes are null
Index elements are not duplicated
The index entries inside a node are arranged in increasing order from left to right

In MySQL (InnoDB), the default size of each node (page) is 16 KB.

The BTree handles our situation well, but note that every index entry carries its row data with it. When the table has few columns this is acceptable, but when the table has many columns, each 16 KB node can hold only a small number of index entries, so the result is still not satisfactory.
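A back-of-the-envelope sketch of that effect. The 16 KB node size comes from the text above; the 8-byte key and 6-byte child pointer are assumptions chosen for illustration, as is the function name.

```python
PAGE_SIZE = 16 * 1024   # node size from the text above
KEY_BYTES = 8           # assumption: a BIGINT key
POINTER_BYTES = 6       # assumption: size of a child pointer

def entries_per_node(row_bytes):
    """How many index entries fit in one node if each entry also
    carries row_bytes of row data (0 means keys and pointers only)."""
    return PAGE_SIZE // (KEY_BYTES + POINTER_BYTES + row_bytes)

for row_bytes in (0, 100, 1024):
    print(row_bytes, entries_per_node(row_bytes))
```

Under these assumptions a node holds 1170 entries with no row data, 143 entries with 100-byte rows, and only 15 entries with 1 KB rows. Fewer entries per node means a taller tree and more I/O.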

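1.1.5. B+Tree

Before explaining the B+Tree, look at its schematic diagram, shown below.

The characteristics of the B+Tree are as follows:

Non-leaf nodes store no data, only (redundant) index keys, so each node can hold more index entries
Leaf nodes contain all the index fields
Leaf nodes are linked by pointers, which improves the performance of range access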
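The value of linked leaves for range access can be sketched as follows. This is a simplified model, not InnoDB's implementation: the leaf level is a list of `Leaf` objects chained through `next`, and a sorted list of each leaf's first key stands in for the non-leaf levels. All names are hypothetical.

```python
from bisect import bisect_right

class Leaf:
    """A leaf node: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def build_leaves(sorted_keys, per_leaf):
    """Split sorted keys into leaves and chain them left to right."""
    leaves = [Leaf(sorted_keys[i:i + per_leaf])
              for i in range(0, len(sorted_keys), per_leaf)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_scan(leaves, low, high):
    """Return all keys in [low, high] by locating the first leaf once,
    then following the next pointers."""
    if not leaves:
        return []
    separators = [leaf.keys[0] for leaf in leaves]   # stand-in for upper levels
    start = max(bisect_right(separators, low) - 1, 0)
    leaf = leaves[start]
    result = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result

leaves = build_leaves(list(range(1, 21)), per_leaf=4)
print(range_scan(leaves, 6, 11))
```

The scan descends once to the leaf that may contain 6 and then walks sideways until it passes 11, without going back up the tree.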

Does B+Tree solve the actual problem?

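First: non-leaf nodes no longer carry row data, so each node holds far more index entries. By a rough estimate, a tree of just three levels can hold around 20 million rows (this addresses the depth problem of the binary tree, the red-black tree, and the BTree).
Second: it solves the binary tree's problem with continuously increasing keys. In fact, the B+Tree works best with increasing integer keys, because new entries are simply appended and the tree is assembled cleanly. When keys are not increasing, entries land in the middle of existing nodes, which can force structural changes, and that is a significant cost for the database (this addresses the binary tree problem).
Third: the leaf nodes at the bottom level are linked to one another in a chain, which solves the range query problem of the hash structure (this addresses the hash tree problem).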
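The "around 20 million rows in three levels" estimate can be reproduced with simple arithmetic. The 16 KB page size is from the text; the 8-byte key, 6-byte pointer, and 1 KB average row size are assumptions for this sketch, and `bplus_capacity` is a hypothetical name.

```python
PAGE_SIZE = 16 * 1024   # page size from the text
KEY_BYTES = 8           # assumption: BIGINT primary key
POINTER_BYTES = 6       # assumption: child page pointer
ROW_BYTES = 1024        # assumption: average row size of 1 KB

def bplus_capacity(levels):
    """Rows a B+Tree with the given number of levels can hold when
    non-leaf pages store only keys and pointers and leaves store rows."""
    fan_out = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)
    rows_per_leaf = PAGE_SIZE // ROW_BYTES
    return fan_out ** (levels - 1) * rows_per_leaf

for levels in (1, 2, 3):
    print(levels, bplus_capacity(levels))
```

Under these assumptions three levels give 1170 × 1170 × 16 = 21,902,400 rows, which is in line with the estimate above. Different key or row sizes change the figure.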

1.2. Index Engine

MySQL has two mainstream storage engines, and they organize indexes differently: MyISAM and InnoDB.

1.2.1. MyISAM (data and index are stored separately, so the index is non-clustered)

First look at how MyISAM stores its data on disk.

 

My User table is defined as a MyISAM table, so there are three files for it in the data directory:

*.frm stores the table structure
*.MYD stores the table data
*.MYI stores the table indexes

The actual query logic is shown in the figure below:

First the index in the MYI file is searched to find where the row is stored; the index entry holds the row's location (for example 0x07).

Then that location is used to read the MYD file and fetch the row content directly.
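The two-step lookup can be sketched like this. It is a toy model with in-memory stand-ins for the two files; the rows, the addresses other than 0x07, and the function name are hypothetical.

```python
from bisect import bisect_left

# Stand-in for the .MYD file: address -> row (hypothetical data).
myd_rows = {
    0x07: (15, "Alice"),
    0x56: (18, "Bob"),
    0x6A: (20, "Carol"),
}

# Stand-in for the .MYI file: (key, address) pairs kept sorted by key.
myi_index = [(15, 0x07), (18, 0x56), (20, 0x6A)]

def myisam_lookup(key):
    """Step 1: search the index for the row address.
    Step 2: use the address to read the row from the data file."""
    keys = [k for k, _ in myi_index]
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    address = myi_index[i][1]
    return myd_rows[address]

print(myisam_lookup(15))
```

The index holds only keys and addresses, and the row itself lives elsewhere, which is what makes the index non-clustered.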

1.2.2. InnoDB (index and data are stored together, so the index is clustered)

Now look at how InnoDB stores its data.

The table sys_uesr_role is defined as an InnoDB table, so there are two files for it in the data directory:

*.frm stores the table structure
*.ibd stores both the indexes and the data

The actual implementation logic is shown in the figure below:

Primary key index:

Non-primary key index:

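Questions:

Why must an InnoDB table have a primary key, and why is an auto-increment integer primary key recommended?
Answer: Without a primary key, the system falls back to a default row ID as the key that orders the data. That default is not something you control, and keys that do not arrive in increasing order can change the structure of the index tree on insert. This is also why an auto-increment integer is recommended as the primary key: new rows are always appended at the end of the tree.
Why do the leaf nodes of a non-primary-key index store the primary key value? (consistency and saving storage space)
Answer: So that only one copy of the row data is maintained. If non-primary-key indexes also stored the row data, the size of the ibd file would multiply with every index, and every copy would have to be kept consistent within transactions, which adds a lot of pointless overhead.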
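A toy model of the two index types, using dicts in place of B+Trees. The table contents, column names, and function name are hypothetical; only the lookup path is the point.

```python
# Clustered (primary key) index: the leaf holds the whole row.
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 41},
}

# Non-primary-key index on name: the leaf holds only the primary key.
secondary_name = {"alice": 1, "bob": 2, "carol": 3}

def find_by_name(name):
    """Look up the primary key in the secondary index, then go back
    to the clustered index to fetch the single stored copy of the row."""
    pk = secondary_name.get(name)
    if pk is None:
        return None
    return clustered[pk]

print(find_by_name("bob"))
```

Because the row exists only in `clustered`, an update changes one place, and each extra index costs only a key plus a primary key value.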

1.3. Joint Index

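A joint index (an index over several columns) follows the leftmost-prefix principle. Its entries are sorted by the first column, then by the second column within equal values of the first, and so on, so the index can only be used efficiently when the query condition includes the leftmost column.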
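The reason follows from how sorted tuples behave, as this sketch shows. The two-column entries and helper names are hypothetical, and a sorted Python list stands in for the index.

```python
from bisect import bisect_left, bisect_right

# Joint index on (a, b): entries sorted by a, then by b.
entries = sorted([(2, "b"), (1, "c"), (1, "a"), (3, "a"), (2, "a")])

firsts = [a for a, _ in entries]    # sorted, so binary search works
seconds = [b for _, b in entries]   # not sorted on its own

def by_first(a):
    """Condition on the leftmost column: binary search finds the range."""
    return entries[bisect_left(firsts, a):bisect_right(firsts, a)]

def by_second(b):
    """Condition on the second column only: every entry must be checked."""
    return [entry for entry in entries if entry[1] == b]

print(by_first(2))
print(by_second("a"))
print(seconds == sorted(seconds))
```

The second column is ordered only inside each group of equal first-column values, so searching on it alone cannot use the ordering of the index.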

Origin blog.csdn.net/baidu_31572291/article/details/115163734