Indexes in MySQL

1. Groundwork

    Before introducing the B-tree, let's look at another handy tree: the binary sort tree (better known as the binary search tree). First of all, it is a tree. "Binary" means that every node forks into at most two branches, and the same holds recursively for each subtree (as shown in the figure below). On top of that, the nodes of the tree are kept in sorted order. The specific sorting rules are as follows:

  • If the left subtree is not empty, the values of all nodes in the left subtree are less than the value of the root node
  • If the right subtree is not empty, the values of all nodes in the right subtree are greater than the value of the root node
  • The left and right subtrees are themselves binary sort trees (recursive definition)
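
To make the three rules concrete, here is a minimal Python sketch of a binary sort tree. The names `Node`, `insert`, `contains`, and `in_order` are made up for this illustration, and duplicate keys are simply ignored, which is an assumption rather than part of the definition above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)      # smaller keys go left
    elif key > root.key:
        root.right = insert(root.right, key)    # larger keys go right
    return root


def contains(root: Optional[Node], key: int) -> bool:
    """Walk down from the root, discarding one subtree at every step."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root: Optional[Node]) -> list:
    """Left subtree, node, right subtree: yields the keys in ascending order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)

print(in_order(root))       # [20, 30, 40, 50, 60, 70, 80]
print(contains(root, 60))   # True
print(contains(root, 65))   # False
```

The in-order walk coming out sorted is just the three rules applied recursively.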

    As the figure shows, organizing data as a binary sort tree makes searching convenient, because each node visited can cut the remaining candidates by up to half. In the extreme case, however, all nodes end up on the same side and the tree is, intuitively, a straight line. A lookup then has to walk the nodes one by one, which is comparatively slow. To avoid this, the heights of the left and right subtrees need to be kept in balance, and that gives us the balanced binary tree.

   "Balanced" means that the branches of the tree have roughly uniform height: for every node, the absolute value of the difference between the heights of its left and right subtrees is at most 1, so no branch grows particularly long. When searching such a balanced tree, the number of node comparisons never exceeds the height of the tree, which guarantees query efficiency (time complexity O(log n)).
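
The difference between the "straight line" case and a balanced tree is easy to see in a small experiment. The sketch below uses plain, unbalanced insertion and only measures the result; it does not implement any rebalancing scheme. The helper names (`build`, `height`, `is_balanced`) are assumptions made for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Plain (unbalanced) binary sort tree insertion, done iteratively."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        elif key > cur.key:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
        else:
            break  # duplicate key: nothing to do
    return root


def height(node):
    """Number of nodes on the longest path down to a leaf (empty tree = 0)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node):
    """True if, at every node, the two subtree heights differ by at most 1."""
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return is_balanced(node.left) and is_balanced(node.right)


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


chain = build([1, 2, 3, 4, 5, 6, 7])   # sorted input: every node goes to the right
bushy = build([4, 2, 6, 1, 3, 5, 7])   # same keys, friendlier insertion order

print(height(chain), is_balanced(chain))   # 7 False
print(height(bushy), is_balanced(bushy))   # 3 True
```

Same seven keys, but the worst-case search walks 7 nodes in one tree and 3 in the other.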

2. The origin of B-tree

     B-trees were first proposed by the German computer scientist Rudolf Bayer, together with Edward M. McCreight, in the paper "Organization and Maintenance of Large Ordered Indexes" in 1972. I went and read the original text, though, and found that the authors never explain why it is called a B-tree, so simply reading the B as "Balanced" or "Binary" is not particularly rigorous. Maybe the author named it after the initial of his own surname, Bayer...

3. What does a B-tree look like?

    It is clearer to look directly at the picture. As shown there, a B-tree is a balanced multi-way search tree: each node can fork into at most m branches (m >= 2), and we call such a tree a B-tree of order m. To show this blog's conscientiousness, and unlike other places where all you get to see is an order-2 B-tree, an order-5 B-tree has been specially drawn here.

In general, a B-tree of order m satisfies the following conditions:

  • Each node has at most m subtrees
  • The root node has at least 2 subtrees (except in the extreme case where the tree consists of the root alone; like a single-celled organism, it is root, leaf, and tree all at once)
  • Every node that is neither the root nor a leaf has at least Ceil(m/2) subtrees (Ceil means rounding up; in the order-5 B-tree in the figure, each such node has at least 3 subtrees, that is, at least 3 forks)
  • A non-leaf node holds the information [n, A0, K1, A1, K2, A2, …, Kn, An], where n is the number of keywords stored in the node, the Ki are the keywords with Ki < Ki+1, and the Ai are pointers to the root nodes of the subtrees
  • Every path from the root to a leaf has the same length, that is, all leaf nodes sit on the same level. These leaf nodes carry no information; they simply indicate that the requested value was not found, so the pointers leading to them are null
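
As a quick sanity check on these conditions, the limits for a given order can be computed directly. The function name `btree_bounds` and the returned dictionary keys are invented for this sketch; the keyword limits follow from the node layout [n, A0, K1, A1, …, Kn, An], where n keywords go with n + 1 subtree pointers.

```python
import math


def btree_bounds(m: int) -> dict:
    """Subtree and keyword limits for a non-root, non-leaf node of an order-m B-tree."""
    if m < 2:
        raise ValueError("order m must be at least 2")
    min_children = math.ceil(m / 2)
    return {
        "max_children": m,
        "min_children": min_children,
        "max_keys": m - 1,             # n keywords go with n + 1 subtrees
        "min_keys": min_children - 1,
    }


print(btree_bounds(5))
# {'max_children': 5, 'min_children': 3, 'max_keys': 4, 'min_keys': 2}
```

For the order-5 tree in the figure this gives 3 to 5 forks and 2 to 4 keywords per inner node.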

    Searching a B-tree is similar to searching a binary sort tree. Starting from the root, each node on the way down is examined in turn. Because the keywords inside each node, and the subtrees to their left and right, are all ordered, it is enough to compare against the keywords in the node: either the keyword is found there, or the search follows the matching pointer into a subtree. If the search fails, it ends at a leaf node, that is, a null pointer.

For example, to look up the letter K in the figure:

  1. Start from the root node P. K comes before P, so follow the left pointer
  2. In the left subtree, compare against C, F, J, and M in turn and find that K lies between J and M
  3. Follow the pointer between J and M into that subtree and compare in turn; its first keyword, K, is the value being searched for
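
The three steps above can be sketched in Python. Since the figure itself is not reproduced here, the tree below is a hypothetical order-5 B-tree built only to match the steps (root P, a left child holding C, F, J, M, and a bottom node starting with K); every other node, and the `BNode` and `search` names, are assumptions. Bottom-level nodes have an empty `children` list, which plays the role of the null pointers described earlier.

```python
from bisect import bisect_left


class BNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keywords K1..Kn
        self.children = children or []   # subtrees A0..An; empty on the bottom level


def search(node, key, path=None):
    """Return (found, path), where path lists the keyword groups visited."""
    path = [] if path is None else path
    path.append("".join(node.keys))
    i = bisect_left(node.keys, key)      # first position with keys[i] >= key
    if i < len(node.keys) and node.keys[i] == key:
        return True, path
    if not node.children:                # nothing below: the search fails
        return False, path
    return search(node.children[i], key, path)


def bottom(letters):
    return BNode(list(letters))


# Hypothetical tree; the letter I is left out on purpose to show a failed search.
left = BNode(list("CFJM"),
             [bottom("AB"), bottom("DE"), bottom("GH"), bottom("KL"), bottom("NO")])
right = BNode(list("TX"), [bottom("QRS"), bottom("UVW"), bottom("YZ")])
root = BNode(["P"], [left, right])

print(search(root, "K"))   # (True, ['P', 'CFJM', 'KL'])
print(search(root, "I"))   # (False, ['P', 'CFJM', 'GH'])
```

Each node is binary-searched with `bisect_left` rather than compared keyword by keyword; the path taken through the tree is the same either way.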

4. The Plus version: B+ tree

As an enhanced version of the B-tree, the B+ tree differs from it in the following ways:

  • A node with n subtrees contains n keywords (some definitions use n-1 keywords instead)
  • The leaf nodes together contain all the keywords, along with pointers to the records holding those keywords, and the leaves are linked to one another in ascending keyword order
  • Non-leaf nodes can be regarded as the index part; each one contains only the largest (or smallest) keyword of each of its subtrees (that is, of each subtree's root node)

    Searching a B+ tree is similar to searching a B-tree, except that when a keyword in a non-leaf node equals the given value, the search does not stop there; it keeps following pointers until it reaches a leaf node. Therefore, in a B+ tree, whether or not the search succeeds, every lookup walks a full path from the root to a leaf.
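
Here is a rough Python sketch of that behaviour, using the variant in which each non-leaf node stores the largest keyword of each of its subtrees. The classes `Leaf` and `Internal`, the sample keys, and the record strings such as "r20" are all invented for illustration. The sketch also shows what the linked leaves are good for: a range scan that never has to go back up the tree.

```python
from bisect import bisect_left


class Leaf:
    def __init__(self, keys, records):
        self.keys = keys         # sorted keywords
        self.records = records   # records[i] stands in for the pointer for keys[i]
        self.next = None         # link to the next leaf in keyword order


class Internal:
    def __init__(self, children):
        self.children = children
        # One keyword per subtree: the largest keyword found below it.
        self.keys = [child.keys[-1] for child in children]


def find_leaf(node, key):
    """Descend to the leaf that would hold key; returns (leaf, levels_visited)."""
    levels = 1
    while isinstance(node, Internal):
        i = bisect_left(node.keys, key)       # first subtree whose maximum is >= key
        i = min(i, len(node.children) - 1)    # key above every maximum: last subtree
        node = node.children[i]
        levels += 1
    return node, levels


def lookup(root, key):
    """Always goes down to a leaf, even if key matched a keyword on the way."""
    leaf, levels = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.records[i], levels
    return None, levels


def range_scan(root, low, high):
    """Collect the keywords in [low, high] by walking the linked leaves."""
    leaf, _ = find_leaf(root, low)
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return out
            if k >= low:
                out.append(k)
        leaf = leaf.next
    return out


leaves = [Leaf([5, 10], ["r5", "r10"]), Leaf([15, 20], ["r15", "r20"]),
          Leaf([25, 30], ["r25", "r30"]), Leaf([35, 40], ["r35", "r40"])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Internal([Internal(leaves[:2]), Internal(leaves[2:])])

print(lookup(root, 20))           # ('r20', 3)
print(lookup(root, 22))           # (None, 3)
print(range_scan(root, 12, 30))   # [15, 20, 25, 30]
```

Note that 20 also appears as a keyword in the root and in an inner node, yet the lookup still visits all 3 levels, the same as the failed lookup for 22.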

5. How MySQL uses B-trees

Note: in MySQL, many storage engines actually use B+ trees, even though the index type is named BTREE.

1) InnoDB's index mechanism

Let's take the InnoDB storage engine first and see how it uses a B+ tree to build indexes. Start by creating a table named zodiac and inserting some data.


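   For InnoDB there is only one data file, and that data file is itself organized as a B+ tree. The keyword in each node of the B+ tree is the table's primary key, so the InnoDB data file is itself the primary index file. As shown in the figure below, the leaf pages of the primary index contain the data records, while the non-leaf nodes contain only primary keys. The term "clustered" means that the data rows are stored compactly together with their adjacent key values, which is why this kind of index is called a clustered index.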

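    This way of indexing can speed up data access: because the index and the data are kept in the same B+ tree, fetching data through a clustered index is usually faster than going through a non-clustered index.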

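   So you could say that an InnoDB data file is organized around the primary key, and this is why every table created under InnoDB needs one. If no primary key is specified explicitly, InnoDB uses the first UNIQUE index whose columns are all NOT NULL, and failing that it implicitly generates a hidden key to serve as the clustered index.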

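    InnoDB's secondary indexes work along the same lines, as shown in the figure below. Suppose the characters are ordered by their position in the zodiac (I don't actually know how that would be implemented, so don't mind the details; it is just an example). The leaf nodes of a secondary index also contain the primary key of each record, so a query through a secondary index takes two lookups in InnoDB: first the secondary index to get the primary key value, then the primary index. A little long-winded...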
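
The two-step lookup can be mimicked with two Python dictionaries. This is only a sketch of the idea: dictionaries stand in for the B+ trees, and the rows (an `id` and a `name` column with a few zodiac animals) are hypothetical, since the original table definition is not shown here.

```python
# Clustered (primary) index: primary key -> full row, stored together.
clustered = {
    1: {"id": 1, "name": "rat"},
    2: {"id": 2, "name": "ox"},
    3: {"id": 3, "name": "tiger"},
    4: {"id": 4, "name": "rabbit"},
}

# Secondary index on name: indexed value -> primary key (not a row address).
secondary_by_name = {row["name"]: pk for pk, row in clustered.items()}


def find_by_name(name):
    """Two lookups: the secondary index gives the primary key, the clustered index gives the row."""
    steps = []
    pk = secondary_by_name.get(name)
    steps.append(("secondary", name, pk))
    if pk is None:
        return None, steps
    row = clustered[pk]
    steps.append(("clustered", pk, row))
    return row, steps


row, steps = find_by_name("tiger")
print(row)          # {'id': 3, 'name': 'tiger'}
print(len(steps))   # 2
```

A lookup by `id` would touch only `clustered`, which is the one-step case.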

2) MyISAM's index mechanism

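    The MyISAM engine also organizes its indexes as B+ trees. As shown in the figure below, suppose our data is not inserted in the earlier order but in the order shown in the figure. You can see that under MyISAM the leaf nodes of the B+ tree contain the addresses of the data records (which you can loosely think of as "row numbers"). MyISAM's secondary indexes are structurally no different from its primary index: their leaf nodes also contain the addresses of the data records. The one small difference is that the keywords of a secondary index are allowed to repeat.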
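
For comparison, here is the same kind of toy model for MyISAM. Again dictionaries stand in for the B+ trees, a Python list stands in for the data file, the list position plays the "row number", and the rows themselves are hypothetical.

```python
# Data file: rows in insertion order; the list position is the "address".
data_file = [
    {"id": 3, "name": "tiger"},
    {"id": 1, "name": "rat"},
    {"id": 4, "name": "rabbit"},
    {"id": 2, "name": "ox"},
    {"id": 5, "name": "tiger"},   # repeated name: fine for a secondary index
]

# Primary index: unique keyword -> address of the row.
primary_index = {row["id"]: addr for addr, row in enumerate(data_file)}

# Secondary index: keyword -> list of addresses, because keywords may repeat.
secondary_index = {}
for addr, row in enumerate(data_file):
    secondary_index.setdefault(row["name"], []).append(addr)


def find_by_id(pk):
    addr = primary_index.get(pk)
    return None if addr is None else data_file[addr]


def find_by_name(name):
    return [data_file[addr] for addr in secondary_index.get(name, [])]


print(primary_index[1])                            # 1
print(find_by_id(4))                               # {'id': 4, 'name': 'rabbit'}
print([r["id"] for r in find_by_name("tiger")])    # [3, 5]
```

Both indexes point straight at the data file, so neither lookup needs to go through the other index.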

6. A brief comparison

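  1. The leaf nodes of an InnoDB secondary index store primary key values rather than addresses. This strategy cuts down the maintenance work on secondary indexes when rows move or data pages split. Using the primary key value as the pointer does make secondary indexes take up more space, but the benefit is that InnoDB does not need to update the primary key values in its secondary indexes when it moves a row, whereas MyISAM has to adjust the addresses in its leaf nodes.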

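  2. Under InnoDB, data records are stored in the leaf nodes of the B+ tree (each about the size of a page on disk). When new data is inserted and the primary key values are ordered, each record is simply stored right after the previous one. If the primary key uses unordered values such as UUIDs, however, InnoDB cannot just append the new row at the end; it has to find a suitable position for it, which is extra work. This is one of the reasons InnoDB's write performance is slightly worse than MyISAM's.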
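
The effect of key order on inserts can be imitated with a single sorted Python list. This is a deliberately crude model (no pages, no splits, nothing InnoDB-specific), and the function name `count_mid_inserts` is made up; it only counts how often a new key cannot simply be appended at the end. Shuffled integers stand in for UUID-like unordered keys.

```python
import bisect
import random


def count_mid_inserts(keys):
    """Insert keys one by one into a sorted list; count inserts that do not land at the end."""
    ordered = []
    mid_inserts = 0
    for k in keys:
        pos = bisect.bisect_right(ordered, k)   # where k belongs in sorted order
        if pos != len(ordered):
            mid_inserts += 1                    # existing entries must make room
        ordered.insert(pos, k)
    return mid_inserts


sequential = list(range(1, 1001))   # auto-increment style keys
rng = random.Random(42)             # fixed seed so the run is repeatable
shuffled = sequential[:]
rng.shuffle(shuffled)               # stand-in for unordered keys such as UUIDs

print(count_mid_inserts(sequential))   # 0
print(count_mid_inserts(shuffled))     # far more than 0
```

With ordered keys every insert is an append; with unordered keys almost every insert has to find a position somewhere in the middle.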

Abstract diagram of InnoDB and MyISAM indexes

 

