Why MySQL indexes are implemented with B+ trees

I. Introduction

  I have been studying MySQL lately, and indexes are one of its more important topics. Anyone who knows a little about MySQL indexes should know that the B+ tree is the main data structure MySQL uses to implement them. This post briefly introduces the B-tree and the B+ tree, and explains why MySQL indexes are implemented with this data structure.


II. Main text

2.1 B-tree

  I will not go into the details of B-tree operations here. This section mainly introduces the structure of the B-tree so that we have a general understanding of it.

  First, a common mistake needs correcting. Many online articles call the B-tree a "B- (minus) tree", but that name is wrong. It arose because the English name is "B-Tree", and a translation error gradually spread. The "-" here is a hyphen, not a minus sign. Because the B+ tree exists, a "B minus tree" sounds plausible, but the full English spelling of the B+ tree is "B+-Tree", where "+" means plus and "-" is still a hyphen. Some people even claim that the B-tree is a binary search tree (BST), which is even further off the mark.

  A B-tree is a balanced multiway search tree. Many readers will know the binary search tree; a B-tree is similar, except that each node can have more than two children. The maximum number of children a node may have is called the order of the tree, usually written m. Setting aside its maintenance operations, a B-tree can be understood simply as an m-way search tree.

  To restate the definition: the order of a tree is the maximum number of children any node is allowed to have, usually denoted by the letter m. For example, if a B-tree has order 5, each of its nodes can have at most 5 children. Note also that the order of a B-tree must satisfy m > 2.

  Let us look at the structure of a B-tree directly through a diagram:

  The figure above shows a standard B-tree. Each node can hold several values (the values a node records are the red numbers in the figure) and several pointers to child nodes, where number of values = number of pointers - 1 <= m - 1. For example, the root in the figure holds one value, 50 (called a key in the figure), and the left child of the root holds two values, 10 and 30. Because the root has only one value, it has two child pointers, and as the figure shows, these two pointers sit on either side of the value. As said earlier, a B-tree can be treated as an m-way search tree, so every value in the left subtree of the root is less than the root's value 50, and every value in the right subtree is greater than 50.

  The root has only one value and two children, much like a binary tree, so it is not a typical example. Take the left child of the root instead. It holds two values, 10 and 30, and has three pointers to child nodes. The values within a node are sorted, here in ascending order, so 10 is to the left of 30. The subtree under the pointer to the left of 10 contains only values less than 10; the subtree under the pointer between 10 and 30 contains only values between 10 and 30; and the subtree under the pointer to the right of 30 contains only values greater than 30.

  Now an example of searching a B-tree. Suppose we want to find the value 35 in the figure above. The search goes through the following steps:

  1. Compare 35 with the values in the root. The root holds only 50, and 35 < 50, so the search moves to the left child of the root;
  2. The first value in the left child of the root is 10, and 35 > 10, so move on to the next value, 30. Since 35 > 30, we would move on again, but there is no next value, so follow the pointer to the right of 30 down to its child, which is the third leaf node;
  3. In that leaf there is only one value, 35, so the search succeeds.

  This is the process of finding a value in a B-tree.
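
  The search steps above can be sketched in Python. This is only a minimal illustration, not how MySQL implements it: the class name BTreeNode and the function search are made up for this post, and apart from 50, 10, 30 and 35 the values in the tree are hypothetical stand-ins for the missing parts of the figure.

```python
class BTreeNode:
    """A B-tree node: sorted values plus child pointers (empty for a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(node, target):
    """Return (found, nodes_visited) for a lookup of target."""
    visited = 0
    while node is not None:
        visited += 1
        # Scan the values of the current node from left to right.
        i = 0
        while i < len(node.keys) and target > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == target:
            return True, visited
        if not node.children:
            return False, visited  # reached a leaf without a match
        # Follow the pointer that sits just left of keys[i]
        # (or the rightmost pointer when every value was smaller).
        node = node.children[i]
    return False, visited


# Root 50, left child (10, 30), and 35 in the third leaf, as in the text.
# The remaining values (5, 20, 70, 60, 80) are hypothetical.
root = BTreeNode([50], [
    BTreeNode([10, 30], [BTreeNode([5]), BTreeNode([20]), BTreeNode([35])]),
    BTreeNode([70], [BTreeNode([60]), BTreeNode([80])]),
])

print(search(root, 35))  # (True, 3)
```

  Each node visited corresponds to one step in the list above: the root, its left child, then the third leaf.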


2.2 Why do we need the B-tree?

  Each node of a B-tree stores more than one value, and how many depends on the order of the tree. While searching, we have to scan the values in the current node to decide whether the value we want is there. This means that, compared with a binary search tree, a balanced binary tree or a red-black tree, a B-tree can need more comparisons to find a value. Suppose a B-tree has order 100: in the worst case every node we visit costs close to 100 comparisons, whereas the number of comparisons in the three structures above never exceeds the depth of the tree. Since a B-tree needs relatively more comparisons, why use it at all? It depends on the application scenario.

  When B-trees are mentioned, the first thing that comes to mind is database indexes. The main index type in MySQL is the BTree index (actually implemented with a B+ tree, discussed later). From its structure we can see that a B-tree relies on many pointers to maintain the relationships between nodes, which means a B-tree is not contiguous in physical storage. The data inside a single node is stored contiguously, but different nodes are generally stored separately and refer to each other through pointers. In practice a BTree index is stored on disk, and only the nodes that are needed are loaded into memory for comparison.

  Why not load the whole BTree index into memory at once? In production an index often covers millions or even tens of millions of rows, so the index itself takes up a lot of memory. We usually use more than one index, and other programs need memory too, so loading every index into memory in one go is unrealistic. Only the part currently in use is loaded into memory; the parts not in use stay on disk or are evicted from memory.

  Loading on demand raises a question of its own. Every time we need to visit a node in the tree, we need one disk IO to load that node from disk into memory. For a balanced binary tree or a B-tree, the maximum number of nodes visited in a lookup is the depth of the tree (think of the search process above). A B-tree node can store several values, while a node in a binary structure such as a balanced binary tree stores only one, so for the same number of values a B-tree is shallower than a binary tree. This means an index built on a B-tree needs fewer disk IOs.

  For a tree with n elements, the depth of a binary search tree lies between log2(n) and n, the depth of a balanced binary tree is about log2(n), and a red-black tree is similar to a balanced binary tree, with an average depth of about log2(n). The B-tree, through simple and efficient maintenance operations, keeps its depth roughly between log_m(n) and log_ceil(m/2)(n), which greatly reduces the height of the tree (ceil is the round-up function; for example, ceil(5/2) = 3).
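
  To get a feel for these bounds, here is a small Python calculation. The numbers are purely illustrative: n = 1,000,000 and m = 100 are values I picked for the example, and btree_depth_bounds is a helper name made up here.

```python
import math


def btree_depth_bounds(n, m):
    """Return the estimates (log_m(n), log_ceil(m/2)(n)) for a B-tree of order m."""
    shallowest = math.log(n) / math.log(m)
    deepest = math.log(n) / math.log(math.ceil(m / 2))
    return shallowest, deepest


n = 1_000_000  # hypothetical number of elements
binary_depth = math.log2(n)
low, high = btree_depth_bounds(n, 100)

print(f"balanced binary tree: about {binary_depth:.1f} levels")
print(f"B-tree of order 100: about {low:.1f} to {high:.1f} levels")
```

  With these assumed numbers the binary tree needs about 20 levels, while the B-tree needs between 3 and 4.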

  Why does this matter? Disks are very slow compared with memory: a disk lookup is roughly 100,000 times slower than a memory lookup. In other words, in the time it takes to fetch 1 item from disk, we could fetch 100,000 items from memory. So when we look up data through an index, the time goes mainly into disk IO, not into comparing data, and we want as few disk IOs as possible. Because a B-tree is shallow, an index built on it needs fewer disk IOs than one built on a binary tree. This is the biggest advantage of the B-tree.

  In a binary tree a node usually stores only one element, but data is loaded from disk into memory page by page. A page is the smallest unit loaded from disk into memory, typically 4K. So with a binary tree, loading the page that holds a node wastes most of that page. Each B-tree node can store many values, so we can tune the order of the B-tree until a node takes up roughly one page (4K), which minimizes the depth of the B-tree and makes better use of memory.
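
  As a rough sketch of how the order could be chosen to fill one page, assume each value takes 8 bytes and each child pointer takes 8 bytes. Both sizes are assumptions for the example (real node layouts also carry headers and other bookkeeping), and max_order is a made-up helper name.

```python
PAGE_SIZE = 4096  # bytes, the 4K page mentioned in the text


def max_order(page_size, key_size, pointer_size):
    """Largest order m such that (m - 1) values and m pointers fit in one page."""
    # (m - 1) * key_size + m * pointer_size <= page_size
    return (page_size + key_size) // (key_size + pointer_size)


# Hypothetical sizes: 8-byte values and 8-byte child pointers.
order = max_order(PAGE_SIZE, key_size=8, pointer_size=8)
print(order)  # 256
```

  Under these assumptions one 4K page holds a node of order 256, compared with a single element per page load for a binary tree.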


2.3 Problems with the B-tree

(1) It is hard to store the actual data

  In the introduction to B-trees above, I used the word "value" for the elements stored in the tree, but the figure is labeled with the word "key". That is because in practice we need to store key-value data. Take a database index: the indexed column value we search by is the key, but what we really want is the data row that corresponds to it, which is the value.

  A simple solution is to store both key and value in the B-tree nodes, so that finding a key gives us its value directly. But this causes another problem. As mentioned above, the size of a B-tree node is generally limited to one disk page (4K). If a node holds both keys and values, it can hold fewer elements, and the larger the values, the fewer elements fit. The tree then gets deeper, which defeats the purpose of using a B-tree to reduce disk IO, so this approach is not desirable. Of course, the value could instead be just the address of the data, but, probably because of the next problem, this is not what is done either.
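
  The effect of storing values inside the nodes can be estimated with a little arithmetic in Python. Everything here is an assumption for illustration: 8-byte keys, 8-byte pointers, 200-byte rows, one million entries, and a crude capacity model in which a tree with k levels holds about fanout^k entries. The names fanout_per_page and levels_needed are made up.

```python
PAGE_SIZE = 4096  # bytes


def fanout_per_page(page_size, key_size, pointer_size, value_size=0):
    """Largest m such that (m - 1) entries and m pointers fit in one page."""
    entry = key_size + value_size
    return (page_size + entry) // (entry + pointer_size)


def levels_needed(n, fanout):
    """Rough number of levels needed until fanout ** levels reaches n."""
    levels, capacity = 1, fanout
    while capacity < n:
        capacity *= fanout
        levels += 1
    return levels


n = 1_000_000  # hypothetical number of entries
keys_only = fanout_per_page(PAGE_SIZE, 8, 8)
with_values = fanout_per_page(PAGE_SIZE, 8, 8, value_size=200)

print(keys_only, levels_needed(n, keys_only))      # 256 3
print(with_values, levels_needed(n, with_values))  # 19 5
```

  Under these assumptions, putting 200-byte values in every node cuts the fan-out from 256 to 19 and adds two levels, that is, two more disk IOs per lookup.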

(2) The B-tree is poorly suited to range queries

  Range lookups are very frequent in databases, for example a query for all employees whose salary is between 1000 and 2000. A B-tree is poorly suited to this kind of lookup, because every node stores data and the nodes do not form a linear structure that would make a range scan convenient. To run a range query on a B-tree, we can locate the lower and upper bounds and then traverse every node between them with DFS (or BFS), but that is not convenient.
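
  To show what such a traversal looks like, here is a minimal Python sketch of a depth-first range query on a B-tree. The names BTreeNode and range_search are made up for this post, and the tree holds hypothetical values apart from 50, 10, 30 and 35.

```python
class BTreeNode:
    """A B-tree node: sorted values plus child pointers (empty for a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def range_search(node, low, high, out):
    """Collect every value in [low, high] in ascending order (in-order DFS)."""
    n = len(node.keys)
    for i in range(n + 1):
        # Child i holds the values between keys[i - 1] and keys[i].
        if node.children:
            may_reach_high = i == 0 or node.keys[i - 1] < high
            may_reach_low = i == n or node.keys[i] > low
            if may_reach_high and may_reach_low:
                range_search(node.children[i], low, high, out)
        if i < n and low <= node.keys[i] <= high:
            out.append(node.keys[i])
    return out


# Hypothetical tree; only 50, 10, 30 and 35 come from the text.
root = BTreeNode([50], [
    BTreeNode([10, 30], [BTreeNode([5]), BTreeNode([20]), BTreeNode([35])]),
    BTreeNode([70], [BTreeNode([60]), BTreeNode([80])]),
])

print(range_search(root, 20, 60, []))  # [20, 30, 35, 50, 60]
```

  The matching values are spread over leaves, internal nodes and the root, so the traversal has to move up and down the tree instead of scanning in one direction.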

  The B+ tree was derived from the B-tree with changes aimed at exactly these two problems.


2.4 B+ tree

  Compared with the B-tree, the B+ tree makes the following changes:

  • Non-leaf nodes in a B+ tree store only keys, not values; the actual values are all stored in the leaf nodes. This means every lookup visits the same number of nodes, because we always have to search all the way down to a leaf;
  • Each leaf node has a pointer to the next leaf node, so all leaf nodes are chained together into a linear structure;
  • In a B+ tree of order m, each node has at most m children and stores up to m keys (a B-tree of order m stores at most m-1 keys per node);
  • The minimum (or maximum) key of each child node is also stored in its parent node (the example below makes this clear);

  Let us look at the structure of a B+ tree through a figure:

  The figure above shows a B+ tree. The root stores 3 keys, 5, 28 and 65, in ascending order, and holds 3 pointers to its 3 children. The smallest key in the first child is 5, which is the first key of the root, and every key in that child lies between the root's first key (5) and second key (28), not including 28. The smallest key in the second child is 28, the second key of the root, and every key in that child lies between the root's second key (28) and third key (65). Likewise, the third child contains the root's third key (65), and all of its keys are >= 65. The same holds for the nodes further down.

  The figure shows that the leaf nodes at the bottom level store every key (even though some keys also appear in upper nodes), and they store not only the keys but also the value for each key. In addition, each leaf node holds a pointer to the next leaf node. Chained together, the leaves form a linear structure sorted by key in ascending order.

  What are the benefits of this design? We do store values in the tree, but because they live only in the leaf nodes, the non-leaf nodes that act as the index do not grow in size, so the height of the tree does not increase. And because the values are in the leaf nodes and the leaf nodes are chained together, range queries become very convenient. For example, to fetch the data for keys 26 through 60 in the figure, we first find the leaf that holds 26, which is the third leaf. We read the third leaf into memory and find that it does not contain all the data, so we follow its pointer to the fourth leaf and read that into memory. It still does not contain everything, so we read the fifth leaf as well, and at that point we have all the data.
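
  The range scan described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: keys are unique, each parent key is the minimum key of the matching child (as in the list of changes above), and the leaf contents are hypothetical values chosen so that the root holds 5, 28 and 65, key 26 sits in the third leaf and key 60 in the fifth. The names Leaf, Internal, find_leaf and range_query are made up for this post.

```python
from bisect import bisect_right


class Leaf:
    """Leaf node: sorted keys, their values, and a pointer to the next leaf."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None


class Internal:
    """Non-leaf node: stores only keys, one per child (the child's minimum key)."""

    def __init__(self, children):
        self.children = children
        self.keys = [child.keys[0] for child in children]


def find_leaf(node, key):
    """Descend from the root to the leaf whose key range covers key."""
    while isinstance(node, Internal):
        # Pick the last child whose minimum key is <= key.
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_query(root, low, high):
    """Return (rows, leaves_read) for all keys in [low, high]."""
    leaf = find_leaf(root, low)
    rows, leaves_read = [], 0
    while leaf is not None:
        leaves_read += 1
        rows.extend((k, v) for k, v in zip(leaf.keys, leaf.values) if low <= k <= high)
        if leaf.keys[-1] >= high:
            break  # keys are unique and sorted, so later leaves are out of range
        leaf = leaf.next
    return rows, leaves_read


# Hypothetical leaf contents.
leaf_keys = [[5, 8], [10, 15], [20, 26], [28, 35], [56, 60], [65, 80], [90, 96]]
leaves = [Leaf(ks, [f"row-{k}" for k in ks]) for ks in leaf_keys]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

root = Internal([Internal(leaves[0:3]), Internal(leaves[3:5]), Internal(leaves[5:7])])

rows, leaves_read = range_query(root, 26, 60)
print(root.keys)              # [5, 28, 65]
print([k for k, _ in rows])   # [26, 28, 35, 56, 60]
print(leaves_read)            # 3
```

  After one descent from the root, the scan only follows next pointers: it reads the third, fourth and fifth leaves and never goes back up the tree.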


2.5 Index implementation in InnoDB and MyISAM

  Before MySQL 5.5 the default storage engine was MyISAM; since then it has been InnoDB. Both storage engines implement their indexes with B+ trees, but they differ in how they do it.

(1) InnoDB: clustered index

  A clustered index is one where the index and the table data are stored together. InnoDB implements its clustered index with a B+ tree, and the rows of the table are actually stored in the leaf nodes of that tree. In the key-value terms used above, the value is a specific data row: once we find a key in a leaf node, we have in fact found the row for that key. Strictly speaking, then, a clustered index is not just an index but also the way the data is stored.

  InnoDB uses the table's primary key as the key when building the clustered index. If the table has no primary key, it picks a unique, non-null index instead; if there is none, it implicitly defines a primary key and builds the index on that.

(2) MyISAM: non-clustered index

  MyISAM does not use a clustered index. The B+ tree index in MyISAM does not contain the table's data rows; the values in its leaf nodes are the addresses of the data rows. That is, the index and the data are stored separately.
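
  The difference between the two layouts can be shown with a toy Python model. This is only a conceptual sketch, not MySQL internals: plain dicts stand in for the leaf level of a B+ tree, a list stands in for MyISAM's separate data storage, and all names and rows are hypothetical.

```python
# Clustered (InnoDB-style): the value stored under each key is the row itself.
clustered_index = {
    1: {"id": 1, "name": "alice", "salary": 1500},
    2: {"id": 2, "name": "bob", "salary": 2500},
}

# Non-clustered (MyISAM-style): rows live in a separate data area,
# and the value stored under each key is only the row's address.
data_file = [
    {"id": 2, "name": "bob", "salary": 2500},    # address 0
    {"id": 1, "name": "alice", "salary": 1500},  # address 1
]
address_index = {1: 1, 2: 0}


def lookup_clustered(key):
    # One step: finding the key already yields the row.
    return clustered_index[key]


def lookup_non_clustered(key):
    # Two steps: find the address in the index, then fetch the row from the data area.
    address = address_index[key]
    return data_file[address]


print(lookup_clustered(1)["name"])      # alice
print(lookup_non_clustered(1)["name"])  # alice
```

  Both lookups return the same row; the non-clustered one simply takes an extra hop from the index to the separately stored data.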


III. Summary

  In MySQL, the BTree index is built on a B+ tree, not a B-tree. InnoDB uses a clustered index, where the leaf nodes of the B+ tree store the table's rows directly; MyISAM does not use a clustered index, and the leaf nodes of its B+ tree store the addresses of the data rows.


Origin www.cnblogs.com/tuyang1129/p/12635692.html