What is the difference between MySQL InnoDB B+ tree indexes and hash indexes? Why does MongoDB use B-trees?

The most important difference between a B-tree and a B+ tree is that a B+ tree stores data only in its leaf nodes and uses all other nodes purely for indexing, whereas every node of a B-tree, internal or leaf, carries a data field.

B+ tree

A B+ tree is a balanced search tree (not a binary tree) designed for disks and other secondary storage devices. In a B+ tree, all records are stored in key order in leaf nodes that sit at the same level, and the leaf nodes are linked to each other by pointers.

B+ tree indexes in the database are divided into clustered indexes and non-clustered (secondary) indexes. Both are height-balanced B+ trees whose leaf nodes hold all of the index entries. The difference is whether a leaf entry stores the whole row: in InnoDB, a clustered index leaf holds the full row, while a secondary index leaf holds the indexed columns plus the primary key of the row.

The B+ tree has the following characteristics:

  • Each node can have many children, for two reasons. One is to reduce the height of the tree. The other is to divide the key range into more intervals: the more intervals per node, the faster a lookup narrows down.
  • A node no longer stores just one key; it can store multiple keys.
  • Non-leaf nodes store only keys, and leaf nodes store keys and data.
  • Leaf nodes are linked to their neighbors by pointers, which improves sequential query performance.
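The relationship between the number of children per node and the height of the tree can be checked with a little arithmetic. The sketch below is only an illustration under a simplifying assumption: every node is completely full and has exactly `fanout` children. The function name `min_levels` and the sample sizes are made up for this example.

```python
def min_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels needed so that fanout ** levels >= num_keys.

    Assumes every node is full and has exactly `fanout` children,
    which real trees only approximate.
    """
    levels = 1
    capacity = fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


# One million keys: a binary tree versus wider nodes.
binary_levels = min_levels(1_000_000, 2)      # 20 levels
wide_levels = min_levels(1_000_000, 100)      # 3 levels
wider_levels = min_levels(1_000_000, 1000)    # 2 levels
```

If each level costs one disk read, going from 2 children per node to 100 cuts the reads for a lookup from about 20 to about 3.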

In layman's terms

  • The non-leaf nodes of the B+ tree store only keys and take up very little space, so the nodes at each level can index a wider range of data. In other words, each I/O operation can search across more data.
  • Leaf nodes are linked to their neighbors, which fits the read-ahead behavior of the disk. For example, suppose a leaf node stores 50 and 55 and has a pointer to the leaf node that stores 60 and 62. When the data for 50 and 55 is read from the disk, read-ahead brings in the data for 60 and 62 as well (assuming the neighboring leaf is stored next to it on disk). This is a sequential read rather than a disk seek, so it is faster.
  • Range queries are supported and are very efficient. Because each node indexes a larger and more finely divided range, a single disk I/O on a B+ tree carries more index information than on a B-tree, so I/O efficiency is higher.

Range queries are efficient because the data is stored in the leaf level and each leaf has a pointer to the next leaf, so a range query only needs to traverse the leaf level rather than the entire tree.
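The leaf-level traversal can be sketched as a walk along a linked list. This is a minimal illustration, not InnoDB's implementation: the `Leaf` class, the `range_scan` function, and the row labels are hypothetical, and the scan starts from the first leaf instead of descending from the root to find the starting leaf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                      # sorted keys held by this leaf
    values: list                    # row data, parallel to keys
    next: Optional["Leaf"] = None   # pointer to the right-hand neighbor


def range_scan(start_leaf, low, high):
    """Collect (key, value) pairs with low <= key <= high along the leaf chain."""
    out = []
    leaf = start_leaf
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return out          # keys are sorted, so the scan can stop here
            if k >= low:
                out.append((k, v))
        leaf = leaf.next
    return out


# Three linked leaves: [50, 55] -> [60, 62] -> [70, 75]
third = Leaf([70, 75], ["r70", "r75"])
second = Leaf([60, 62], ["r60", "r62"], third)
first = Leaf([50, 55], ["r50", "r55"], second)

result = range_scan(first, 55, 70)
# result == [(55, "r55"), (60, "r60"), (62, "r62"), (70, "r70")]
```

The scan never goes back up to an internal node; it only follows `next` pointers until it passes the upper bound.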

The principle of locality and disk read-ahead

Because disk access is much slower than memory access, disk I/O must be minimized to improve efficiency. The disk is usually not read strictly on demand; instead, each read is followed by read-ahead. After the disk reads the required data, it sequentially reads a certain length of the following data into memory as well. The rationale for this is the well-known locality principle in computer science:

When a piece of data is used, its nearby data is usually used immediately, and the data required during program operation is usually concentrated.

B-tree

In "B-tree", the B is commonly taken to stand for "balanced". A B-tree is a multi-way self-balancing search tree. It is similar to an ordinary balanced binary tree, except that each node may have more than two child nodes.

B-trees have the following characteristics:

  • Keys are distributed across all nodes of the tree.
  • Each key appears in exactly one node.
  • A search may end at a non-leaf node.
  • Searching over the complete key set performs comparably to binary search.
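A search that ends at a non-leaf node can be illustrated with a small hand-built tree. This is a sketch under stated assumptions: `BTreeNode` and `btree_search` are hypothetical names, the tree is built by hand rather than by insertion, and node splitting and balancing are left out.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, values, children=None):
        self.keys = keys                  # sorted keys in this node
        self.values = values              # data stored next to each key
        self.children = children or []    # empty list for a leaf


def btree_search(node, key, visited=1):
    """Return (value, nodes_visited); value is None if the key is absent."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], visited    # hit: may happen in an internal node
    if not node.children:
        return None, visited              # reached a leaf without finding the key
    return btree_search(node.children[i], key, visited + 1)


root = BTreeNode(
    [30, 60], ["row30", "row60"],
    [
        BTreeNode([10, 20], ["row10", "row20"]),
        BTreeNode([40, 50], ["row40", "row50"]),
        BTreeNode([70, 80], ["row70", "row80"]),
    ],
)

hit_in_root = btree_search(root, 60)   # ("row60", 1): found without leaving the root
hit_in_leaf = btree_search(root, 40)   # ("row40", 2): one more node visited
miss = btree_search(root, 65)          # (None, 2)
```

The number of nodes visited depends on where the key happens to live, which is why B-tree lookup cost varies from key to key.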

The difference between B-tree and B+ tree

  • Internal nodes in the B+ tree do not store data; all data is stored in the leaf nodes. Every lookup therefore descends to a leaf, and the query cost is a uniform O(log n).
  • The cost of a B-tree query is not fixed. It depends on where the key sits in the tree and is O(1) in the best case, when the key is in the root.
  • B+ tree leaf nodes are linked to their neighbors, which makes range access much easier and is useful for range queries and similar operations.
  • In a B-tree, keys and data are stored together in every node, so a range query cannot be answered by a simple leaf scan; it needs an in-order traversal that moves up and down the tree.
  • B+ trees are better suited to external storage (data on disk). Since internal nodes have no data field, each node can index a larger and more finely divided range.

Why does MongoDB use B-trees?

To recap: internal nodes in the B+ tree do not store data, and all data is stored in the leaf nodes, so the query cost is a uniform O(log n). The cost of a B-tree query is not fixed; it depends on where the key sits in the tree and is O(1) in the best case.

As noted earlier, keeping disk I/O to a minimum is an effective way to improve performance. MongoDB is an aggregate-oriented database, and the B-tree keeps the key and data fields together in the same node.

As for why MongoDB uses a B-tree instead of a B+ tree, consider its design. It is not a traditional relational database but a NoSQL store that keeps data in a JSON-like format, and its goals are high performance, high availability, and easy scaling. First, because it drops the relational model, the second B+ tree advantage listed above (efficient range access through linked leaves) matters less. Second, MySQL uses a B+ tree, so all data sits in the leaf nodes and every query must descend to a leaf, whereas MongoDB uses a B-tree in which every node has a data field, so a query can return as soon as it finds the key. By this argument, a single-key query visits fewer nodes on average than in MySQL.

Hash index

Simply put, a hash index applies a hash function to the key to produce a hash value. A lookup does not have to descend level by level from the root node to a leaf node as in a B+ tree. Computing the hash once locates the corresponding position directly, which is very fast.
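The lookup path of a hash index can be sketched with a list of buckets. This is only an illustration, not how any particular database lays out its hash index: the `HashIndex` class, the bucket count, and the row labels are assumptions, and Python's built-in `hash` stands in for the hash algorithm.

```python
class HashIndex:
    def __init__(self, num_buckets=8):
        # Each bucket is a list of (key, row) pairs that hash to the same slot.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # One hash computation picks the bucket; no tree descent is involved.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, row):
        self._bucket(key).append((key, row))

    def get(self, key):
        # Entries in the bucket still have to be compared against the key,
        # because different keys (or duplicate keys) can share a bucket.
        return [row for k, row in self._bucket(key) if k == key]


idx = HashIndex()
idx.put(42, "row A")
idx.put(42, "row B")   # duplicate key: both rows land in the same bucket
idx.put(7, "row C")

rows_42 = idx.get(42)  # ["row A", "row B"]
rows_7 = idx.get(7)    # ["row C"]
rows_99 = idx.get(99)  # []
```

With unique keys the bucket holds one entry and the lookup is effectively a single step; with many duplicates or collisions the bucket scan grows longer.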

The difference between B+ tree index and hash index

  • For an equality query, the hash index has a clear advantage, because the matching key can be located after a single hash computation. The premise is that the key is unique. If the key is not unique, the lookup first finds the key's position and then scans along the linked list until it finds the matching data.
  • For a range query, the hash index is useless. The original keys are ordered, but after hashing they may no longer be contiguous, so the index cannot be used to answer a range query.
  • Similarly, a hash index cannot be used for sorting or for prefix fuzzy queries such as like 'xxx%' (a prefix fuzzy query is essentially a range query).
  • Hash indexes also do not support the leftmost-prefix matching rule of multi-column composite indexes.
  • The lookup efficiency of a B+ tree index is uniform across keys and does not fluctuate as much as that of a B-tree. When there are many duplicate key values, the efficiency of a hash index is also very low because of hash collisions.
  • For a summary of methods for resolving hash collisions, see https://blog.csdn.net/zeb_perfect/article/details/52574915
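The point that a prefix query is really a range query, and that only an ordered index can serve it, can be sketched as follows. A sorted Python list searched with `bisect` stands in for the ordered leaf level of a B+ tree, and a `set` stands in for a hash index; the function names and sample keys are made up for this illustration.

```python
from bisect import bisect_left


def prefix_scan_sorted(sorted_keys, prefix):
    """Ordered index: jump to the first candidate, then read neighbors in order."""
    i = bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out


def prefix_scan_hashed(hashed_keys, prefix):
    """Hashed index: order is lost, so every key has to be examined."""
    return sorted(k for k in hashed_keys if k.startswith(prefix))


names = ["alice", "alan", "bob", "albert", "carol"]
ordered = sorted(names)   # ["alan", "albert", "alice", "bob", "carol"]
hashed = set(names)

from_ordered = prefix_scan_sorted(ordered, "al")
from_hashed = prefix_scan_hashed(hashed, "al")
# Both give ["alan", "albert", "alice"], but the hashed version scans all keys.
```

Both functions return the same rows; the difference is the work done. The ordered version touches only the matching keys plus one more, while the hashed version has to look at every key.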

Remarks: The above content is excerpted from the Internet and is not original. It is shared only for personal study and discussion. If you find any mistakes, please leave a message.
