Without a B-tree in mind, how can you thoroughly understand the underlying principles of database indexes?

Binary Tree (Binary Search Trees)

A binary tree is a tree structure in which each node has at most two subtrees. The subtrees are usually called the "left subtree" and the "right subtree". Binary trees are often used to implement binary search trees and binary heaps.

A binary search tree has the following features:

  • Each node contains one element and n subtrees, where 0≤n≤2.

  • The left and right subtrees are ordered and cannot be swapped arbitrarily. Every value in the left subtree is less than the parent node, and every value in the right subtree is greater than the parent node.

Concepts alone are a bit dry. Suppose we have the set of numbers [35, 27, 48, 12, 29, 38, 55] and insert them one by one, in that order, into such a structure. The steps are as follows:

Well, this is a binary tree! After a series of insert operations, an originally unordered group of numbers has become an ordered structure, and the tree satisfies both binary tree features listed above.

But what happens if we take the same group of numbers and insert them in ascending order, that is, in the order [12, 27, 29, 35, 38, 48, 55]?

Because the insertion order is ascending, each newly inserted value is larger than every node already in the tree, so it is always inserted to the right. The tree ends up leaning badly to one side!

The figure shows the worst case: the tree degenerates into a linear linked list, so search is naturally slow and none of the tree's advantages are put to use!
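To make the difference concrete, here is a minimal sketch of a plain binary search tree (no balancing). The names `Node`, `insert`, `height` and `build` are made up for this illustration; the two key sequences are the ones used above.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the subtree rooted at root and return the subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def height(root):
    """Number of levels in the tree (0 for an empty tree)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


mixed = build([35, 27, 48, 12, 29, 38, 55])
ascending = build([12, 27, 29, 35, 38, 48, 55])
print(height(mixed))      # 3
print(height(ascending))  # 7
```

Same seven numbers, but the ascending order produces a tree seven levels deep, which is just a linked list in disguise.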

To get the most out of a binary tree's search efficiency, the tree must not lean to one side and has to stay balanced. That is where the balanced binary tree comes in!

Balanced binary tree (AVL Trees)

A balanced binary tree is a special binary tree, so it satisfies the two binary tree features mentioned earlier, plus one more: the heights of its left and right subtrees differ by at most 1, and the left and right subtrees are themselves balanced binary trees.

Look back at the figure after [35, 27, 48, 12, 29, 38, 55] was inserted: that tree is in fact already a balanced binary tree.

So what happens if we insert in the order [12, 27, 29, 35, 38, 48, 55] into a balanced binary tree?

Let's look at the insertion and rebalancing process:

The tree keeps satisfying the features of a balanced binary tree and stays balanced the whole time, so it never degenerates into a linear linked list!
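The rebalancing can be sketched with the textbook AVL rotations. This is only an illustration, assuming distinct keys; the helper names (`rotate_left`, `rotate_right`, `avl_insert`) are chosen for this example and are not taken from any particular library.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # a single node has height 1


def h(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y


def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x


def avl_insert(node, key):
    """Insert key, then rotate on the way back up if a subtree is 2 taller."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    else:
        node.right = avl_insert(node.right, key)
    update(node)
    balance = h(node.left) - h(node.right)
    if balance > 1:                   # left side too tall
        if key >= node.left.key:      # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                  # right side too tall
        if key < node.right.key:      # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


root = None
for key in [12, 27, 29, 35, 38, 48, 55]:
    root = avl_insert(root, key)
print(root.key, root.height)  # 35 3
```

Even with ascending input the tree ends up three levels high with 35 at the root, the same shape as the balanced tree from the first insertion order.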

To find a number we just walk down from the root, so the search is as efficient as a binary search!

**How many nodes can a balanced binary tree hold?** That depends on the height of the tree. Assuming the height is h, level n holds at most 2^(n-1) nodes, so the whole tree holds at most 2^0 + 2^1 + 2^2 + ... + 2^(h-1) = 2^h - 1 nodes.

On that basis, a tree holding 1,000,000 records is about 20 levels high. In other words, finding one record among 1,000,000 in a balanced binary tree takes 20 lookups in the worst case.
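The arithmetic is easy to check. The two helper functions below are just for this illustration:

```python
import math


def max_nodes(height):
    """Most nodes a binary tree with the given number of levels can hold."""
    return 2 ** height - 1  # 2^0 + 2^1 + ... + 2^(height - 1)


def min_height(n):
    """Fewest levels a binary tree needs in order to hold n nodes."""
    return math.ceil(math.log2(n + 1))


print(max_nodes(20))          # 1048575
print(min_height(1_000_000))  # 20
```

Nineteen levels top out at 524,287 nodes, so one million records need the twentieth level.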

If this all happened in memory, that would be very efficient! But the data in a database basically lives on disk, and reading each binary tree node is one disk IO. So finding a single record may take 20 disk IOs?

That makes performance a big problem! Could we squash the tree so that each level holds more nodes? I may be short, but I'm fat...

B-Tree

This short, fat tree is the B-Tree. Note that the bar in the middle is a hyphen, not a minus sign, so do not read it as "B minus tree".

What characteristics does a B-Tree have? A B-Tree of order m has the following characteristics:

  • Each node has at most m child nodes.

  • Except for the root and the leaf nodes, each node has at least m/2 (rounded up) child nodes.

  • If the root is not a leaf node, it has at least two child nodes.

  • All leaf nodes are on the same level.

  • Each node other than the root contains k elements (keys), where ⌈m/2⌉-1≤k≤m-1.

  • The elements (keys) in each node are in ascending order.

  • For each element (key), the values in its left child node are less than or equal to the element, and the values in its right child node are greater than or equal to the element.

Feels a bit like asking your future mother-in-law about the bride price and getting a whole list of conditions back, each one leaving you more bewildered!

Below we take inserting the array [6,7] into an order-3 B-Tree as an example and tie all the conditions together, and then it will make sense!

So, are the characteristics of the B-Tree clear now? In a binary tree, each node holds only one element.

In a B-Tree, however, each node may hold several elements, and each element in a non-leaf node has pointers to its left and right child nodes.

What does the process of finding an element look like? Look at the figure: if we want to find the key 24 in the B-Tree below, the process is as follows:

From this process we can see that a B-Tree lookup does not seem to need fewer comparisons than a balanced binary tree. But the lookup passes through far fewer nodes, which means far fewer disk IOs, and that is a big performance gain.
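Here is a rough sketch of that lookup which counts how many nodes are touched. The tree is a hypothetical order-3 example built by hand (it is not the tree from the figure), and `BTreeNode` and `search` are names invented for the illustration.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys stored in this node
        self.children = children or []  # empty for a leaf


def search(node, key):
    """Return (found, nodes_visited) for a lookup that starts at node."""
    visited = 0
    while node is not None:
        visited += 1                     # one node read is roughly one disk IO
        i = bisect_left(node.keys, key)  # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        node = node.children[i] if node.children else None
    return False, visited


# Hypothetical order-3 tree: at most 2 keys and 3 children per node.
root = BTreeNode([12, 30], [
    BTreeNode([5, 9]),
    BTreeNode([18, 24]),
    BTreeNode([35, 48]),
])
print(search(root, 24))  # (True, 2)
print(search(root, 20))  # (False, 2)
```

Eight keys fit in two levels here, whereas a binary tree needs at least four levels for eight keys. Comparisons still happen inside each node, but only two nodes have to be read.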

In the B-Tree figures so far, the elements are plain values such as 2 and 3.

But data in a database comes as whole rows. If a database stores its data in a B-Tree structure, how is that data laid out?

Look at the next figure:

In an ordinary B-Tree node, each element is just a number. In the figure above, though, we split each element into a key-data pair: Key is the primary key of the row, and Data is the row itself.

To find a row we just walk down from the root, which is fairly efficient.

B+Tree

B+Tree is an optimization of the B-Tree that makes it better suited to implementing index structures on external storage.

B+Tree is structurally similar to B-Tree, but has a few characteristics of its own:

  • Non-leaf nodes store only key information.

  • All satellite data (the actual row data) lives in the leaf nodes.

  • The leaf nodes together contain every element.

  • All leaf nodes are linked together by a chain of pointers.

If we turn the B-Tree figure above into a B+Tree, it looks like this:

Comparing it carefully with the B-Tree figure, what differences can we find?

  • Non-leaf nodes carry only Key information, which matches the first characteristic above!

  • Every leaf node has a Data area underneath it, which matches the second characteristic above!

  • Keys in the non-leaf nodes can also be found in the leaf nodes. For example, the root elements 4 and 8 also appear in the bottom-level leaf nodes, which matches the third characteristic above!

  • Note the arrows between the leaf nodes in the figure, which match the fourth characteristic above!
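The four characteristics can be put together in a small sketch. The tree below is hand-built and hypothetical (only the root keys 4 and 8 are borrowed from the description above), and `Leaf`, `Internal`, `find_leaf` and `range_scan` are names made up for this example.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, rows):
        self.rows = rows   # sorted list of (key, data) pairs
        self.next = None   # pointer to the next leaf in key order


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # keys only, no row data
        self.children = children  # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Walk from the root down to the leaf that could hold key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_scan(root, low, high):
    """Return data for low <= key <= high by walking the leaf chain."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key, data in leaf.rows:
            if key > high:
                return result
            if key >= low:
                result.append(data)
        leaf = leaf.next
    return result


# Root keys 4 and 8 also appear in the leaves, together with their data.
a = Leaf([(1, "r1"), (2, "r2"), (3, "r3")])
b = Leaf([(4, "r4"), (5, "r5"), (6, "r6")])
c = Leaf([(8, "r8"), (9, "r9")])
a.next, b.next = b, c
root = Internal([4, 8], [a, b, c])
print(range_scan(root, 3, 8))  # ['r3', 'r4', 'r5', 'r6', 'r8']
```

The range scan descends the tree once and then only follows `next` pointers between leaves.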

B-Tree or B+Tree?

Before discussing which of these two data structures a database should choose, we need one more piece of background: the operating system reads data from disk into memory in units of disk blocks (Block). All the data in the same disk block is read in one go, rather than only the bytes that were asked for.

Even if only one byte is needed, the disk starts from that position and sequentially reads a certain length of data beyond it into memory.

This is based on the well-known principle of locality in computer science: when a piece of data is used, the data near it is usually used soon afterwards.

The read-ahead length is generally an integer multiple of the page (Page) size. A page is the logical block of computer memory management: hardware and operating systems usually divide main memory and disk storage into consecutive blocks of equal size, and each such block is called a page (in many operating systems the page size is 4K).

How do we choose between B-Tree and B+Tree? What are their pros and cons?

① Because non-leaf nodes of a B-Tree also hold the actual data, a lookup can return as soon as it finds the key.

In a B+Tree all the data is in the leaf nodes, so every lookup has to go down to a leaf. For a B-Tree and a B+Tree of the same height, the B-Tree is therefore more efficient at finding a single key.

② Because all the data of a B+Tree is in the leaf nodes, and the leaf nodes are linked by pointers, a query for data greater than or less than some key only has to find that key and then traverse along the linked list. A B-Tree has to go back through the upper-level nodes again and again to do the same.

③ Each B-Tree node (a node here can be understood as one page of data) stores the primary key plus the actual data, whereas a B+Tree non-leaf node stores only key information. Since the size of each page is limited, a B-Tree page can hold fewer entries than a B+Tree page.

So for the same amount of data the B-Tree will be deeper, and a query needs more disk IOs, which hurts query efficiency.
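A back-of-the-envelope calculation shows how big the effect in point ③ can be. All the sizes below are assumptions picked for illustration (16 KB pages, 8-byte keys, 6-byte child pointers, 1 KB rows), and the B-Tree model is deliberately simplified to "every node is completely full".

```python
import math

PAGE = 16 * 1024  # assumed page size in bytes
KEY = 8           # assumed key size
POINTER = 6       # assumed child-pointer size
ROW = 1024        # assumed size of one full row


def bplus_tree_levels(rows):
    """Levels needed when only the leaves hold rows."""
    fanout = PAGE // (KEY + POINTER)         # entries per non-leaf page
    pages = math.ceil(rows / (PAGE // ROW))  # number of leaf pages
    levels = 1
    while pages > 1:
        pages = math.ceil(pages / fanout)
        levels += 1
    return levels


def b_tree_levels(rows):
    """Levels needed when every node stores key + row + pointer."""
    per_node = PAGE // (KEY + POINTER + ROW)  # entries per page
    levels, capacity = 0, 0
    while capacity < rows:
        # each level has (per_node + 1) times as many nodes as the one above
        capacity += per_node * (per_node + 1) ** levels
        levels += 1
    return levels


print(PAGE // (KEY + POINTER))       # 1170
print(bplus_tree_levels(1_000_000))  # 3
print(b_tree_levels(1_000_000))      # 5
```

Under these assumptions a non-leaf B+Tree page points at 1170 children while a B-Tree page holds only 15 entries, so one million rows need three levels in the B+Tree and five in the B-Tree.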

Given the comparison above, conventional relational databases generally choose the B+Tree data structure to store their data!

Below we use MySQL's InnoDB storage engine as the example; others such as SQL Server and Oracle work on similar principles!

Data storage in the InnoDB engine

The InnoDB storage engine also has the concept of a page. The default page size is 16K, so each read brings in 4 × 4K of data!

Suppose we now have a user table and write data into it:

One thing to note: when a new row is inserted into a page, to reduce data movement it is usually placed after the current last row or in the space left by a deleted row, so the data inside a page is not fully ordered physically (the page structure is covered in detail later).

To still be able to access the data in order, every record holds a pointer to the next record, which forms a sorted singly linked list. For ease of presentation, though, I have drawn the rows in order here!

Since there is still little data and it fits in one page, there is only a root node, and the primary keys and data are both stored in the root (the number on the left is the primary key; the values on the right are the name and gender).

Suppose Page1 is full after we have written 10 rows. How will new data be stored?

Let's keep going:

A friend named "Qin Shousheng" arrives, but Page1 has no room left, so a page split is needed and new pages are created.

What does this process look like in InnoDB?

  • Create a new Page2, then copy the contents of Page1 into Page2.

  • Create a new Page3 and put the "Qin Shousheng" row into Page3.

  • The original Page1 remains the root, but becomes a page that stores only the index and no data, with two child nodes, Page2 and Page3.

Two questions deserve attention here:

① Why copy Page1 into Page2 rather than create a new page as the root, which would save the cost of the copy?

If the root were re-created, the physical address where the root is stored could change often, which would make it harder to locate.

Also, InnoDB reads the root node into memory ahead of time, so it is better for the root's physical address to stay fixed!

② Page1 originally held ten rows and split when the 11th was inserted. Going by the B-Tree and B+Tree characteristics described earlier, this is a tree of order at least 11, so after the split each node should hold at least 11/2 = 5 elements (rounded down).

So after the split, shouldn't the rows with primary keys 1-5 stay in the original page, the rows with primary keys 6-11 go to the new page, and the root node hold primary key 6?

If it were done that way, the new pages would be only about 50% full, and page splits would happen more often.

So InnoDB optimizes this case: the new row goes into the newly created page, and no record in the original page is moved.

As more data is written, the tree gradually fills out, as shown below:

Each time a page fills up, a new page is created and writing continues there. There is actually a hidden condition behind this: the primary key is auto-incrementing!

With an auto-increment primary key, newly inserted rows never touch the existing pages, so insertion is efficient and page utilization is high!

If the primary key is unordered or random, however, each insert may split an existing page, and frequent splits hurt insert efficiency and lower page utilization! This is why an auto-increment primary key is recommended in InnoDB!
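A toy simulation makes the effect visible. It models only the leaf pages as sorted Python lists with an assumed capacity of 10 rows, appends a fresh page when the key is larger than everything so far, and otherwise splits a full page in the middle. This is a simplified model of the behaviour described above, not InnoDB's actual algorithm, and `load` and `utilization` are names invented here.

```python
import random
from bisect import bisect_left, insort

CAPACITY = 10  # assumed rows per page, as in the Page1 example


def load(keys):
    """Insert keys one by one and return the resulting list of leaf pages."""
    pages = [[]]
    for key in keys:
        # Pick the first page whose largest key is >= key, else the last page.
        maxima = [page[-1] if page else float("inf") for page in pages]
        i = min(bisect_left(maxima, key), len(pages) - 1)
        page = pages[i]
        if len(page) < CAPACITY:
            insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            pages.append([key])   # append-only insert: start a fresh page
        else:
            half = CAPACITY // 2  # middle split: move the upper half out
            pages.insert(i + 1, page[half:])
            del page[half:]
            insort(pages[i] if key < pages[i + 1][0] else pages[i + 1], key)
    return pages


def utilization(pages):
    return sum(len(p) for p in pages) / (len(pages) * CAPACITY)


keys = list(range(1, 1001))
sequential = load(keys)
random.Random(42).shuffle(keys)  # fixed seed so the run is repeatable
shuffled = load(keys)
print(len(sequential), utilization(sequential))  # 100 1.0
print(len(shuffled), utilization(shuffled))
```

With incrementing keys every page ends up completely full. With shuffled keys the same 1000 rows are spread over more, partly empty pages because of the middle splits.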

The non-leaf nodes of this tree store primary keys, so what happens if a table has no primary key? In InnoDB, if a table has no primary key, the engine looks for a column with a unique index and uses it by default; if there is none, it generates an invisible field to serve as the primary key!

Where there are inserts there are deletes. If the user table sees frequent inserts and deletes, the data pages become fragmented and page space utilization drops, which also makes the tree "taller" and lowers query efficiency! Rebuilding the index removes the fragmentation and improves query performance!

Data lookup in the InnoDB engine

How is a piece of data found?

  • **Find the page where the data resides.** This is the same B+Tree search process described earlier: start at the root and work down to the leaf node.

  • **Find the specific data within the page.** Read the leaf node found in step 1 into memory, then locate the specific row with a block search.

It is just like looking up a Chinese character in the Xinhua Dictionary: first use the pinyin index to find the page the character is on, then turn to that page and find the character itself.

Once InnoDB has located the page, what strategy does it use to find a primary key quickly? To answer that we need to understand the page structure.

The blue area on the left is called the Page Directory. It is made up of multiple slots (Slot) and is a sparse index structure: one slot can own several records, at least 4 and at most 8.

The slots are stored in order, so when looking for a row we can first use binary search over the slots to find its approximate location.

The area on the right is the data area; each data page contains multiple rows. Note the two special rows at the top and bottom of the figure, Infimum and Supremum. These are two virtual rows.

When there is no user data, the next-record pointer of Infimum points to Supremum.

When there is user data, the next-record pointer of Infimum points to the smallest user record in the page, and the next-record pointer of the largest user record in the page points to Supremum, so that all the rows in the page form a singly linked list.

The Page Directory divides the rows into several logical blocks. The blocks are ordered relative to each other: the largest primary key in the block that slot "4" points to is smaller than the smallest primary key in the block that slot "8" points to. The rows inside a block, however, are not necessarily stored in order.

Every row record has an n_owned field (the pink area in the figure), which gives the number of rows in the block.

The n_owned value of the pseudo-record Infimum is always 1, the n_owned of Supremum ranges over [1,8], and the n_owned of other user records ranges over [4,8].

In each block only the largest record carries an n_owned value; the n_owned of the other user records is 0.

So when we look for the record with primary key 6, we first use binary search over the sparse index to find the matching slot, which is the "8" slot in the Page Directory.

The "8" slot points to the largest record in its block, and since the records form a singly linked list, we cannot search backward from it.

So we go to the previous slot, the "4" slot, and starting from its largest user record follow the next-record pointers along the list until we reach the target record.
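The two-stage search inside a page can be sketched as follows. The page content is hypothetical (primary keys 1 to 12, grouped so that slots "4" and "8" exist as in the description), and `Record`, `slots` and `find_in_page` are illustrative names rather than InnoDB internals.

```python
class Record:
    def __init__(self, key, n_owned=0):
        self.key = key
        self.n_owned = n_owned  # non-zero only on the last record of a group
        self.next = None        # next record in primary-key order


INF, SUP = float("-inf"), float("inf")  # stand-ins for Infimum / Supremum

# Hypothetical page holding primary keys 1..12.
records = [Record(INF, 1)] + [Record(k) for k in range(1, 13)] + [Record(SUP)]
for cur, nxt in zip(records, records[1:]):
    cur.next = nxt
by_key = {r.key: r for r in records}
by_key[4].n_owned, by_key[8].n_owned, by_key[SUP].n_owned = 4, 4, 5

# Page Directory: one slot per group, pointing at the group's largest record.
slots = [by_key[INF], by_key[4], by_key[8], by_key[SUP]]


def find_in_page(key):
    """Binary-search the slots, then walk the linked list inside one group."""
    lo, hi = 0, len(slots) - 1
    while lo < hi:  # find the first slot whose key is >= key
        mid = (lo + hi) // 2
        if slots[mid].key < key:
            lo = mid + 1
        else:
            hi = mid
    # The list is singly linked, so start from the previous slot's record.
    rec = slots[lo - 1] if lo > 0 else slots[0]
    steps = 0
    while rec is not None and rec.key < key:
        rec = rec.next
        steps += 1
    found = rec is not None and rec.key == key
    return found, slots[lo].key, steps


print(find_in_page(6))  # (True, 8, 2)
```

For primary key 6 the binary search lands on slot "8", the walk starts at record 4 (the record slot "4" points to), and two pointer hops reach the target.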

Clustered index & non-clustered index

The data storage shown so far is implemented as a clustered index. If the user table above needs a non-clustered index on "user name", how is that implemented?

Let's take a look:

The storage structure of a non-clustered index is the same as before. The difference is that the data part of a leaf node no longer holds the actual row data, but the clustered index Key of the row.

So a lookup through a non-clustered index first finds the clustered index Key that corresponds to the index Key, then takes that clustered index Key to the primary key index tree to find the actual data. This process is called going back to the table!
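The two-step lookup can be mimicked with ordinary Python containers. A dict stands in for the clustered index tree and a sorted list for the non-clustered one; the sample rows and the `find_by_name` helper are made up for this sketch.

```python
from bisect import bisect_left

# Clustered index: primary key -> full row (hypothetical sample rows).
clustered = {
    1: {"id": 1, "name": "Alice", "gender": "F"},
    2: {"id": 2, "name": "Carol", "gender": "F"},
    3: {"id": 3, "name": "Bob", "gender": "M"},
}

# Non-clustered index on name: sorted (name, primary key) pairs, no row data.
by_name = sorted((row["name"], pk) for pk, row in clustered.items())


def find_by_name(name):
    """Look up the secondary index first, then go back to the table."""
    i = bisect_left(by_name, (name,))
    rows = []
    while i < len(by_name) and by_name[i][0] == name:
        pk = by_name[i][1]          # step 1: secondary index yields the key
        rows.append(clustered[pk])  # step 2: clustered index yields the row
        i += 1
    return rows


print(find_by_name("Bob"))  # [{'id': 3, 'name': 'Bob', 'gender': 'M'}]
```

Every match costs a second lookup in the clustered index, which is exactly the extra work of going back to the table.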

PS: the names in the figure come from the internet; I hope none of them offends anyone reading this article ~ ^_^

InnoDB and MyISAM engine comparison

Everything above, storage and lookup alike, used the InnoDB engine as the example. So how does MyISAM store data differently from InnoDB? Enough talk, here is the figure:

The figure shows the MyISAM storage structure of a primary key index. The differences we can see are:

  • The data area of a leaf node in the primary key index tree does not hold the actual data; it holds the address of the data record.

  • The data is not stored in primary key order; it is stored in the order it was written.

In other words, the InnoDB engine physically stores data in primary key order, while the MyISAM engine physically stores data in insertion order.

And because MyISAM leaf nodes hold no data, its non-clustered indexes and its primary key index have similar storage structures: the non-clustered index tree leads straight to the address of the data. A lookup through a non-clustered index therefore does not need to go back to the table, which makes it more efficient than in InnoDB!

Index Optimization Tips

Many articles and books give recommendations on index usage, for example:

  • A LIKE fuzzy query that begins with % makes the index unusable.

  • Try not to build more than five indexes on one table.

  • Try to use a covering index.

  • Try not to build indexes on columns with many duplicate values.

  • ......

There are many more that I won't list here! Having read this article, can we go back and analyze why these recommendations exist?

Why does a LIKE fuzzy query that begins with % make the index unusable? Why should a table have no more than 5 indexes?

Why? Why?? Why??? I believe that with what you have read here plus a little thinking of your own, you already have the answers, right?
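As a nudge for the first question, here is one way to picture it. A sorted Python list stands in for the ordered keys of an index; the names in it and the two helper functions are made up for the illustration, and a real query optimizer is of course more involved.

```python
from bisect import bisect_left

names = sorted(["adam", "alan", "alex", "bob", "carl", "max", "rex"])


def like_prefix(prefix):
    """LIKE 'prefix%': binary-search to the start, then read a contiguous run."""
    i = bisect_left(names, prefix)
    matches, examined = [], 0
    while i < len(names):
        examined += 1
        if not names[i].startswith(prefix):
            break
        matches.append(names[i])
        i += 1
    return matches, examined


def like_suffix(suffix):
    """LIKE '%suffix': sort order gives no starting point, so scan everything."""
    matches = [n for n in names if n.endswith(suffix)]
    return matches, len(names)


print(like_prefix("al"))  # (['alan', 'alex'], 3)
print(like_suffix("ex"))  # (['alex', 'rex'], 7)
```

The prefix search looks at three entries, while the suffix search has to look at all seven, because entries ending in "ex" are scattered all over the sorted order.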

Origin juejin.im/post/5df09dd0f265da33ce4567fb