4. InnoDB indexes and algorithms


1. Common index models

The purpose of an index is to make data queries and retrieval more efficient. Common index models are the hash table, the ordered array, and the search tree.

Hash table

A hash table stores data as key-value pairs. Because the storage structure is unordered, a hash index is unsuitable for range queries or for using the index to satisfy sorting, but it is very efficient for equality lookups.
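The trade-off can be seen in a minimal sketch that uses a Python dict as a stand-in for a hash index. The keys, row IDs, and function names are hypothetical and are not part of any MySQL or Redis API.

```python
# A dict stands in for a hash index: key -> row id (hypothetical data).
index = {"alice": 101, "bob": 102, "carol": 103}

def equal_lookup(index, key):
    """Equality lookup: one hash probe, independent of the number of keys."""
    return index.get(key)

def range_lookup(index, low, high):
    """Range lookup: the hash table has no order, so every key is examined
    and the matches are sorted afterwards."""
    return sorted(k for k in index if low <= k <= high)
```

The equality lookup touches one bucket, while the range lookup has to scan the whole table.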

Redis ZSET adds a skip list alongside the hash table, so it supports both hash access and fast range queries.

Ordered array

Ordered arrays perform well for both equality and range queries: a lookup by binary search takes O(log N). However, inserting or deleting an element requires shifting the elements after it, which is expensive, so ordered arrays are only suitable for static storage engines whose data rarely changes.
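As a rough illustration, the following sketch uses Python's standard `bisect` module on a sorted list. The sample keys and function names are assumptions made for the example.

```python
import bisect

keys = [3, 8, 15, 21, 34, 55]  # sorted index keys (hypothetical data)

def find(keys, target):
    """Equality lookup by binary search, O(log N). Returns the position or -1."""
    i = bisect.bisect_left(keys, target)
    return i if i < len(keys) and keys[i] == target else -1

def range_query(keys, low, high):
    """All keys in [low, high]: two binary searches, then a contiguous slice."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return keys[lo:hi]

def insert(keys, key):
    """Keeps the array sorted, but shifts every later element: O(N)."""
    bisect.insort(keys, key)
```

Lookups and range scans stay cheap, while every insert in the middle moves the tail of the array.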

Search tree

A binary search tree has the property that every node is larger than its left child and smaller than its right child. A query takes O(log N). To maintain O(log N) query complexity the tree must be kept balanced, so an update is also O(log N).

A search tree can also have more than two children per node (an N-ary tree), with the children ordered from left to right.

A binary tree has the highest search efficiency in terms of comparisons, but it is rarely used for database indexes. The reason is that an index lives not only in memory but also on disk, and a tall binary tree touches many nodes per lookup. To reduce the number of IOs, a query should access as few data blocks as possible.

2. B+ tree index

B+Tree is a balanced multi-way search tree designed for hard disks and other external storage. Compared with a balanced binary search tree, each B+Tree node stores multiple elements and has many children, so the tree is shorter, which effectively reduces the number of IOs.
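To get a feeling for the height difference, here is a rough estimate under a simplified model in which a tree with `h` levels and a given fanout can address about `fanout ** h` keys. The function name and the sample fanout of 1000 are assumptions for illustration.

```python
def min_height(n_keys, fanout):
    """Smallest number of levels h such that fanout ** h >= n_keys.
    A rough model: it ignores partially filled nodes and page overhead."""
    height, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height
```

For ten million keys, `min_height(10_000_000, 2)` gives 24 levels, while `min_height(10_000_000, 1000)` gives 3. If each level costs one disk read, that is the difference between about 24 IOs and about 3.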

Both MyISAM and InnoDB use B+Tree as the index structure. The difference is that in MyISAM the index and the data are stored separately, and the B+ tree stores the address of the data; in InnoDB the table data itself is organized as a B+Tree whose leaf nodes store the data records. This table structure is called an index-organized table. In a database the height of a B+Tree is generally 2-4 levels, so a query needs only 2-4 IOs.

2.1 B-Tree: ordered array + balanced multi-way tree

  1. A B-Tree of order m has at most m child nodes and m-1 keys per node. (PS: the m-1 keys divide the key space into m intervals; a child node does not repeat elements of its parent, so each element appears only once in the tree.)
  2. Except for the root node and leaf nodes, each node has at least ⌈m/2⌉ child nodes.
  3. The elements of each node are sorted in ascending order.


2.2 B+Tree: ordered array + linked list + balanced multi-way tree

  1. A node has as many elements as it has children. (Order m: m elements, m child nodes.)
  2. The parent node holds the minimum value of each subtree.
  3. It follows from 2 that the leaf nodes store all elements.
  4. The leaf nodes are connected by pointers (a doubly linked list).
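The linked leaf level is what makes range scans cheap. The following is a minimal sketch with a hypothetical `Leaf` class; in a real tree the starting leaf would be found by descending from the root, which is left out here.

```python
class Leaf:
    """A leaf page: sorted keys plus pointers to both neighbours."""
    def __init__(self, keys):
        self.keys = keys
        self.prev = None
        self.next = None

def link(leaves):
    """Connect the leaves into a doubly linked list, left to right."""
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left

def range_scan(start_leaf, low, high):
    """Walk the leaf chain from start_leaf and collect keys in [low, high]."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result  # keys are ordered, nothing further can match
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result

leaves = [Leaf([1, 4]), Leaf([7, 9]), Leaf([12, 15])]
link(leaves)
```

Once the first matching leaf is known, the scan never goes back to the upper levels of the tree.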


2.3 Comparison

Each element in a B-Tree appears only once, while in a B+Tree all elements appear in the leaf nodes. A B-Tree therefore has to store data in every node, while a B+Tree stores data only in the leaf nodes, so its internal nodes are smaller and its order can be larger. Besides random lookups, a B+Tree also supports sequential retrieval. Its query efficiency is more stable as well: every lookup goes from the root node to a leaf node, so the path length is always the same.

2.4 B+ tree operation

To keep the index ordered, the B+ tree must be maintained whenever the index changes.

Find

The search descends from the root as in a binary search tree, and binary search is used inside each node.
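A minimal sketch of this lookup follows, using the convention from section 2.2 that a parent key is the minimum key of the corresponding subtree. The `Node` class and the sample tree are hypothetical and are built by hand rather than by insertion.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys in this node
        self.children = children  # child nodes (internal nodes only)
        self.values = values      # row data (leaf nodes only)

def search(node, key):
    """Descend from the root to a leaf, binary-searching inside each node."""
    while node.children is not None:
        # keys[i] is the minimum key of children[i]:
        # pick the last child whose minimum is <= key.
        i = bisect.bisect_right(node.keys, key) - 1
        if i < 0:
            return None  # key is smaller than every key in the tree
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

leaf1 = Node([1, 4], values=["r1", "r4"])
leaf2 = Node([7, 9], values=["r7", "r9"])
leaf3 = Node([12, 15], values=["r12", "r15"])
root = Node([1, 7, 12], children=[leaf1, leaf2, leaf3])
```

Every lookup, whether it succeeds or not, visits exactly one node per level.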

Insert

When an index entry is inserted at a random position and the target data page is full, a page split occurs: a new data page is allocated and part of the data is moved to it. Page splits cost performance (they involve disk operations) and also lower page utilization. To reduce page splits, when a sibling node is not full InnoDB performs a rotation and moves records to the sibling instead.
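The split itself can be sketched with plain sorted lists standing in for data pages. This is a toy model, not InnoDB's actual algorithm: `PAGE_CAPACITY`, the function name, and the even split point are all assumptions, and the rotation optimization is left out.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical; a real 16 KB page holds far more records

def insert_with_split(pages, key):
    """pages: non-empty list of non-empty sorted lists, in key order.
    Inserts key into the right page and splits that page if it overflows."""
    # Target page: the last page whose first key is <= key (or the first page).
    firsts = [p[0] for p in pages]
    idx = max(bisect.bisect_right(firsts, key) - 1, 0)
    page = pages[idx]
    bisect.insort(page, key)
    if len(page) > PAGE_CAPACITY:
        mid = len(page) // 2
        new_page = page[mid:]        # move the upper half to a new page
        del page[mid:]
        pages.insert(idx + 1, new_page)
    return pages
```

Inserting 25 into `[[10, 20, 30, 40], [50, 60]]` yields `[[10, 20], [25, 30, 40], [50, 60]]`: the split leaves two partly filled pages where there was one full page, which is the drop in page utilization mentioned above.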

Appending does not split an existing page in this way. However, as with random insertion, adding the new page's entry to the parent node can overflow the parent as well; when that happens the B+ tree grows upward and becomes taller.

Delete

When records have been deleted from two adjacent pages and their utilization is low, the pages are merged. InnoDB uses a fill factor to control how the tree changes on deletion.

2.5 Why InnoDB chooses the B+ tree

B+Tree is a balanced multi-path search tree designed for random access to external memory.

Compared with a binary tree, each node can store more elements, and each IO can read more elements; the height of the tree is therefore lower, and fewer IOs are required.

Compared with a B-Tree, a B+Tree stores data only in the leaf nodes, so its internal nodes are more compact and its order is larger, which again means fewer IOs. B+Tree query efficiency is also more stable, and it supports sequential retrieval.

However, for keys that sit in nodes close to the root, a B-Tree is faster, because its internal nodes contain both the key and the value (data address), so there is no need to descend to a leaf node.

3. Index Organized Table

The table data in InnoDB is stored as a B+ tree built on the primary key. This storage method is called an index-organized table. The leaf nodes of the primary key index tree store the complete data records.

3.1 Clustered Index

A clustered index defines the storage structure of the data in the table. In InnoDB the primary key index is the clustered index.

InnoDB requires every table to have a primary key (MyISAM does not). If none is defined, InnoDB automatically uses the first unique index whose columns are all NOT NULL as the primary key; if no such index exists, it generates a hidden 6-byte row ID field as the primary key.

Why InnoDB must have a primary key:

  1. The storage structure is an index-organized table.
  2. A secondary index lookup needs the primary key to go back to the table and fetch the record.

3.2 Secondary index (auxiliary index)

Non-primary-key indexes are also called secondary indexes. Their leaf nodes store the indexed columns together with the corresponding clustered index key, that is, the primary key. InnoDB appends the primary key columns after the columns in the index definition, removing duplicates: if the secondary index already contains some of the primary key columns (for a composite primary key), those columns are not stored again.

Compared with a primary key query, a query through a secondary index needs an extra step back to the table: after the primary key is obtained, it is looked up in the clustered index tree.

Using a covering index reduces the number of lookups back to the table. InnoDB also provides index condition pushdown (ICP) and Multi-Range Read (MRR) to reduce the number and cost of these lookups. The former pushes some query conditions (those on columns contained in the index and the primary key) down to the engine layer, so that InnoDB filters entries before going back to the table; the latter sorts the primary keys before going back to the table to reduce random IO (the lookups are still primary key queries performed row by row).
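The difference between a covering query and one that goes back to the table can be modelled with two small in-memory structures. This is only a conceptual sketch: the table contents, `clustered`, `idx_name`, and both function names are hypothetical, and a dict replaces the clustered B+ tree.

```python
import bisect

# Clustered index: primary key -> full row (hypothetical table).
clustered = {
    1: {"id": 1, "name": "ann", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "cat", "age": 41},
}

# Secondary index on name: sorted (name, primary key) entries.
idx_name = sorted((row["name"], pk) for pk, row in clustered.items())

def lookup_ids_by_name(name):
    """Like SELECT id ... WHERE name = ?: the index entry already contains
    the primary key, so the secondary index alone answers the query."""
    i = bisect.bisect_left(idx_name, (name,))
    ids = []
    while i < len(idx_name) and idx_name[i][0] == name:
        ids.append(idx_name[i][1])
        i += 1
    return ids

def lookup_rows_by_name(name):
    """Like SELECT * ... WHERE name = ?: every primary key found needs a
    second lookup in the clustered index to fetch the full row."""
    return [clustered[pk] for pk in lookup_ids_by_name(name)]
```

`lookup_ids_by_name` never touches `clustered`, which is the saving a covering index provides.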

3.3 How much data can a B+ tree store

InnoDB uses the B+ tree as its storage data structure, a balanced multi-way search tree optimized for random IO on external memory; it is an "N-ary" tree. N depends on the data page size and the size of an index entry.

In InnoDB a data page address pointer is 6 bytes, and a data page is 16 KB when page compression is not enabled. Assuming the primary key is an 8-byte bigint, each index entry occupies about 14 bytes. Ignoring the fixed overhead of a data page, each non-leaf page of the B+ tree can store about 16,384 / 14 ≈ 1170 index entries, which is rounded down to 1024 here.

Assuming that each record is 1 KB, each leaf data page can store 16 records.

A B+ tree of height 3 can therefore store 1024 × 1024 × 16 = 16,777,216 records, with a total size of 1024 × 1024 × 16 KB = 16 GB.
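The arithmetic can be checked with a short script. The constants restate the assumptions from the text (16 KB pages, 8-byte key plus 6-byte pointer, 1 KB rows); the names are made up for the example.

```python
PAGE_SIZE = 16 * 1024   # 16 KB data page
ENTRY_SIZE = 8 + 6      # bigint key + 6-byte page pointer
ROW_SIZE = 1024         # assumed 1 KB per record

fanout_exact = PAGE_SIZE // ENTRY_SIZE  # entries per non-leaf page, ignoring page overhead
fanout = 1024                           # rounded down, as in the text
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # records per leaf page

def capacity(height, fanout=fanout, rows_per_leaf=rows_per_leaf):
    """Rows addressable by a tree with `height` levels:
    (height - 1) non-leaf levels plus one leaf level."""
    return fanout ** (height - 1) * rows_per_leaf
```

With these numbers `capacity(3)` is 16,777,216 rows, and at 1 KB per row that is 16 GB of data. Using the unrounded fanout of 1170 would give about 21.9 million rows for the same height.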

Therefore a B+ tree in InnoDB is generally 1-3 levels high, which is enough to store tens of millions of rows. Reading one page during a search costs one IO, so a primary key lookup usually needs only 1-3 IOs to find the data. Since the root page is generally cached in memory, the actual number of IOs is even lower.

4. Hash Index

See the adaptive hash index (AHI) section of the article on InnoDB's overall architecture and key technologies.

5. Full text search

Full-text search is implemented with an inverted index. Given a word, an inverted index quickly returns the documents that contain it. Its common format is: key = word, value = document ID + word position.
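The key/value format above can be sketched in a few lines. This is a simplified illustration, not InnoDB's implementation: it splits on whitespace, lowercases everything, and records the word offset as the position; the function name and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

docs = {1: "InnoDB uses B+ tree", 2: "MyISAM uses B+ tree too"}
index = build_inverted_index(docs)
```

Looking up `index["uses"]` returns `[(1, 1), (2, 1)]`, that is, both documents and the position of the word in each, without scanning the document texts.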

InnoDB stores inverted indexes in auxiliary tables, and in order to improve parallelism, multiple auxiliary tables are used.

MySQL uses MATCH (col_name) AGAINST ('search') for full-text search, and by default the result set is sorted in descending order of relevance. The relevance is calculated from factors such as whether the word appears in the document, how many times it appears, and how many other documents contain the word.


Origin blog.csdn.net/cooper20/article/details/108636464