MySQL performance optimization: underlying data structures and algorithms

1 Index

Index
According to the MySQL official website, an index is a sorted data structure that helps MySQL retrieve data efficiently.

Index data structure:

  • binary tree
  • red black tree
  • Hash table
  • B-Tree

Example: a table t has two columns (col1, col2) and seven rows.
Suppose our SQL query is:

select * from t where t.col2=89;

Without an index, MySQL has to read col2 row by row and compare each value with 89 until it finds a match.

The rows of a MySQL table are not necessarily stored next to each other on disk. If one row is saved today and the next row a few days later, other data may have been written to the disk in between, so the rows end up scattered. When a query runs against a large table, every row fetched from disk costs one disk I/O, and only after the fetch can the row be compared with the value we are looking for, so performance is very poor. The goal is to reduce the number of disk I/Os needed to find the data; as long as that number stays within a small bound, efficiency improves greatly. This is why indexes exist.

Now add an index on col2. As mentioned earlier, an index is a data structure, for example a binary tree.

Binary tree
We put the col2 values into a binary tree in which the left child is smaller than its parent and the right child is greater than its parent.
A search for 89 then takes only two steps. The first step reads the root, 34, which is not the value we want; since 89 is greater than 34, the search continues in the right child of 34, and the second step finds 89.

In the tree above, each node stores a key/value pair: the key is a value of the col2 field (34, 77...89, 23), and the value is the disk file address of the row that the key belongs to.

In fact, the underlying structure of a MySQL index is not a binary tree, for the following reason.

Suppose we query on col1 like this:

select * from t where t.col1=6;

With a binary tree, the tree built for col1 degenerates into a single chain, because every new value is larger than all the previous ones.

The binary tree is then equivalent to a linked list: finding col1=6 still takes 6 steps, so query efficiency does not improve. In other words, if the index uses a binary tree and the column holds incrementing data, the binary tree does not help, which is why the underlying index structure is not a plain binary tree.
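
The degeneration described above can be reproduced with a small sketch. It assumes col1 holds the incrementing values 1 to 7 (the exact values are an assumption, chosen to match the "6 steps" in the text), and the names Node, insert and count_visits are illustrative only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def count_visits(root, key):
    """Return how many nodes are visited until key is found or the search ends."""
    visits = 0
    node = root
    while node is not None:
        visits += 1
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return visits


# Assumed col1 values: 1..7, inserted in increasing order.
root = None
for value in range(1, 8):
    root = insert(root, value)

visits_for_6 = count_visits(root, 6)
print(visits_for_6)  # 6: every node only has a right child, like a linked list
```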

Red-black tree
A red-black tree is a self-balancing binary search tree: it rebalances itself automatically as data is inserted. With the red-black tree built for col1, finding col1=6 takes 3 steps.

The underlying structure of a MySQL index is not a red-black tree either, for the following reason:
The height of the tree is not under control: when the table holds a lot of data, such as 5 million rows, the tree becomes very tall. If the height reaches 20, for example, and the data we want sits in a leaf node at the bottom, at least 20 lookups, and therefore 20 disk I/Os, are needed. What we need is a much shorter tree, with a height of at most 3 or 4, which is acceptable. This leads to the B-tree.
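
As a rough check of the height argument, the sketch below counts the fewest levels any binary tree needs for 5 million keys, assuming one key per node. The function name is illustrative; a real red-black tree can be somewhat taller than this lower bound.

```python
def min_binary_tree_levels(n_keys):
    """Fewest levels a binary tree needs to hold n_keys, one key per node."""
    levels = 0
    capacity = 0
    while capacity < n_keys:
        capacity += 2 ** levels  # level i holds at most 2**i nodes
        levels += 1
    return levels


rows = 5_000_000
levels = min_binary_tree_levels(rows)
print(levels)  # 23
```

So the figure of about 20 in the text is the right order of magnitude: even a perfectly balanced binary tree needs more than 20 levels for this table.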

B-tree

  • All leaf nodes have the same depth, and the pointers of leaf nodes are null
  • Index elements are not repeated
  • The index elements in a node are arranged in ascending order from left to right

A node of the red-black tree above holds only one index element, whereas a B-tree node holds several (the tree expands horizontally), so the tree is much shorter.
The underlying structure of a MySQL index is not a plain B-tree either, but an optimized variant of it: the B+ tree.

B+ tree (B-tree variant)

  • Non-leaf nodes do not store data; they store only index elements (redundant copies), so each node can hold more of them
  • Non-leaf nodes are called redundant indexes: they copy some index elements from the leaf nodes and exist only to build the B+ tree
  • Leaf nodes contain all index elements
  • Leaf nodes are linked by pointers (B-tree leaves are not), each storing the disk location of the neighboring leaf node, which improves the performance of range access

In the figure below, each node (drawn as one row of index elements) is called a page.

[Figure: B+ tree whose root page holds 15, 56, 77 and whose first second-level page holds 15, 20, 49]
The underlying structure of a MySQL index is a B+ tree.
To search for col1=30, MySQL first loads the whole root page (15, 56, 77) from disk into memory (RAM), which is the relatively time-consuming part, and then compares 30 with those elements in memory, which is fast: a binary search quickly determines that 30 lies between 15 and 56. The page that 15 points to (15, 20, 49) is then loaded into memory and compared with 30 in the same way, and so on.
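
The page-by-page lookup can be sketched as follows. Only the root page (15, 56, 77) and the page (15, 20, 49) come from the text; every other key, the row labels, and the names Page, leaf and search are hypothetical, and each step down the tree stands for one page load.

```python
from bisect import bisect_left, bisect_right


class Page:
    """One B+ tree page: internal pages have children, leaf pages have rows."""

    def __init__(self, keys, children=None, rows=None):
        self.keys = keys
        self.children = children
        self.rows = rows


def leaf(keys):
    # Hypothetical row payload: just a label derived from the key.
    return Page(keys, rows=[f"row-{k}" for k in keys])


root = Page(
    [15, 56, 77],
    children=[
        Page([15, 20, 49], children=[leaf([15, 18]), leaf([20, 30]), leaf([49, 50])]),
        Page([56, 62], children=[leaf([56, 60]), leaf([62, 70])]),
        Page([77, 85], children=[leaf([77, 80]), leaf([85, 90])]),
    ],
)


def search(page, key):
    """Return (row, pages_read) for key, or (None, pages_read) if absent."""
    pages_read = 1
    while page.children is not None:
        # Binary search inside the page: last slot whose key is <= the target.
        slot = max(bisect_right(page.keys, key) - 1, 0)
        page = page.children[slot]
        pages_read += 1
    position = bisect_left(page.keys, key)
    if position < len(page.keys) and page.keys[position] == key:
        return page.rows[position], pages_read
    return None, pages_read


print(search(root, 30))  # ('row-30', 3)
```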

Then why not drop the other nodes, keep only the leaf nodes, put all the data in them, load them into memory in one go, and run a binary search for 30 directly in memory? Because with a large amount of data, it would not fit in memory.

The size of each page is 16 KB by default.

# Check the MySQL page size: 16384 bytes (16 KB)
SHOW GLOBAL STATUS LIKE 'Innodb_page_size';

How much data can be stored after the B+ tree is full?
Why 16KB?
If the index column is a BIGINT (8 bytes), each index element occupies 8 bytes. The slot between 15 and 56 in the figure above holds the address of a page on the next level (the page 15, 20, 49), and this address occupies 6 bytes. A full 16 KB page can therefore hold about 16 KB / (8 + 6) B ≈ 1170 index elements. Leaf nodes are different. Take the leaf element 15 as an example: next to the index value 15, the leaf may store the disk address of the row, or it may store all the other columns of the row, which can be relatively large. Assuming a row takes 1 KB (a row generally does not exceed 1 KB), a leaf node can hold about 16 KB / 1 KB = 16 rows (since rows are usually smaller than 1 KB, 16 is a conservative assumed value, not a measured one).

To sum up, a full B+ tree of height 3 can hold about:
1170 × 1170 × 16 = 21,902,400 rows, that is, more than 20 million, while the height of the tree is only 3, so the data can be found with 3 I/Os.
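
The arithmetic can be rechecked with a few lines. The sizes are the ones assumed in the text (16 KB pages, 8-byte keys, 6-byte page addresses, 1 KB rows); the constant names are illustrative.

```python
PAGE_SIZE = 16 * 1024  # bytes in one page (16 KB)
KEY_SIZE = 8           # bytes for one BIGINT index element
POINTER_SIZE = 6       # bytes for one page address
ROW_SIZE = 1024        # assumed upper bound for one row (1 KB)

# Index elements per non-leaf page and rows per leaf page.
fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
rows_per_leaf = PAGE_SIZE // ROW_SIZE

# Height 3: root page -> second-level pages -> leaf pages.
capacity = fanout * fanout * rows_per_leaf

print(fanout, rows_per_leaf, capacity)  # 1170 16 21902400
```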

In practice, the root node of a MySQL index stays resident in memory (15, 56, and 77 in the figure above are already in memory from the start), so a lookup actually needs 2 I/Os rather than 3. In newer versions of MySQL, all non-leaf nodes are kept in memory, which is faster still.

Why does the underlying structure of a MySQL index use a B+ tree instead of a B-tree?
As shown above, a B+ tree that stores 20 million rows has a height of only 3. What about a B-tree?

In a B-tree, every node stores the row data together with the index element. With rows of up to 1 KB and pages of 16 KB, each page can hold only 16 index elements, so the tree needs a height n such that 16 to the power of n reaches 20 million. This n is clearly much greater than the B+ tree's height of 3.
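
Under the same simplified model as the text (16 elements per B-tree page, and 1170 elements per non-leaf page plus 16 rows per leaf page for the B+ tree), the two heights can be compared with a short sketch. The function name is illustrative, and the model ignores the rows a B-tree also holds in its upper levels.

```python
def levels_needed(fanout, target):
    """Smallest n such that fanout ** n >= target."""
    levels = 0
    reachable = 1
    while reachable < target:
        reachable *= fanout
        levels += 1
    return levels


target_rows = 20_000_000

# B-tree in the text's model: 16 index elements (with their rows) per page.
btree_height = levels_needed(16, target_rows)

# B+ tree: one leaf level of 16 rows per page, reached through non-leaf
# levels with 1170 index elements per page.
leaf_pages = -(-target_rows // 16)  # ceiling division
bplus_height = levels_needed(1170, leaf_pages) + 1

print(btree_height, bplus_height)  # 7 3
```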

Tables and indexes are stored on disk. Unless the configuration is changed, they are kept in MySQL's default data directory (the location reported by the datadir system variable).

2 MySQL table storage engines

2.1 Storage Engine Introduction

Does a storage engine apply to a database or to a table? To a table.

When we create a table with Navicat, a MySQL client, we can choose its storage engine. The engine usually selected is InnoDB; earlier versions of MySQL used the MyISAM storage engine.

2.2 MyISAM storage engine (rarely used now)

Create a new table with MyISAM as the storage engine. The table is stored in the following files:

  • .frm: stores the table structure (frm is short for frame)
  • .MYD: stores the data (MY stands for MyISAM, D for data)
  • .MYI: stores the index (MY stands for MyISAM, I for index)

MyISAM index files and data files are separate (non-clustered)

Suppose the query is:

select * from t where t.col1=30;

MySQL first locates the index element 30 in the index tree of the .MYI file, reads the disk file address stored with it (0xF3), and then uses that address to fetch the row from the .MYD file.
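
The two-step MyISAM lookup can be pictured with a minimal sketch in which plain dictionaries stand in for the two files. Only the mapping 30 → 0xF3 comes from the text; the other addresses and all row contents are hypothetical, and the real .MYI file is a B+ tree rather than a dictionary.

```python
# Stand-in for the .MYI file: index element -> address of the row in the .MYD file.
# Only 30 -> 0xF3 comes from the text; the other entries are hypothetical.
myi_index = {15: 0x07, 30: 0xF3, 56: 0x90}

# Stand-in for the .MYD file: address -> row (hypothetical row contents).
myd_rows = {
    0x07: (15, 34, "Bob"),
    0xF3: (30, 56, "Eric"),
    0x90: (56, 12, "Tom"),
}


def select_by_col1(value):
    """Step 1: find the address in the index. Step 2: fetch the row by address."""
    address = myi_index.get(value)
    if address is None:
        return None
    return myd_rows[address]


print(select_by_col1(30))  # (30, 56, 'Eric')
```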

2.3 InnoDB storage engine

Create a new table with InnoDB as the storage engine. The table is stored in the following files:

  • .frm: stores the table structure (frm is short for frame)
  • .ibd: stores both the data and the indexes

InnoDB index implementation (clustered)

  • The table data file itself is an index structure file organized by B+Tree
  • Clustered index - leaf nodes contain complete data records

[Figure: InnoDB clustered index whose leaf nodes hold complete rows]
In the figure above, each leaf node stores the other columns of its row next to the index element. For example, the leaf element 15 stores all the columns of the row 15, 34, Bob (clustered index).

That is, in InnoDB the data and the index live in the same tree (same file, clustered index), while in MyISAM they do not (non-clustered index).

Which is faster, a clustered index or a non-clustered index?
The clustered index is faster, because it does not need a second lookup in a separate file.

Why is it recommended to create a primary key for InnoDB tables, and to make it an auto-increment integer?

The .ibd file must be organized as a B+ tree, so where does this B+ tree come from? If the table has a primary key, InnoDB uses the primary key column to build the B+ tree that holds the whole table. What if there is no primary key? InnoDB then looks for a column whose values are unique and non-null (the first unique index on NOT NULL columns) and uses it to organize the B+ tree. If no such column exists, InnoDB creates a hidden column that maintains a unique row ID and organizes the whole table by it.

To sum up: if we define the primary key ourselves, MySQL does not have to do all this extra work.

Then why should the primary key be an auto-increment integer?

Reasons for using an integer

  • An index lookup performs size comparisons in the B+ tree. A UUID is a string, so it has to be compared character by character according to ASCII order, whereas integers are compared directly, which is more efficient
  • An integer also takes less space

Reasons for auto-increment
Let's first look at the hash structure.

When creating an index, the default structure is the B+ tree, but a hash structure can also be chosen.
Hash structure

  • A hash of the index key is computed to locate where the data is stored
  • In many cases, a hash index is more efficient than a B+ tree index
  • It only supports "=" and "IN"; range queries are not supported
  • Hash collisions have to be handled

Suppose col3 (the name column) has a hash index. When a row is inserted, a hash function (such as MD5, among many others) is applied to the value, and the entry is placed in the hash bucket (hash array) slot for the resulting hash value. If two values produce the same hash value, there is a hash collision, and a linked list is used to hold the entries that share that hash value. To find the row whose name=Alice, for example, we first hash Alice to get the hash value and then traverse the corresponding linked list. Besides the index element, each node of the linked list stores the disk file address of the row it belongs to.
A hash lookup like this looks faster, so why does MySQL use the B+ tree rather than the hash structure? The main reason is that a hash index only supports = and IN, not range queries. The leaf nodes of a B+ tree are linked by bidirectional pointers and its elements are sorted, so it supports range queries.
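
A minimal sketch of a hash index with chaining shows both sides of this trade-off. The toy hash function, the bucket count, the names other than Alice, and the addresses are all hypothetical; Python lists stand in for the linked lists.

```python
NUM_BUCKETS = 8


def bucket_of(key):
    # Toy deterministic hash (an assumption, not what MySQL uses).
    return sum(key.encode("utf-8")) % NUM_BUCKETS


# Each bucket is a chain of (index element, disk address) entries.
buckets = [[] for _ in range(NUM_BUCKETS)]


def put(key, address):
    buckets[bucket_of(key)].append((key, address))


def get(key):
    """Equality lookup: hash once, then walk only the matching chain."""
    return [address for k, address in buckets[bucket_of(key)] if k == key]


def range_scan(low, high):
    """Range lookup: hashing destroys order, so every bucket must be visited."""
    return sorted(k for chain in buckets for k, _ in chain if low <= k < high)


for name, address in [("Alice", 0x10), ("Bob", 0x20), ("Jim", 0x30), ("Eric", 0x40)]:
    put(name, address)

print(get("Alice"))          # [16]
print(range_scan("A", "C"))  # ['Alice', 'Bob']
```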

Non-auto-increment key: a new value can fall in the middle of an already full node, so the node has to split and the tree has to be rebalanced.
Auto-increment key: a new value always goes at the end, so existing nodes are not split; when the last node is full, a new node is simply created.
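
The difference can be illustrated with a simplified simulation of the leaf level only. It assumes a hypothetical page capacity of 4 keys and a fixed out-of-order key sequence, and it is a sketch of the idea rather than InnoDB's actual split algorithm.

```python
from bisect import insort

PAGE_CAPACITY = 4  # hypothetical number of keys per leaf page


def insert_all(keys):
    """Insert keys into sorted leaf pages; return (pages, number_of_splits)."""
    pages = [[]]
    splits = 0
    for key in keys:
        # Pick the first page whose largest key is >= key; default to the last page.
        index = len(pages) - 1
        for i, page in enumerate(pages):
            if page and key <= page[-1]:
                index = i
                break
        page = pages[index]
        if len(page) < PAGE_CAPACITY:
            insort(page, key)
        elif index == len(pages) - 1 and key > page[-1]:
            # Appending past the end: start a fresh page, nothing is moved.
            pages.append([key])
        else:
            # Inserting into the middle of a full page: split it in half.
            half = PAGE_CAPACITY // 2
            left, right = page[:half], page[half:]
            insort(left if key < right[0] else right, key)
            pages[index:index + 1] = [left, right]
            splits += 1
    return pages, splits


sequential_keys = list(range(1, 13))
shuffled_keys = [7, 2, 11, 5, 9, 1, 12, 4, 8, 3, 10, 6]

_, sequential_splits = insert_all(sequential_keys)
_, shuffled_splits = insert_all(shuffled_keys)
print(sequential_splits, shuffled_splits)  # 0 3
```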

Why do the leaf nodes of a non-primary-key (secondary) index store the primary key value? (Consistency and saving storage space)
For example, after indexing col3, the leaf element Alice stores the primary key value 18.
A lookup through the secondary index first finds the primary key value, and then finds the full row through the primary key index (this extra step is the secondary index's "back to the table" operation).
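
The "back to the table" step can be sketched with two dictionaries standing in for the two B+ trees. The row 15, 34, Bob and the mapping Alice → 18 come from the text; the col2 value of row 18 and the function name are hypothetical.

```python
# Clustered (primary key) index: primary key -> complete row.
primary_index = {
    15: {"col1": 15, "col2": 34, "col3": "Bob"},
    18: {"col1": 18, "col2": 77, "col3": "Alice"},  # col2 value is hypothetical
}

# Secondary index on col3: indexed value -> primary key value only.
secondary_index = {"Alice": 18, "Bob": 15}


def find_by_col3(name):
    """Look up the primary key in the secondary index, then go back to the table."""
    primary_key = secondary_index.get(name)
    if primary_key is None:
        return None
    return primary_index[primary_key]


print(find_by_col3("Alice")["col1"])  # 18
```

Because the secondary index holds only primary key values, the row itself exists in one place, which is where the consistency and space savings mentioned above come from.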

3 Joint index (composite index)

Creating many single-column indexes on a table is not recommended; in general, 2 to 3 joint indexes cover more than 80% of the SQL queries.

Create a joint primary key index on three fields: name, age, position.

Entries are sorted in the order in which the index columns were declared: name is compared first, then age, then position, and the sorted entries are placed in the index tree. If name is a string type, it is compared character by character in ASCII order. When name alone decides the order, age and position are not considered. If the names are equal (both are Bill, say), age is compared; if the ages are also equal, position is compared. Because this is a joint primary key, the three fields can never all be equal at the same time.
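
Python tuples compare in the same column-by-column way, so sorting a few (name, age, position) tuples gives a feel for the order of the index entries. The rows below are hypothetical sample data.

```python
# Hypothetical (name, age, position) rows, in arbitrary insertion order.
employees = [
    ("Bill", 31, "dev"),
    ("Alice", 30, "manager"),
    ("Bill", 30, "manager"),
    ("Bill", 30, "dev"),
    ("Alice", 25, "dev"),
]

# Tuples compare element by element: name first, then age, then position.
index_order = sorted(employees)

for entry in index_order:
    print(entry)
# ('Alice', 25, 'dev')
# ('Alice', 30, 'manager')
# ('Bill', 30, 'dev')
# ('Bill', 30, 'manager')
# ('Bill', 31, 'dev')
```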

Leftmost-prefix principle

Given the joint index above, which of the following statements use the index?

# Uses the index
SELECT * FROM employees WHERE name = 'Bill' and age = 31;
# Does not use the index
SELECT * FROM employees WHERE age = 30 AND position = 'dev';
# Does not use the index
SELECT * FROM employees WHERE position = 'manager';

A joint index has to be used in the order in which its columns were declared. But why does the leftmost-prefix principle exist, and why must queries follow the order name, age, position to use the index?

The entries in the index tree are sorted, and the sort order follows name, age, position, the column order given when the index was created.

If a query breaks the leftmost-prefix principle and filters directly on age=30, the index cannot help: across the whole index the age values are not in order (they are only ordered among entries with the same name), so the whole table has to be scanned.
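
A sorted list of tuples can stand in for the joint index to show why the prefix matters. The sample rows and function names are hypothetical; binary search works on the (name, age) prefix because that prefix is sorted, while a filter on age alone has to look at every entry.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index entries, kept sorted by (name, age, position).
index_entries = sorted([
    ("Alice", 40, "dev"),
    ("Bill", 30, "dev"),
    ("Bill", 31, "manager"),
    ("Carl", 30, "manager"),
])

# The leftmost prefix (name, age) is itself in sorted order.
prefixes = [(name, age) for name, age, _ in index_entries]
# The age column on its own is not.
ages = [age for _, age, _ in index_entries]


def lookup_by_name_and_age(name, age):
    """Leftmost prefix: binary search narrows the range directly."""
    low = bisect_left(prefixes, (name, age))
    high = bisect_right(prefixes, (name, age))
    return index_entries[low:high]


def scan_by_age(age):
    """No leftmost prefix: every entry has to be checked."""
    return [entry for entry in index_entries if entry[1] == age]


print(lookup_by_name_and_age("Bill", 31))  # [('Bill', 31, 'manager')]
print(ages)                                # [40, 30, 31, 30]
print(scan_by_age(30))                     # [('Bill', 30, 'dev'), ('Carl', 30, 'manager')]
```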

Origin blog.csdn.net/qq_33417321/article/details/121130741