1. Understand the underlying structure and algorithm of MySQL indexes

1. Index

1.The nature of index

Index: a sorted data structure that helps MySQL retrieve data efficiently.

As a simple example, suppose a set of data is saved in a MySQL table in the following form.
When MySQL saves data to disk, rows are not necessarily stored at contiguous disk addresses; two adjacent rows in a table do not necessarily have adjacent disk addresses.
Insert image description here

Execute a query statement

SELECT * FROM TABLE WHERE TABLE.column2 = 15 

    Without an index, MySQL searches across all of the table data, starting from the first value of column 2 (which is 20) and checking rows one by one until it finds 15. Each row that is examined may require a disk I/O operation to read the stored data. If the table is large, the number of disk I/Os can be very high, and the query becomes slow and inefficient.
    Based on this scenario, we can design a simple index to solve the problem: save the data into a binary tree (MySQL does not actually use a binary tree as its index storage structure; this is only an example).
In a binary search tree, each node is greater than the nodes in its left subtree and less than the nodes in its right subtree.
    Following this rule, to search for 15 we first compare with the root node 20. Since 15 < 20, we go to the left of the root and continue. The next node is 15, which equals 15, so the value is found after only two comparisons.
Insert image description here
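The lookup described above can be sketched in a few lines of Python. This is only an illustration of a binary search tree, not of MySQL internals; the values other than 20 and 15 are made up for the example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the tree rooted at root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def search(root, key):
    """Return how many nodes were visited to find key, or None if it is absent."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return visited
        node = node.left if key < node.key else node.right
    return None


root = None
for value in [20, 15, 30, 10, 17]:  # hypothetical column 2 values
    root = insert(root, value)

print(search(root, 15))  # visits 20, then 15
```

Each node visited stands for one comparison (and, on disk, potentially one I/O), which is why the shape of the tree matters so much.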

2.mysql index structure

    So why doesn't mysql use binary trees?
    Let's take a look at the data in the first column of the table. Its values start from 1 and increase continuously. Under the rule above (smaller values go left, larger values go right), every new value becomes the right child of the previous one. You can try this on a data structure demonstration website.
Insert image description here
    The tree degenerates into a linked list. Searching for a value then becomes a full table scan.
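To see the degeneration in numbers, here is a minimal sketch that inserts the same keys into a plain, unbalanced binary search tree in two different orders and reports the resulting height. The helper name bst_height and the list-based node layout are choices made for this example only.

```python
def bst_height(keys):
    """Insert keys into an unbalanced binary search tree; return its height in levels."""
    root = None  # each node is a list: [key, left_child, right_child]
    height = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1
                side = 1 if key < node[0] else 2  # 1 = left, 2 = right
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        height = max(height, depth)
    return height


sequential = bst_height(range(1, 8))       # 1, 2, ..., 7 like an auto-increment id
mixed = bst_height([4, 2, 6, 1, 3, 5, 7])  # same keys, inserted in a balanced order
print(sequential, mixed)
```

With increasing keys the height equals the number of keys, which is exactly the linked-list behaviour described above.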
    Can it be improved into a red-black tree?
Insert image description here
    Although a red-black tree will not degenerate into a linked list, a real MySQL table does not hold just a few rows; it may hold tens of thousands, hundreds of thousands, or many more. A red-black tree is still a binary tree, so its height grows quickly with the amount of data. With 1 million rows the tree is at least about 20 levels high (and a red-black tree may be up to roughly twice that). If the value being searched for sits in a leaf node, the search has to visit that many nodes. So when the amount of data is very large, the benefit of a red-black tree is limited: the taller the tree, the more lookups are needed, which means more I/O and lower efficiency.
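As a rough check on how tall a binary tree gets, the sketch below computes two textbook bounds for n keys: the minimum number of levels of any binary tree, and the usual upper bound 2*log2(n+1) for a red-black tree. The function name is made up for this example, and the numbers are bounds, not measurements of MySQL.

```python
import math


def binary_tree_level_bounds(n):
    """Return (min_levels, max_red_black_levels) for a tree holding n keys."""
    min_levels = math.ceil(math.log2(n + 1))          # perfectly balanced binary tree
    max_rb_levels = math.floor(2 * math.log2(n + 1))  # classic red-black height bound
    return min_levels, max_rb_levels


low, high = binary_tree_level_bounds(1_000_000)
print(low, high)
```

Either way, a lookup in a binary tree with a million keys touches tens of nodes, and each node may cost one disk read.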

    Can we control the height of the tree so that it does not grow so tall?
    For that, it is necessary to understand the B-tree structure a little (see Baidu Encyclopedia).
=》After understanding the B-tree, here is an example《=

Insert image description here
A single node looks like this:
Insert image description here

In the picture above, the values in every node increase from left to right, no element is repeated, all leaf nodes have the same depth, and the pointers of the leaf nodes are null. Assume that the dark green cells with numbers hold the values of the first column of the MySQL table; think of it as the id column. The data cell below each value stores everything that belongs to it, that is, the whole row for that id. Take 15 as an example: the data part under 15 holds the row whose id is 15. This data structure solves the height problem of the red-black tree, but MySQL does not actually use it either. Instead, it uses a variant, the B+ tree.

=》B+ tree example《=

Insert image description here
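B+ tree: non-leaf nodes do not store data, only index keys (redundant copies), so each node can hold more keys. The leaf nodes contain all of the index fields. The leaf nodes are linked by pointers, which improves the performance of range access.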

The values in each node of the B+ tree, as shown in the figure below, can be read the same way as in a binary search tree: the values in the left subtree of key 20 are less than 20, and the values in its right subtree are greater than or equal to 20.
Insert image description here
The adjacent leaf nodes of the B+ tree are directly connected by pointers, and the values increase from left to right, that is, they are sorted.
Insert image description here
A node in the B+ tree is actually a page in MySQL. The default size of one page is 16 KB. (This size can be changed, but that is not recommended. Why 16 KB? Presumably it was chosen after analysis and testing, and it suits most scenarios.) In the figure, light green represents the address of the next page of data.
Insert image description here
Insert image description here
Then, when we want to find a value such as 50, MySQL loads one page at a time into memory. Because one page is one node of the B+ tree and the values within a node are sorted, the page can be searched with an algorithm such as binary search, which decides which child page to load next, until the value is either found or shown to be absent. When 50 is found, the data stored there may be the full row for that record, or it may be an address (this depends on the index type and the storage engine; the engine here is InnoDB, which is discussed below).
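The page-by-page search can be sketched as follows. This is a toy two-level tree built from Python dictionaries, with made-up keys and rows; it only illustrates "binary search inside a page, then descend", and says nothing about InnoDB's real page format.

```python
from bisect import bisect_left, bisect_right

# Hypothetical two-level tree. A non-leaf page holds sorted keys plus child pages:
# child i holds keys < keys[i]; the last child holds keys >= keys[-1].
leaf_a = {"keys": [15, 18], "rows": ["row15", "row18"]}
leaf_b = {"keys": [20, 30, 49], "rows": ["row20", "row30", "row49"]}
leaf_c = {"keys": [50, 56, 77], "rows": ["row50", "row56", "row77"]}
root = {"keys": [20, 50], "children": [leaf_a, leaf_b, leaf_c]}


def find(page, key):
    """Return (row, pages_loaded); row is None when the key is absent."""
    pages_loaded = 1
    while "children" in page:
        # Binary search inside the page picks the child page to load next.
        page = page["children"][bisect_right(page["keys"], key)]
        pages_loaded += 1
    i = bisect_left(page["keys"], key)
    if i < len(page["keys"]) and page["keys"][i] == key:
        return page["rows"][i], pages_loaded
    return None, pages_loaded


print(find(root, 50))
print(find(root, 19))
```

The number of pages loaded equals the height of the tree, regardless of how many keys each page holds.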

At this point we can roughly calculate how much data each node can hold. A non-leaf page is made up of key values and the addresses of the next-level pages.
Insert image description here
Assume the key 15 is of type int. An int in MySQL occupies 4 bytes, and assume the address of a next-level page occupies about 6 bytes.
16 KB = 16 * 1024 = 16384 bytes.
16384 / (4 + 6) ≈ 1638,
so for int keys, one page can hold approximately 1638 entries. If the B+ tree has three levels, how much data can it hold?
A leaf page is also 16 KB, but it contains the full row data as well as the int key. Estimate that each leaf page holds 16 rows (about 1 KB per row). How many rows can be stored in total?
1638 * 1638 * 16 = 42928704 (more than 40 million). A tree holding more than 40 million rows is only three levels high, so the number of I/O operations is greatly reduced.
So looking up data through an index is very efficient, and the root node is typically kept resident in memory.
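The estimate above can be reproduced with a short calculation. The constants mirror the assumptions in the text (4-byte int key, roughly 6-byte page address, 16 rows per leaf page); they are estimates for illustration, not exact InnoDB figures.

```python
PAGE_SIZE = 16 * 1024  # 16 KB page, in bytes
KEY_SIZE = 4           # int key
POINTER_SIZE = 6       # assumed size of a next-level page address
ROWS_PER_LEAF = 16     # assumed: about 1 KB per row

# Number of (key, pointer) entries that fit in one non-leaf page.
fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)


def max_rows(levels):
    """Rows reachable by a B+ tree with the given number of levels (levels >= 1)."""
    return fanout ** (levels - 1) * ROWS_PER_LEAF


print(fanout)       # 1638
print(max_rows(3))  # 42928704
```

Changing KEY_SIZE to a larger value (for example a long string key) shrinks the fanout, which is one way to see why compact keys keep the tree short.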

Why does MySQL choose the B+ tree over the B-tree?
Suppose a B-tree stores the same int keys. In a B-tree every node, not just the leaves, stores the row data for each of its keys. With a 16 KB node that can therefore hold only about 16 rows, how tall does the B-tree have to be to store the same 40 million rows?
16 * 16 * 16 * ... = 40 million.
The number of times 16 has to be multiplied is roughly the height of the tree. Since 16^6 ≈ 16.8 million and 16^7 ≈ 268 million, the B-tree needs about 7 levels, compared with 3 for the B+ tree.
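The "how many times must 16 be multiplied" question can be answered with a small loop. This follows the simplified model in the text, where a tree with a given fanout holds fanout^levels rows; the function name is made up for this example.

```python
def levels_needed(fanout, total_rows):
    """Smallest number of levels such that fanout ** levels >= total_rows."""
    levels = 0
    capacity = 1
    while capacity < total_rows:
        capacity *= fanout
        levels += 1
    return levels


b_tree_levels = levels_needed(16, 42_928_704)
print(b_tree_levels)
```

Under this model a node that holds only 16 entries needs more than twice as many levels as the three-level B+ tree estimated earlier, and every extra level is an extra page read per lookup.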

The height of the B+ tree depends on how many index keys (how many int values) fit in each non-leaf node. The more keys per node, the lower the tree. That is why only the leaf nodes store data and the other nodes store only index keys.

Of course, the B+ tree is not the only index data structure in MySQL; there are also HASH indexes.


2. Storage engine

1.MyISAM

MyISAM index files and data files are separate (non-clustered); see the explanation below.

We know that MySQL data is stored on disk, but where exactly?
We have a database:
Insert image description here
There are two tables in the study_mysql database; the storage engine of myisam_table is MyISAM:

CREATE TABLE `myisam_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

If the mysql configuration is not modified, these databases and tables are saved to the data folder by default.
Insert image description here
The three files in the figure below correspond to the files related to the myisam_table table.
Insert image description here
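Among them:

  • .frm stores the table structure (table definition). This file exists in MySQL 5.7 and earlier; MySQL 8.0 no longer uses .frm files.
  • .MYD stores the table data.
  • .MYI stores the table's index data.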

=》MyISAM data saving and query example《=
Insert image description here

The data is the content of the table below and is saved in the .MYD file.
Insert image description here

The index, that is, the data of the B+ tree is stored in the .MYI file.

Insert image description here
To query the row whose column 1 value equals 30, MySQL first searches for 30 in the .MYI file, and then uses the address saved in the leaf node to find the corresponding row in the .MYD file.

Non-clustered index: simply put, the leaf nodes do not store the full row data for each index entry. Compare this with InnoDB below.

2.InnoDB

After understanding how MyISAM saves and reads data, you should have some feel for the concept of non-clustered.
Now consider a table that uses the InnoDB engine,
still in the study_mysql database.

There are two tables in the database; the storage engine of innodb_table is InnoDB:

CREATE TABLE `innodb_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Insert image description here
Among them:

  • .frm stores the table structure (table definition).
  • .ibd stores the indexes and the data.

Insert image description here
The table data file itself is an index structure organized as a B+ tree.
Clustered index (the primary key index): the leaf nodes themselves contain the complete data records.
As mentioned before, leaf nodes do not always store the full row; they may store something that locates the row instead. In an InnoDB non-primary-key index, the leaf nodes store the primary key value. If the leaf nodes store the full rows, the index is a clustered index; an index whose leaf nodes contain only the primary key is a non-clustered index. In MyISAM, the primary key index and non-primary-key indexes have the same structure: both are non-clustered, and their leaf nodes store the address of the row.

3. Why is it recommended to define a primary key for InnoDB tables, and to make it an auto-increment int?

    If a primary key is specified when the table is created, MySQL maintains the B+ tree according to that primary key. If you do not specify one, InnoDB looks for a column whose values are unique and not null (a unique, NOT NULL index) and uses it instead. If there is none, it creates a hidden column of its own, whose values correspond one-to-one with the rows of your table, and uses that hidden column to build the B+ tree.

Then you may have a few questions:
    1. Why do I have to create the primary key myself when MySQL can do it for me?
    Think of a task you already know you have to do. If you communicate and prepare in advance, it might take 2 hours. If you skip that step, someone has to spend an extra hour working out what you wanted, and it takes 3 hours in total. Companies and individuals both care about efficiency, so why make the program do extra work for something that a single primary key definition settles? This applies to this scenario only; in other cases we do want programs to do work for us, so do not read too much into it.
    2. Why is the int type recommended?
    Think about the search algorithms you have learned: most of them come down to comparing numbers. Comparing strings means comparing them character by character, which costs more than comparing two integers, and string keys also take more space, so fewer keys fit in a page. It is like the example above: a job that could be done in 2 hours ends up taking 3. The performance impact is usually not dramatic, otherwise MySQL would probably not allow string primary keys at all. Still, int is recommended here. If a scenario really requires a string key, the decision is yours.
    3. Why auto-increment?
    The demo is on the same data structure demonstration website as before. Try it yourself: first insert continuously increasing values, then insert values in no particular order, and watch the two processes.
Insert image description here
In this process you can see that with auto-increment keys, each new value is simply added at the last node. When that node is full, a new node is added after it and the parent node is updated. If the keys are not increasing, for example inserting 3.5 between 3 and 4, the value cannot just be appended at the end; the node may have to be split and the tree rebalanced. This is less efficient than auto-increment.
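The difference between appending and splitting can be imitated with a toy model of sorted leaf pages. Everything here is a simplification invented for illustration (a page capacity of 4, a split that moves the upper half of a full page, and an append rule for keys that land past the end); it is not how InnoDB manages pages in detail.

```python
from bisect import bisect_right, insort


def insert_all(keys, page_capacity=4):
    """Insert keys into sorted leaf pages; return (page_count, records_moved)."""
    pages = [[]]
    moved = 0
    for key in keys:
        # Locate the page whose key range should contain the new key.
        firsts = [p[0] for p in pages if p]
        idx = max(bisect_right(firsts, key) - 1, 0)
        page = pages[idx]
        if len(page) < page_capacity:
            insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end: open a fresh page, nothing is moved.
            pages.append([key])
        else:
            # Page is full and the key lands inside it: split the page in half.
            half = page_capacity // 2
            new_page = page[half:]
            del page[half:]
            moved += len(new_page)
            pages.insert(idx + 1, new_page)
            insort(page if key < new_page[0] else new_page, key)
    return len(pages), moved


increasing = insert_all(range(1, 17))       # like an auto-increment key
out_of_order = insert_all(range(16, 0, -1)) # every key lands before the existing ones
print(increasing, out_of_order)
```

In this model the increasing keys fill pages completely and move no records, while the out-of-order keys trigger repeated splits and leave pages half empty.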

4. The difference between innodb’s primary key index and non-primary key index (secondary index)

Each InnoDB table has exactly one primary key index, and it is the index whose leaf nodes store the full row data.
For a non-primary-key index, the leaf nodes store only the corresponding primary key value. Such an index is built on the fields you specify, for example an index on the name column.
When querying by name, MySQL goes through the secondary index to find the primary key that corresponds to the name, and then goes to the primary key index to fetch the full row. This is what is often called going back to the table: find the id through the secondary index, then look up the row by id. (The example here is a single-column index, that is, an index on one field. Creating many single-column indexes is not recommended; this is just an example.)
Insert image description here
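The two-step lookup can be sketched with two Python dictionaries. The rows and names are made up, and dictionaries are hash tables rather than B+ trees, so this only illustrates which structure holds what, not how the search inside each index works.

```python
# Clustered (primary key) index: id -> full row.
primary_index = {
    1: {"id": 1, "name": "Tom", "age": 19},
    2: {"id": 2, "name": "Ann", "age": 25},
    3: {"id": 3, "name": "Tom", "age": 31},
}

# Secondary index on name: name -> primary key values only.
name_index = {"Ann": [2], "Tom": [1, 3]}


def select_by_name(name):
    """Step 1: secondary index gives the ids. Step 2: go back to the primary index."""
    ids = name_index.get(name, [])
    return [primary_index[i] for i in ids]


print(select_by_name("Tom"))
```

Note that the secondary index never stores age; any column outside the index costs the second lookup.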

Why does the secondary index only store primary key information? Why not put all the data in leaf nodes?

  1. Save space. If the table holds a lot of data and has several secondary indexes, storing the full row in each of them would duplicate the data many times over.
  2. Maintain consistency. When a row is inserted or changed, every index has to be maintained. If the full row lives only in the primary key index, there is just one copy to keep up to date, which reduces maintenance cost.

5. Joint index

1. Joint primary key structure
Suppose there is an employee table with the columns name, age, job, and join_time, where name, age, and job together form the joint primary key.
Insert image description here
The leaf nodes then store the remaining column, join_time.

2. Joint index structure (not a joint primary key)
If a secondary index is created on the columns name, age, and job, the structure is as follows. The value stored in the leaf nodes is the primary key.
Insert image description here
The leftmost prefix rule comes into play here.
That is, when you create the index

KEY `idx_name_age_job` (`name`,`age`,`job`)

The index is ordered by the fields you specify, with priority running from the leftmost column: entries are sorted by name first; where the names are equal they are sorted by the second column, age; and so on. That is what the figure above shows.
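Python tuples compare element by element from left to right, which is the same ordering rule, so sorting a list of (name, age, job) tuples gives a feel for how the index entries are laid out. The rows below are made up for the example.

```python
rows = [
    ("Tom", 25, "dev"),
    ("Ann", 30, "qa"),
    ("Tom", 19, "ops"),
    ("Ann", 22, "dev"),
    ("Tom", 19, "dev"),
]

# Sorted by name first, then age, then job, like idx_name_age_job.
index_entries = sorted(rows)
for entry in index_entries:
    print(entry)

names = [e[0] for e in index_entries]
ages = [e[1] for e in index_entries]
```

In the sorted result the names are in order overall, while the ages are in order only within each name.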

=》Example of the leftmost prefix of an index《=

1. 
select * from table where name = 'Tom' and age = 19
2.
select * from table where age = 19
3.
select * from table where age = 19 and job = 'dev'
4.
select * from table where job = 'dev'

Which of the 4 statements above will use the index, and which will not? (The index here is the joint index in the figure above.)
Result: only the first statement uses the index.
Why only the first one?
Because the index is ordered by name, then age, then job.
First, a range can be located by name; every entry in this range belongs to Tom.
Second, within this range age is also ordered, so the range can be narrowed further by age.
Finally, the narrowed range is exactly the data we want.

Other situations:
Take statement 2, select * from table where age = 19, as an example.
It filters on age = 19 directly, without a condition on name.
In that case age is effectively unordered in the index. Why? Didn't we just say the entries are sorted by priority, with age second? That is correct, but age is ordered only among entries that share the same name. Looking at the whole table, is it still in order?
Insert image description here
The result is obvious: across the whole table age is unordered; it is ordered only within each name. So MySQL can only check the rows one by one and will not use this index. The same reasoning applies to the other statements: either you cannot narrow the range step by step, or you cannot establish a range in the first place. All that remains is a full table scan, so the index is not used.
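The contrast between a prefix search and a full scan can be sketched with a sorted list of (name, age, job) tuples and the bisect module. The entries and function names are invented for this example; the point is only that a binary search needs the leading column, while a condition on age alone leaves nothing to search by.

```python
from bisect import bisect_left

# Entries of a hypothetical (name, age, job) index, kept in sorted order.
index_entries = sorted([
    ("Ann", 19, "qa"),
    ("Ann", 30, "dev"),
    ("Tom", 19, "dev"),
    ("Tom", 19, "ops"),
    ("Tom", 25, "dev"),
])


def seek_name_age(name, age):
    """name = ? AND age = ?: binary search for the (name, age) prefix."""
    lo = bisect_left(index_entries, (name, age))
    hi = bisect_left(index_entries, (name, age + 1))
    return index_entries[lo:hi]


def scan_age(age):
    """age = ? alone: no usable prefix, so every entry has to be checked."""
    return [e for e in index_entries if e[1] == age]


print(seek_name_age("Tom", 19))
print(scan_age(19))
```

The entries matching name and age form one contiguous slice, whereas the entries with age 19 are scattered across the list, so there is no single range to jump to.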



Origin blog.csdn.net/xiaobai_july/article/details/132611619