[MySQL] Index and its B+ tree



 Table of contents

1. Initial understanding of indexes and construction of test data

2. Disk

3. MySQL, OS, and disk interaction methods (InnoDB storage engine)

4. Understanding indexes and pages in MySQL

1. Why does MySQL use pages for disk IO instead of loading only as much as it needs?

2. How MySQL manages pages

2.1 First, a flawed page structure

2.2 Correct data structure inside a single page

2.3 Correct data structure between multiple pages

2.4 Characteristics of B+ trees 

2.5 MySQL primary key

2.6 Why is B+ tree better than other data structures for indexing?

2.7 Clustered index and non-clustered index

2.7.1 MyISAM’s auxiliary (ordinary) index

2.7.2 InnoDB’s auxiliary (ordinary) index

5. Index operations

1. Create an index

1.1 Create primary key index

1.2 Create a unique index (a normal index)

1.3 Create ordinary index/composite index

1.4 Create full-text index

2. Query index

3. Delete index

4. What fields should be indexed? 


1. Initial understanding of indexes and construction of test data

Index: a structure that improves the database's search performance. The gain in query speed comes at the cost of slower inserts, updates, and deletes, because every write must also maintain the index, which adds IO. Its value therefore lies in speeding up retrieval over massive amounts of data.

Common indexes are divided into:

primary key index

unique index

Ordinary index (index)

Full-text index (fulltext): solves the problem of searching for words inside large text fields.

Build a test table with 8,000,000 records:

mysql> source /home/jly/index_data.sql;
Query OK, 0 rows affected (32 min 14.69 sec)
-- Look at the first 5 rows
mysql> select * from EMP limit 5;
+--------+--------+----------+------+---------------------+---------+--------+--------+
| empno  | ename  | job      | mgr  | hiredate            | sal     | comm   | deptno |
+--------+--------+----------+------+---------------------+---------+--------+--------+
| 100002 | YPdZKD | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    377 |
| 100003 | YJmqTw | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    288 |
| 100004 | yIUxHR | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    127 |
| 100005 | JIrHnr | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    455 |
| 100006 | xFJFYc | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    185 |
+--------+--------+----------+------+---------------------+---------+--------+--------+
5 rows in set (0.03 sec)

It takes 25 seconds to find employee information with employee number 998877:

mysql> select * from EMP where empno=998877;
+--------+--------+----------+------+---------------------+---------+--------+--------+
| empno  | ename  | job      | mgr  | hiredate            | sal     | comm   | deptno |
+--------+--------+----------+------+---------------------+---------+--------+--------+
| 998877 | HJxoaj | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    463 |
+--------+--------+----------+------+---------------------+---------+--------+--------+
1 row in set (25.21 sec)

In a real project exposed to the public network, 1,000 users running this query concurrently would likely bring the server down. So create an index on the table:

-- Creating the index takes 1 min 8 s
mysql> alter table EMP add index(empno);
Query OK, 0 rows affected (1 min 7.59 sec)

Searching again for employee information with employee number 998877 takes 0.04 seconds:

mysql> select * from EMP where empno=998877;
+--------+--------+----------+------+---------------------+---------+--------+--------+
| empno  | ename  | job      | mgr  | hiredate            | sal     | comm   | deptno |
+--------+--------+----------+------+---------------------+---------+--------+--------+
| 998877 | HJxoaj | SALESMAN | 0001 | 2023-06-24 00:00:00 | 2000.00 | 400.00 |    463 |
+--------+--------+----------+------+---------------------+---------+--------+--------+
1 row in set (0.04 sec)
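
One way to confirm that the speed-up really comes from the index is to ask MySQL for its query plan. The statements below are a minimal sketch against the EMP table above. The exact plan output depends on the MySQL version and on table statistics, so the comments describe what is typically seen rather than exact output.

```sql
-- Show how MySQL plans to execute the lookup.
-- Without an index on empno, the `type` column is typically ALL (a full
-- table scan) and `key` is NULL. After `alter table EMP add index(empno)`,
-- `key` should name the new index and the estimated `rows` should drop
-- from millions to a handful.
EXPLAIN SELECT * FROM EMP WHERE empno = 998877;

-- List the indexes that currently exist on the table.
SHOW INDEX FROM EMP;
```

Running EXPLAIN before and after creating the index makes the difference visible without waiting 25 seconds for the slow query itself.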

2. Disk

MySQL provides a storage service, and the data it stores ultimately lives on disk, a peripheral device. A disk is a mechanical device, so it is slow compared with the computer's electronic components. Combined with the inherent cost of IO, this makes improving IO efficiency a central concern in MySQL.

For background on disks, see the post: [Linux] Buffer/disk inode/dynamic and static library production

3. MySQL, OS, and disk interaction methods (InnoDB storage engine)

MySQL is application software, but it can be thought of as a special kind of file system. It faces heavier IO workloads than most applications, so to improve basic IO efficiency the basic unit of IO in MySQL is 16 KB (the explanations below use the InnoDB storage engine):

mysql> show global status like 'innodb_page_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_page_size | 16384 |
+------------------+-------+
1 row in set (0.06 sec)

1. When the MySQL server runs, it allocates a large region of memory inside the server called the Buffer Pool, which caches data and serves as the staging area for IO with the disk. Flushing 1 MB to disk at a time is certainly not as efficient as flushing 100 MB at a time, so MySQL has its own buffer flush strategy underneath to keep IO efficient.

2. Data files in MySQL are stored on the disk in units of pages.

3. MySQL's CRUD operations require computation to find the right insertion position, or to find the data to be modified or queried.

4. Any computation needs the CPU, and for the CPU to work on data, the data must first be moved into memory. So for some period of time the data exists both on disk and in memory. After the in-memory data is modified, it is flushed back to disk according to a specific flush strategy. This exchange of data between disk and memory is IO, and its basic unit is the page.

5. For higher efficiency, the number of IOs between the system and the disk must be kept as low as possible.
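
The Buffer Pool described above can be inspected from a client session. The following is a minimal sketch using standard InnoDB variables and status counters. The values are specific to each server, so none are shown.

```sql
-- Configured size of the Buffer Pool, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Among the counters returned:
--   Innodb_buffer_pool_read_requests counts logical page reads;
--   Innodb_buffer_pool_reads counts the reads that missed the Buffer Pool
--   and had to go to disk.
-- The smaller the second number is relative to the first, the more disk
-- IO the cache is saving.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```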

4. Understanding indexes and pages in MySQL

The basic unit of the disk hardware is 512 bytes (larger on some devices), while the basic unit of data exchange between the MySQL InnoDB engine and the disk is 16 KB. Each such unit is called a page in MySQL (not to be confused with the operating system's pages).

MySQL holds a large number of pages at once. To manage them it follows the usual approach of describing first and then organizing: besides user data, each page contains some fields that let MySQL organize and manage the pages.

1. Why does MySQL use pages for disk IO instead of loading only as much as it needs?

Pages act as preloading: by the principle of locality, data near what was just accessed is likely to be needed soon, so loading a whole page reduces the number of IOs and improves efficiency.

The main cause of low IO efficiency is not the amount of data moved in a single IO but the number of IOs.

2. How MySQL manages pages

Create a test table and remember to add primary key constraints:

create table if not exists user (
id int primary key, -- a primary key is required so that the primary key index is built on this column by default
age int not null,
name varchar(16) not null
);

mysql> desc user;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | NO   | PRI | NULL    |       |
| age   | int(11)     | NO   |     | NULL    |       |
| name  | varchar(16) | NO   |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
3 rows in set (0.04 sec)

Insert several rows out of primary key order; the table nevertheless stores them sorted by primary key:

mysql> insert into user (id, age, name) values(3, 18, '杨过');
Query OK, 1 row affected (0.01 sec)

mysql> insert into user (id, age, name) values(4, 16, '小龙女');
Query OK, 1 row affected (0.00 sec)

mysql> insert into user (id, age, name) values(2, 26, '黄蓉');
Query OK, 1 row affected (0.01 sec)

mysql> insert into user (id, age, name) values(5, 36, '郭靖');
Query OK, 1 row affected (0.00 sec)

mysql>  insert into user (id, age, name) values(1, 56, '欧阳锋');
Query OK, 1 row affected (0.00 sec)
-- Inserted out of order, but the data comes back in order
mysql> select * from user;
+----+-----+-----------+
| id | age | name      |
+----+-----+-----------+
|  1 |  56 | 欧阳锋    |
|  2 |  26 | 黄蓉      |
|  3 |  18 | 杨过      |
|  4 |  16 | 小龙女    |
|  5 |  36 | 郭靖      |
+----+-----+-----------+
5 rows in set (0.00 sec)

Why does MySQL help us pre-sort the data?

2.1 First, a flawed page structure

MySQL loads one page at a time. Adjacent pages are linked by a doubly linked list, and the records inside a page are also connected by a linked list. This is enough to support CRUD, but once the data volume is large, every lookup becomes an exhaustive sequential scan, which greatly reduces search efficiency.

2.2 Correct data structure inside a single page

Compared with the structure in Section 2.1, this structure spends extra space on a directory. The directory works like the table of contents of a book or the index of a dictionary: it lets us jump quickly to the approximate location of the target data and so speeds up the search. Although the directory takes up a little more space in the page, it greatly improves lookup speed (trading space for time).

This also explains why MySQL sorted the rows automatically when we inserted them out of order at the start of this section. Only when the data is ordered can MySQL build a page directory over it and make later searches efficient. (The page directory must be ordered. Put yourself in the computer's place: could you locate data quickly from a directory whose entries are in random order?)

2.3 Correct data structure between multiple pages

We just created a page directory inside each page, which reduces the number of searches within the page and improves the search efficiency of a single page.

The above picture shows how multiple pages are connected. As it shows, the page directories of different pages are also in sequential order. To search for data across pages, we can only walk the pages one by one from front to back, checking each page directory. With many pages, this greatly slows down searches between pages. To solve this problem, we again use a directory, this time to manage the directories of the pages:

The above figure manages the page directories with pages as well. These directory pages hold no user data; each entry records only the smallest key of a lower-level page together with that page's location. A single "first-level directory" page can therefore manage on the order of a thousand pages.

Then the question comes up again: if there are very many pages at the bottom, the number of first-level directory pages also grows, and traversing them becomes a linear scan once more. The answer is to add another layer:

Generally two or three levels are enough. Pages used only for indexing are also 16 KB each, and a single one can index on the order of a thousand pages of the next level, which is plenty. With three levels, MySQL's query efficiency stays high even on massive data. (If that is not enough, add another level; capacity grows exponentially with the number of levels.) A search proceeds from top to bottom and loads only the pages on its path, so the entire B+ tree does not need to be loaded into memory.
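
To get a feel for why two or three levels go a long way, the rough arithmetic can be run in MySQL itself. The sizes below are illustrative assumptions, not measurements from this article's tables: 14 bytes per directory entry (an 8-byte key plus a 6-byte page pointer) and about 1 KB per row, ignoring page headers and free space.

```sql
-- Back-of-the-envelope capacity of a B+ tree with 16 KB pages.
-- Assumed sizes (hypothetical): 14 bytes per directory entry, 1024 bytes per row.
SELECT
    FLOOR(16384 / 14)   AS entries_per_directory_page,  -- 1170
    FLOOR(16384 / 1024) AS rows_per_leaf_page,          -- 16
    FLOOR(16384 / 14) * FLOOR(16384 / 1024)
                        AS rows_with_two_levels,        -- 18720
    FLOOR(16384 / 14) * FLOOR(16384 / 14) * FLOOR(16384 / 1024)
                        AS rows_with_three_levels;      -- 21902400
```

Under these assumptions a three-level tree already addresses about 21.9 million rows, and each extra level multiplies the capacity by roughly 1170. Smaller rows or smaller keys push the numbers higher.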

If you look carefully at the picture, you will recognize a B+ tree. Two things to note:

1. Not every storage engine uses B+ trees for its indexes; hash indexes and other structures also exist. It is only fair to say that the mainstream storage engines use the B+ tree as their index structure.

2. Only leaf nodes are chained together with a linked list. This is a property of the B+ tree, and it also makes range searches efficient: when the data being read spans pages, the pointer from each leaf to the next page is very convenient.

2.4 Characteristics of B+ trees 

1. Non-leaf nodes do not store data and are only used for indexing. All data is stored in leaf nodes.

2. Data is only saved in leaf nodes, and pointers to the previous and next leaf nodes are saved. Leaf nodes are cascaded through linked list pointers, and the leaf nodes themselves are connected in ascending order of keywords.

2.5 MySQL primary key

When we created the table earlier we specified a primary key column, and MySQL sorted the rows by it. If no primary key is specified when creating the table, MySQL generates a hidden column to serve as the primary key. In other words, with an explicit primary key the rows are ordered by that key; without one, they are ordered by the generated key, so rows come back in the same order in which they were inserted.

The first query in this article, which looked up the employee with number 998877, took 25 seconds. The EMP table's B+ tree was built on the default (hidden) primary key, which has nothing to do with the employee number, so MySQL could only scan the rows linearly, which is of course slow.

Later, we built another B+ tree (an auxiliary index) on the employee number, searched by number again, and found that the search was very fast:

2.6 Why is B+ tree better than other data structures for indexing?

1. Linear data structure

Linear structures such as linked lists and sequential lists must be traversed element by element. It was precisely this inefficiency that led us, step by step, from a linear structure to the B+ tree above.

2. Binary search tree

Recall that the time complexity of a binary search tree is determined entirely by the height of the branch being searched. The best case is O(log N), but as soon as the tree becomes skewed it can degenerate into a linear structure, and the time complexity degrades to O(N).

3. Red-black tree and AVL tree

These two data structures are excellent. As shown in the figure below, searching 80 million records takes only about 26 comparisons. However, red-black trees and AVL trees are still binary trees, so for the same data they are much taller than a B+ tree. The taller the tree, the less data each step eliminates and the more nodes, and potentially disk IOs, a search must visit. Their search efficiency is therefore inferior to that of the B+ tree.

4. Hash

MySQL's official index implementation does support HASH, but not in InnoDB or MyISAM (the MEMORY engine supports it). A hash lookup is O(1) on average, but hashing does not support range searches.

5. B-tree

The difference between B-tree and B+-tree is:

1. Besides the page directory for the next level, the non-leaf nodes of a B-tree also store data. Each non-leaf node therefore holds fewer directory entries for the next level, which may make the whole tree taller.

2. The leaf nodes of a B-tree are not linked together, so a range search has to go back up and down through the tree instead of simply following leaf pointers.
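
As a small illustration of the hash item in the comparison above, the MEMORY storage engine accepts an index declared with USING HASH. The table and index names below (session_cache, idx_token) are hypothetical and serve only as a sketch; plan details vary by MySQL version and table contents.

```sql
-- Hypothetical lookup table; MEMORY is one engine that accepts hash indexes.
CREATE TABLE session_cache (
    token   CHAR(32) NOT NULL,
    user_id INT      NOT NULL,
    INDEX idx_token USING HASH (token)
) ENGINE = MEMORY;

-- An equality test can be answered through the hash index.
EXPLAIN SELECT user_id FROM session_cache WHERE token = 'abc';

-- A range condition cannot use a hash index, so expect a full scan here.
EXPLAIN SELECT user_id FROM session_cache WHERE token > 'abc';
```

This is the trade-off named above: constant-time equality lookups in exchange for losing range searches, which is why a B+ tree is the default choice for general-purpose indexes.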

2.7 Clustered index and non-clustered index

MyISAM storage engine-primary key index

The MyISAM engine also uses a B+ tree as its index structure. Unlike the InnoDB storage engine discussed in the previous section, the data field of a MyISAM leaf node stores the address of the data record. The picture below shows the primary index of a MyISAM table, where Col1 is the primary key.

Clustered index: storing the data together with the B+ tree, as the InnoDB storage engine does, is called a clustered index.

Non-clustered index: keeping the B+ tree separate from the data, as the MyISAM storage engine does, is called a non-clustered index.

2.7.1 MyISAM’s auxiliary (ordinary) index

Besides the primary key index that is created by default, MySQL users may create indexes on other columns. Such indexes are generally called auxiliary (ordinary) indexes.

In MyISAM, an auxiliary (ordinary) index is built exactly like the primary key index. The only difference is that primary key values cannot repeat, while non-primary-key values can.

The MyISAM storage engine allows multiple indexes on one table. The following figure shows an index built on Col2 of a MyISAM table; structurally it is no different from the primary key index:

2.7.2 InnoDB’s auxiliary (ordinary) index

Besides the primary key index, InnoDB users also create auxiliary (ordinary) indexes. Here we build an auxiliary index on Col3 of the table above:

The leaf nodes of an InnoDB non-primary-key index hold no row data, only the primary key value of the corresponding record. Finding a record through an auxiliary (ordinary) index therefore takes two index lookups: first search the auxiliary index to get the primary key, then use the primary key to fetch the record from the primary index. This process is called a back-to-table lookup.

Why does InnoDB not attach the row data to the leaf nodes of auxiliary (ordinary) indexes? The data already lives in the primary key index, so storing a second copy would only waste space.
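
The two-step lookup can be observed with EXPLAIN. The sketch below reuses the user table created earlier and adds an auxiliary index whose name, idx_user_name, is made up for this example. The plan text depends on the MySQL version, so the comments describe what is typically seen rather than exact output.

```sql
-- Hypothetical auxiliary index on the name column of the earlier user table.
ALTER TABLE user ADD INDEX idx_user_name (name);

-- age is not stored in the auxiliary index, so after finding the primary
-- key in idx_user_name, InnoDB must go back to the primary index for the row.
EXPLAIN SELECT id, age, name FROM user WHERE name = '杨过';

-- id and name are both available in the auxiliary index leaf (the leaf
-- stores the primary key), so no second lookup is needed; the Extra column
-- typically reports "Using index".
EXPLAIN SELECT id, name FROM user WHERE name = '杨过';
```

The second query shows the flip side of storing only the primary key in the leaf: whenever the index entry plus the primary key is all a query needs, the trip back to the table is skipped.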

5. Index operations

1. Create an index

1.1 Create primary key index

Method 1: Specify the primary key when creating the table

-- When creating the table, specify primary key directly after the column name
create table user1(id int primary key, name varchar(30));

Method 2: Same as method 1, but written in a different way 

-- At the end of the table definition, declare one or more columns as the primary key
create table user2(id int, name varchar(30), primary key(id));

Method 3: Add the primary key after creating the table

create table user3(id int, name varchar(30));
-- Add the primary key after the table has been created
alter table user3 add primary key(id);

1.2 Create a unique index (a normal index)

Method 1: Specify the unique key when creating the table

-- In the table definition, add the unique attribute directly after a column.
create table user4(id int primary key, name varchar(30) unique);

Method 2: Same as method 1, but written in a different way

-- When creating the table, declare one or more columns as unique at the end of the definition
create table user5(id int primary key, name varchar(30), unique(name));

Method 3: Add a unique key after creating the table

create table user6(id int primary key, name varchar(30));
alter table user6 add unique(name);

1.3 Create ordinary index/composite index

Method 1: Specify the ordinary index when creating the table

create table user8(id int primary key,
name varchar(20),
email varchar(30),
index(name) -- at the end of the table definition, declare a column as an index
);

Method 2: After creating the table, specify a column as a normal index

create table user9(id int primary key, name varchar(20), email varchar(30));
alter table user9 add index(name); -- after creating the table, declare a column as an ordinary index

Method 3: Create a common index with a custom name after creating the table

create table user10(id int primary key, name varchar(20), email varchar(30));
-- Create an index named idx_name
create index idx_name on user10(name);

A composite index is still a single B+ tree: in the output below, name and email appear under the same Key_name. A composite index builds one B+ tree over several fields. If you frequently look up email by name, a composite index on (name, email) avoids the back-to-table lookup, because the value you need is already in the index. Answering a query entirely from an index in this way is called index coverage (a covering index).

mysql> alter table test1 add index(name,email);
Query OK, 0 rows affected (0.17 sec)
Records: 0  Duplicates: 0  Warnings: 0

*************************** 2. row ***************************
        Table: test1
   Non_unique: 1
     Key_name: name -- same index name in both rows: they belong to the same B+ tree
 Seq_in_index: 1
  Column_name: name
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 3. row ***************************
        Table: test1
   Non_unique: 1
     Key_name: name -- same index name in both rows: they belong to the same B+ tree
 Seq_in_index: 2
  Column_name: email
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
3 rows in set (0.00 sec)

Once the composite index exists, a search on name or on (name, email) can use it, but a search on email alone cannot use it to locate rows. This is the leftmost matching principle of indexes.
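
The leftmost matching principle and index coverage can both be checked with EXPLAIN. The definition of test1 is not shown in this article, so the sketch below assumes it is test1(id int primary key, name varchar(20), email varchar(30)) with the composite index (name, email) created above. The literal values are placeholders, and plan output varies by version.

```sql
-- Leftmost column constrained: the composite index can be used to seek.
EXPLAIN SELECT * FROM test1 WHERE name = 'alice';
EXPLAIN SELECT * FROM test1 WHERE name = 'alice' AND email = 'alice@example.com';

-- Covered query: name and email both live in the index, so the Extra
-- column typically reports "Using index" and no back-to-table lookup occurs.
EXPLAIN SELECT email FROM test1 WHERE name = 'alice';

-- Leftmost column missing: the index cannot be used to seek to the rows.
-- MySQL falls back to scanning; it may still scan the whole index rather
-- than the table if the index happens to contain every selected column.
EXPLAIN SELECT * FROM test1 WHERE email = 'alice@example.com';
```

Comparing the `type` and `rows` columns across these plans shows the difference between seeking into the B+ tree and scanning it from end to end.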

1.4 Create full-text index

A full-text index is used when searching article bodies or other fields holding large amounts of text. MySQL provides a full-text index mechanism with some restrictions: before MySQL 5.6 the table's storage engine had to be MyISAM (InnoDB has supported full-text indexes since 5.6), and the default parser splits words on delimiters such as spaces, so it suits English but not Chinese. For full-text search in Chinese, you can use the ngram parser built in since MySQL 5.7.6, or the Chinese version of sphinx (coreseek).

-- Create a table with a full-text index
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body) -- create the full-text index
)engine=MyISAM;
-- Insert data
INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');

Search for rows that mention "database":

-- Ordinary query (cannot use the full-text index)
select * from articles where body like '%database%';
-- Query through the full-text index
SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');
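
To see which of the two queries actually uses the full-text index, their plans can be compared. This is a minimal sketch against the articles table above; the comments describe typical plans, and exact output depends on the MySQL version.

```sql
-- A LIKE pattern with a leading wildcard cannot use an index;
-- `type` is typically ALL (full table scan) and `key` is NULL.
EXPLAIN SELECT * FROM articles WHERE body LIKE '%database%';

-- MATCH ... AGAINST goes through the full-text index;
-- `type` is typically fulltext and `key` names the FULLTEXT index.
EXPLAIN SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');
```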

2. Query index

show keys from table_name; -- method 1
show index from table_name; -- method 2
desc table_name; -- method 3; the information shown this way is brief

mysql> show index from test1\G
*************************** 1. row ***************************
        Table: test1
   Non_unique: 0
     Key_name: PRIMARY -- index name (a B+ tree index)
 Seq_in_index: 1
  Column_name: id -- the column this B+ tree is built on
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE -- index type (B+ tree)
      Comment: 
Index_comment: 
1 row in set (0.00 sec)

3. Delete index

Delete primary key index:

-- Method 1: delete the primary key index
alter table table_name drop primary key;

Deletion of other indexes (such as deletion of unique indexes):

-- Method 2: the index name is the Key_name field shown by show keys from table_name
alter table table_name drop index index_name;
-- Method 3:
mysql> drop index name on user8;

4. What fields should be indexed? 

1. Fields with primary key or unique key constraints already have their own indexes

2. Columns that are frequently used as query conditions are good candidates

3. Columns with poor uniqueness (few distinct values) are not suitable for indexing, even if they are queried frequently

4. Frequently updated fields are not suitable for indexing
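
The "poor uniqueness" rule above can be turned into a quick check. The query below is a sketch against the EMP test table from the beginning of the article; it measures the ratio of distinct values to rows for two columns. Results depend on the data, so none are shown.

```sql
-- Ratio of distinct values to rows: close to 1 means highly selective
-- (a good index candidate), close to 0 means many rows share each value.
-- Note: this scans the whole table, so it is slow on 8,000,000 rows.
SELECT COUNT(DISTINCT empno) / COUNT(*) AS empno_selectivity,
       COUNT(DISTINCT job)   / COUNT(*) AS job_selectivity
FROM EMP;
```

In the sample rows shown earlier every job is SALESMAN, so a column like job would be expected to score very low, while empno should score close to 1.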

Origin blog.csdn.net/gfdxx/article/details/131404147