(1) MySQL index optimization: data structures and performance optimization

An index is a sorted data structure that helps MySQL retrieve data efficiently.

What is an index?


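 1. According to the official definition, an index is a data structure that helps MySQL retrieve data efficiently. Put more simply, a database index is like the table of contents at the front of a book: it speeds up queries.
 2. Indexes are usually large and cannot all be kept in memory, so they are generally stored in files on disk (either in separate index files or together with the data in the data files).
 3. The indexes we usually talk about, including clustered, covering, composite, prefix, and unique indexes, are organized as B+ trees unless stated otherwise (a multiway search tree, not necessarily binary).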
 
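 Advantages and disadvantages of indexes
 Advantages:
 
 1. An index improves the efficiency of data retrieval and lowers the database's IO cost, much like the table of contents of a book.
 2. Sorting data by the indexed column lowers the cost of sorting and reduces CPU consumption.
 	Indexed columns are kept sorted automatically, for both single-column and composite indexes; the ordering of a composite index is just somewhat more complex.
 	If rows are sorted in the order of the indexed columns, ORDER BY statements become much more efficient.
 Disadvantages:
 1. Indexes take up disk space.
 2. Indexes speed up queries but slow down updates to the table. On every insert, delete, or update, MySQL has to save not only the data but also save or update the corresponding index files.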

1. Data structure and index implementation

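- Binary tree (the element on the right is greater than its parent, the element on the left is smaller than its parent)
- Red-black tree (a self-balancing binary search tree)
- Hash table (very efficient for equality lookups, with O(1) time complexity, but it does not support fast range lookups; a range lookup still has to scan the whole table)
- B-Tree (all leaf nodes have the same depth and their pointers are null; index elements are not repeated; the keys in a node are arranged in increasing order from left to right)
- B+Tree (a multiway balanced tree; non-leaf nodes store no data, only redundant copies of the index keys, so they can hold more keys; the leaf nodes contain all index fields; leaf nodes are linked by pointers, which improves range access)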

SHOW GLOBAL STATUS LIKE 'Innodb_page_size';
With the default 16 KB page size, a B+Tree of height 3 can store approximately 21,902,400 index entries.
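
The figure of 21,902,400 can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes a 16 KB page, an 8-byte bigint key plus a 6-byte child pointer in each non-leaf entry, and rows of about 1 KB in the leaf pages. These sizes are illustrative assumptions, not values stated in this article.

```python
# Rough capacity estimate for a B+Tree. All sizes are assumptions for illustration.
PAGE_SIZE = 16 * 1024   # bytes per page (the InnoDB default)
KEY_SIZE = 8            # assumed bigint key
POINTER_SIZE = 6        # assumed size of a child-page pointer
ROW_SIZE = 1024         # assumed average row size in a leaf page


def btree_capacity(height):
    """Rows reachable through `height` levels (height - 1 non-leaf levels plus the leaves)."""
    fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)   # keys per non-leaf page
    rows_per_leaf = PAGE_SIZE // ROW_SIZE             # rows per leaf page
    return fanout ** (height - 1) * rows_per_leaf


print(btree_capacity(3))  # 21902400
```

With wider rows or wider keys the same formula gives a smaller number, so the estimate should be read as an order of magnitude only.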

  • 1. Clustered index: the index and the data are stored together (the leaf nodes contain the complete rows)
  • 2. Non-clustered index (sparse index): the index and the data are stored separately (for example, MyISAM keeps the index in a .MYI file and the data in a .MYD file)
    • Clustered index queries are faster than non-clustered index queries
    • A B+Tree node can hold more index keys than a B-Tree node of the same size
    • It is recommended to use an auto-increment integer primary key

1.1 MyISAM index

1.2 InnoDB index

Create an InnoDB table

CREATE TABLE `abc_innodb`
(
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a`  int(11)     DEFAULT NULL,
  `b`  int(11)     DEFAULT NULL,
  `c`  varchar(10) DEFAULT NULL,
  `d`  varchar(10) DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_abc` (`a`, `b`, `c`)
) ENGINE = InnoDB;

INSERT INTO abc_innodb(a,b,c) VALUES (1,4,2);
INSERT INTO abc_innodb(a,b,c) VALUES (8,5,6);
INSERT INTO abc_innodb(a,b,c) VALUES (6,7,5);
INSERT INTO abc_innodb(a,b,c) VALUES (4,7,3);
INSERT INTO abc_innodb(a,b,c) VALUES (4,7,2);
INSERT INTO abc_innodb(a,b,c) VALUES (9,2,1);
INSERT INTO abc_innodb(a,b,c) VALUES (19,6,2);
INSERT INTO abc_innodb(a,b,c) VALUES (10,1,9);

1.3 The leftmost matching principle

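The leftmost prefix matching principle follows from how a composite index is stored and searched.

In the composite index tree, the leaf nodes at the bottom level are arranged in increasing order of the first column, a, from left to right, but columns b and c are not globally ordered. Column b is in increasing order only within a group of entries that share the same a value, and column c is in increasing order only within a group that shares the same a and b values.

When a query uses this index, the B+Tree first compares column a to decide which direction to search next, left or right. If the a values are equal, it then compares column b. But if the query condition does not include column a, the B+Tree does not know which node to start from.

So creating the index idx_abc(a,b,c) is effectively equivalent to creating three indexes: (a), (a,b), and (a,b,c).

The leftmost prefix matching principle for composite indexes: when a query uses a composite index, MySQL keeps matching columns from left to right until it meets a range condition (>, <, between, like), and stops matching there.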
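
The ordering behind this principle can be illustrated with a small Python sketch. It models the leaf level of idx_abc as a sorted list of (a, b, c) tuples built from the rows inserted above. The helper names `prefix_lookup` and `usable_columns` are hypothetical, and the second function is only a simplified model of the matching rule, not how the MySQL optimizer is implemented.

```python
from bisect import bisect_left

# (a, b, c) values of the rows inserted into abc_innodb above.
ROWS = [(1, 4, "2"), (8, 5, "6"), (6, 7, "5"), (4, 7, "3"),
        (4, 7, "2"), (9, 2, "1"), (19, 6, "2"), (10, 1, "9")]

# Simplified stand-in for the leaf level of idx_abc: sorted by a, then b, then c.
INDEX = sorted(ROWS)


def prefix_lookup(index, prefix):
    """Binary-search for the entries whose leading columns equal `prefix`."""
    start = bisect_left(index, prefix)
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        matches.append(entry)
    return matches


def usable_columns(index_columns, conditions):
    """Index columns that can narrow the search.

    `conditions` maps a column name to "eq" or "range".
    """
    used = []
    for column in index_columns:
        if column not in conditions:
            break                      # gap in the prefix: stop matching
        used.append(column)
        if conditions[column] == "range":
            break                      # columns after a range condition are not matched
    return used


print(prefix_lookup(INDEX, (4, 7)))    # [(4, 7, '2'), (4, 7, '3')]
print([entry[1] for entry in INDEX])   # [4, 7, 7, 7, 5, 2, 1, 6]
print(usable_columns(["a", "b", "c"], {"a": "eq", "b": "range", "c": "eq"}))  # ['a', 'b']
print(usable_columns(["a", "b", "c"], {"b": "eq", "c": "eq"}))                # []
```

The second line of output shows why a condition on b alone cannot use binary search: taken by itself, column b is not in sorted order across the index.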

2. Index type

2.1 Primary key index

Each InnoDB table has a clustered index. The clustered index is built as a B+Tree, and its leaf nodes store entire rows. In general, the clustered index is the primary key index. When a table has no primary key, InnoDB automatically chooses or creates a key to build the clustered index on. The specific rules are as follows:


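 1. If a PRIMARY KEY is defined on the table, InnoDB uses the primary key index as the clustered index.
 2. If no primary key is defined, InnoDB uses the first unique index whose columns are all NOT NULL as the clustered index.
 3. If neither exists, InnoDB builds the clustered index on a hidden 6-byte ROWID field, whose value increases automatically as new rows are inserted.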
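
The three rules form a simple decision chain, sketched below in Python. The function `choose_clustered_index` and its input format are hypothetical and exist only to illustrate the order in which the rules apply.

```python
def choose_clustered_index(primary_key, unique_indexes):
    """Pick the key the clustered index is built on.

    primary_key: list of column names, or None if the table has no PRIMARY KEY.
    unique_indexes: list of (name, columns, all_columns_not_null) in definition order.
    """
    if primary_key:
        return ("PRIMARY", primary_key)              # rule 1
    for name, columns, all_columns_not_null in unique_indexes:
        if all_columns_not_null:
            return (name, columns)                   # rule 2: first NOT NULL unique index
    return ("hidden ROWID", ["ROWID"])               # rule 3: hidden 6-byte row id


print(choose_clustered_index(["id"], []))                          # ('PRIMARY', ['id'])
print(choose_clustered_index(None, [("uk_a", ["a"], False),
                                    ("uk_b", ["b"], True)]))       # ('uk_b', ['b'])
print(choose_clustered_index(None, []))                            # ('hidden ROWID', ['ROWID'])
```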

All indexes other than the clustered index are called secondary (auxiliary) indexes. In InnoDB, the leaf nodes of a secondary index store the primary key value of the row. During retrieval, InnoDB uses this primary key value to look up the row in the clustered index.

In short, the leaf nodes of the primary key index store whole data rows, while the leaf nodes of a secondary index store only the indexed columns and the primary key value.
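
The two-step lookup can be sketched in Python with plain in-memory structures. This is a toy model under stated assumptions: a dict stands in for the clustered index, a sorted list of tuples stands in for the secondary index idx_abc, and the ids are assumed to follow the AUTO_INCREMENT order of the inserts shown earlier. The name `find_by_a` is hypothetical.

```python
from bisect import bisect_left

VALUES = [(1, 4, "2"), (8, 5, "6"), (6, 7, "5"), (4, 7, "3"),
          (4, 7, "2"), (9, 2, "1"), (19, 6, "2"), (10, 1, "9")]

# Clustered index: primary key -> complete row.
CLUSTERED = {
    row_id: {"id": row_id, "a": a, "b": b, "c": c, "d": None}
    for row_id, (a, b, c) in enumerate(VALUES, start=1)
}

# Secondary index: (a, b, c, primary key), sorted. It holds no full rows.
SECONDARY = sorted((r["a"], r["b"], r["c"], r["id"]) for r in CLUSTERED.values())


def find_by_a(a):
    """Step 1: search the secondary index. Step 2: fetch each row by primary key."""
    start = bisect_left(SECONDARY, (a,))
    rows = []
    for entry in SECONDARY[start:]:
        if entry[0] != a:
            break
        primary_key = entry[-1]
        rows.append(CLUSTERED[primary_key])   # the extra lookup in the clustered index
    return rows


print([row["id"] for row in find_by_a(4)])  # [5, 4]
```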

2.2 Ordinary Index

The basic index type in MySQL. It carries no restrictions: duplicate values and NULL values may be inserted into the indexed column.

2.3 Unique index

Values in the indexed column must be unique, but NULL values are allowed.

2.4 Full-text index

Full-text indexes can only be created on text columns of type CHAR, VARCHAR, or TEXT. When the field is long, a normal index makes LIKE fuzzy queries inefficient; in that case you can create a full-text index instead. Both MyISAM and InnoDB support full-text indexes.

2.5 Spatial index

MySQL has supported spatial indexes on InnoDB tables since 5.7, and it follows the rules of the OpenGIS geometry data model.

2.6 prefix index

When creating an index on text types such as CHAR, VARCHAR, and TEXT, you can specify a prefix length for the indexed column; this is not possible for numeric types.

2.7 Single column index

An index built on a single field. (Every MySQL storage engine supports at least 16 indexes per table and a total index length of at least 256 bytes; most engines have higher limits.)

2.8 Combined index (joint index)

An index built on multiple fields is called a composite index. Using a composite index requires following the leftmost prefix matching principle. When conditions permit, a composite index is generally used in place of several single-column indexes.

When creating indexes, check whether several single-column indexes could be replaced by one joint index. A joint index not only saves space but also makes it easier to take advantage of a covering index.

The more fields an index contains, the more likely it is to hold all the data a query needs to return. For example, the joint index (a_b_c) is equivalent to having three indexes: a, a_b, and a_b_c. This saves space, although not a full two thirds compared with keeping the three indexes (a), (a_b), and (a_b_c) separately: the single tree still has to store all three columns, and what is saved is the duplicated column data that the separate indexes would carry.

The principle for creating a joint index: put frequently used columns and highly selective columns first. Frequent use means high index utilization, and high selectivity means the index filters out more rows. Both are optimization factors to consider when creating an index. Columns that queries frequently return can also be added to the joint index: if adding a column lets a query use a covering index, then a joint index is recommended.

When to use a joint index

  1. Check whether several single-column indexes can be merged. If so, replace them with one joint index.
  2. If some columns are frequently used as return fields, consider adding them to an existing index so that the query can use a covering index.


select * from abc_innodb where a = 13 and b = 16 and c = '4';


2.9 Covering index (not index structure)

A covering index is not an index structure; it is a very common optimization technique. When we use a secondary index, we only get the primary key value, so to obtain the row we have to look up the primary key index with that value (going back to the table). But consider the following case: when querying through the composite index on the abc_innodb table above, if only the columns a, b, and c are needed, the result can be returned directly from the leaf nodes of the composite index, with no need to go back to the table. This is what a covering index is.
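
Whether a query is covered comes down to a set comparison between the columns it selects and the columns stored in the secondary index. The helper below is a hypothetical sketch of that check for idx_abc on abc_innodb; it relies on the fact, described earlier, that secondary index leaves also store the primary key.

```python
def needs_table_lookup(selected_columns, index_columns, primary_key_columns):
    """True if the query must go back to the clustered index for missing columns."""
    available = set(index_columns) | set(primary_key_columns)
    return not set(selected_columns) <= available


IDX_ABC = ["a", "b", "c"]
PRIMARY_KEY = ["id"]

print(needs_table_lookup(["a", "b", "c"], IDX_ABC, PRIMARY_KEY))             # False: covered
print(needs_table_lookup(["id", "a"], IDX_ABC, PRIMARY_KEY))                 # False: covered
print(needs_table_lookup(["id", "a", "b", "c", "d"], IDX_ABC, PRIMARY_KEY))  # True: d is missing
```

This is one reason `select *` tends to defeat a covering index: column d is not part of idx_abc, so every matching row needs an extra lookup.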

3. Index data structure

3.1 hash table

Hash table: Java's HashMap, for example, is a hash table that stores data as key-value pairs. If we used a hash table to store table data, the key would hold the index column and the value would hold the row record or the row's disk address. Equality lookups in a hash table are very efficient, with O(1) time complexity, but a hash table does not support fast range lookups: a range lookup can only be done by scanning the entire table.
(This clearly makes it unsuitable as a database index that must serve frequent lookups and range queries.)
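
The contrast between equality and range lookups can be seen in a few lines of Python, using a dict as the hash index and a sorted list as a stand-in for an ordered index. The keys and row values are made up for illustration.

```python
from bisect import bisect_left, bisect_right

ROWS = {key: "row-%d" % key for key in (15, 3, 9, 26, 7, 12)}

hash_index = dict(ROWS)        # key -> row, no ordering between keys
sorted_keys = sorted(ROWS)     # ordered keys, as an ordered index would keep them


def hash_range(index, low, high):
    """A hash index has no order, so every key must be examined."""
    return sorted(key for key in index if low <= key <= high)


def sorted_range(keys, low, high):
    """With ordered keys, two binary searches bound the matching slice."""
    return keys[bisect_left(keys, low):bisect_right(keys, high)]


print(hash_index[9])                      # row-9
print(hash_range(hash_index, 7, 15))      # [7, 9, 12, 15]
print(sorted_range(sorted_keys, 7, 15))   # [7, 9, 12, 15]
```

Both range functions return the same keys, but the first touches all six entries while the second only locates the two ends of the range.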

3.2 Binary tree search

A binary search tree has these characteristics: each node has at most two branches, and the values in the left subtree are smaller than the node while the values in the right subtree are larger.
This property ensures that each comparison halves the search space and reduces the number of IOs. However, the shape of a binary tree depends heavily on the value of the first root node, and this property makes it easy to run into a situation we do not want: the tree never forks. For example, if keys are inserted in increasing order, every node has only a right child and the tree degenerates into a linked list.
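
A minimal unbalanced binary search tree in Python shows how much the insertion order matters. The class and function names are illustrative only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert `key` into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


print(height(build([4, 2, 6, 1, 3, 5, 7])))  # 3: the tree forks at every level
print(height(build([1, 2, 3, 4, 5, 6, 7])))  # 7: degenerates into a linked list
```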

3.3 Balanced Binary Tree

A balanced binary tree builds on the idea of binary search. In addition to the properties of a binary search tree, its defining feature is that the heights of the left and right subtrees of any node differ by at most 1. When data is inserted or deleted, the tree is rebalanced through left and right rotations, so the left subtree can never be very tall while the right subtree is very short.

Query performance on a balanced binary tree is close to that of binary search, with time complexity O(log2 n). For example, querying id=6 takes only two IO operations.

Problems with the balanced binary tree:

  1. Time complexity depends on the tree height. A lookup has to visit as many nodes as the tree is high, and reading each node costs one disk IO, so the height of the tree equals the number of disk IOs per query. If each disk seek takes 10 ms, query performance will be poor once the table is large. (For 1 million rows, log2 n is approximately 20, so a lookup costs about 20 disk IOs, or 20 * 10 ms = 200 ms = 0.2 s.)
  2. A balanced binary tree does not support fast range queries. A range query has to traverse from the root node multiple times, so it is inefficient.
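
The estimate in point 1 can be checked with a short calculation. The sketch assumes one disk IO per tree level and a seek time of 10 ms; the function name is hypothetical.

```python
import math


def binary_tree_lookup_cost(row_count, seek_ms=10):
    """Return (levels, milliseconds) for one lookup in a balanced binary tree."""
    levels = math.ceil(math.log2(row_count))   # one disk IO per level
    return levels, levels * seek_ms


levels, cost_ms = binary_tree_lookup_cost(1_000_000)
print(levels, cost_ms)  # 20 200
```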

3.4 B tree (reconstructing binary tree)

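MySQL data is stored in files on disk. To process a query, the data must first be loaded from disk into memory, and disk IO is very time-consuming, so the focus of optimization is reducing the number of disk IOs. Visiting each node of a binary tree costs one IO, so to reduce disk IO we have to reduce the height of the tree.

Suppose the key is a bigint (8 bytes) and each node has two pointers of 4 bytes each. One node then occupies 16 bytes (8 + 4 * 2 = 16).

In MySQL's InnoDB storage engine, one IO reads one page, 16 KB by default, whereas a binary tree node carries only 16 bytes of useful data per IO, so space utilization is extremely low. To make the most of each IO, a simple idea is to store multiple elements in each node, as many as will fit. Each node can then hold about 1000 index entries (16 KB / 16 bytes ≈ 1000), which turns the binary tree into a multiway tree. By increasing the number of branches, the tree changes from tall and thin to short and fat. For one million rows the tree only needs 2 levels (1000 * 1000 = 1 million), which means only two IOs are needed to reach the data. Fewer disk IOs means faster queries.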
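
The effect of packing many entries into one page can be sketched in Python. The sizes are the ones used in the paragraph above (an 8-byte key and two 4-byte pointers per entry, a 16 KB page); `levels_needed` is a hypothetical helper that counts how many levels a tree with a given fan-out needs.

```python
PAGE_SIZE = 16 * 1024           # bytes read by one IO
ENTRY_SIZE = 8 + 4 * 2          # 8-byte key plus two 4-byte pointers = 16 bytes

ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE   # 1024, rounded to "about 1000" in the text


def levels_needed(row_count, fanout):
    """Smallest number of levels whose total capacity reaches row_count."""
    levels, capacity = 0, 1
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


print(ENTRIES_PER_PAGE)                              # 1024
print(levels_needed(1_000_000, 2))                   # 20 levels for a binary tree
print(levels_needed(1_000_000, ENTRIES_PER_PAGE))    # 2 levels for the multiway tree
```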

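A B-tree is a multiway balanced search tree:

 1. Each node of a B-tree stores multiple elements, and each internal node has multiple branches.
 2. The elements in a node contain both keys and data, and the keys in a node are arranged in increasing order. In other words, every node can store data.
 3. Elements in a parent node do not appear again in its child nodes.
 4. All leaf nodes are on the same level and have the same depth; there are no pointers linking the leaf nodes.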

For example, to query data in the B-tree:

Suppose we query the data whose value equals 10. The query path is disk block 1 -> disk block 2 -> disk block 5.

First disk IO: load disk block 1 into memory and compare keys from the beginning. Since 10 < 15, go left and address disk block 2 on disk.

Second disk IO: load disk block 2 into memory and compare keys from the beginning. Since 7 < 10, address disk block 5 on disk.

Third disk IO: load disk block 5 into memory and compare keys from the beginning. 10 = 10, so the key is found. If the data field stores the row record, fetch it and the query ends. If it stores a disk address, the row must still be fetched from disk by that address before the query ends.

Compared with the balanced binary search tree, the number of key comparisons during the search is not significantly lower, but the number of disk IOs is greatly reduced. Since the comparisons are performed in memory, the time they take is negligible. A B-tree of height 2 to 3 is generally enough for most application scenarios, so building an index on a B-tree improves query efficiency.
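
The walk-through can be mimicked with a small Python sketch that counts how many nodes (disk blocks) a B-tree lookup loads. The tree below is hypothetical: its keys are made up so that the lookup of 10 follows the same path, disk block 1 -> disk block 2 -> disk block 5, and it is not a reconstruction of the original figure.

```python
class BTreeNode:
    def __init__(self, name, keys, children=None):
        self.name = name
        self.keys = keys                               # sorted keys
        self.values = ["row-%d" % k for k in keys]     # B-tree: every node stores data
        self.children = children or []                 # empty for leaf nodes


def search(root, key):
    """Return (value or None, names of the blocks loaded along the way)."""
    loaded = []
    node = root
    while node is not None:
        loaded.append(node.name)                       # one disk IO per node
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i], loaded
        node = node.children[i] if node.children else None
    return None, loaded


block4 = BTreeNode("disk block 4", [3, 5])
block5 = BTreeNode("disk block 5", [10, 12])
block6 = BTreeNode("disk block 6", [17])
block7 = BTreeNode("disk block 7", [25])
block2 = BTreeNode("disk block 2", [7], [block4, block5])
block3 = BTreeNode("disk block 3", [20], [block6, block7])
ROOT = BTreeNode("disk block 1", [15], [block2, block3])

print(search(ROOT, 10))  # ('row-10', ['disk block 1', 'disk block 2', 'disk block 5'])
print(search(ROOT, 15))  # ('row-15', ['disk block 1'])
```

The second call returns after a single block because a B-tree keeps data in internal nodes as well as in leaves.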

The B-tree can still be improved, though. It has the following problems:


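 1. The B-tree does not support fast range queries either. A range query still has to start from the root node and traverse key by key, so its efficiency is clearly low.
 2. If the data field stores row records, the space taken by each row grows with the number of columns. A page can then hold fewer entries, the tree becomes taller, and the number of disk IOs increases.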

3.5 B+ tree (B+Tree, transformation of B tree)

The B+Tree is an upgraded version of the B-tree: MySQL refines the B-tree further and builds its indexes on a B+Tree. The main difference between a B+Tree and a B-tree is whether non-leaf nodes store data.


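 1. B-tree: both non-leaf nodes and leaf nodes store data.
 2. B+Tree: only leaf nodes store data; non-leaf nodes store only keys. Leaf nodes are connected by bidirectional pointers, so the bottom-level leaf nodes form an ordered doubly linked list.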

The bottom-level leaf nodes of a B+Tree contain all index entries. Because the data is stored in the leaf nodes, every lookup in a B+Tree has to reach a leaf node to get the data, so the number of disk IOs per query is tied directly to the height of the tree. On the other hand, because the data lives only in the leaf nodes, each disk block used for index keys can hold more keys than in a B-tree, so in theory a B+Tree is shorter than a B-tree holding the same data. Index covering also applies here: when the index contains all the data the current query needs, the result can be returned as soon as the index entry is found, without fetching the full row.

3.5.1 Example: equality query

Suppose we query the data whose value equals 9. The query path is disk block 1 -> disk block 2 -> disk block 6.

First disk IO: load disk block 1 into memory and compare keys from the beginning. Since 9 < 15, go left and address disk block 2 on disk.

Second disk IO: load disk block 2 into memory and compare keys from the beginning. Since 7 < 9 < 12, address disk block 6 on disk.

Third disk IO: load disk block 6 into memory and compare keys from the beginning. 9 is found at the third index entry. If the data field stores the row record, fetch it and the query ends. If it stores a disk address, the row must still be fetched from disk by that address before the query ends. (The distinction is that InnoDB stores the row data in the leaf, while MyISAM stores a disk address.)

3.5.2 Range query

Suppose we want to find the data between 9 and 26. The search path is disk block 1 -> disk block 2 -> disk block 6 -> disk block 7.

First find the data whose value equals 9 and add it to the result set. This step is the same as the equality query above and costs three disk IOs.

After finding 9, we use the fact that the bottom-level leaf nodes form an ordered list: starting from key 9 in disk block 6, we traverse forward and collect all data that satisfies the condition.

Fourth disk IO: follow the next pointer of disk block 6 to address disk block 7 on disk, load disk block 7 into memory, and compare keys from the beginning. Since 9 < 25 < 26 and 9 < 26 <= 26, both entries are added to the result set.

Because the primary key is unique (no later entry can be <= 26), there is no need to look further. The query ends and the result is returned.
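
Both the equality query and the range query can be sketched with a small B+Tree model in Python. The tree is hypothetical: the keys are made up so that the lookups follow the same block paths as the walk-throughs (1 -> 2 -> 6, then on to 7), and the separator keys differ from those in the original figure. Non-leaf nodes hold keys only, leaves hold the rows, and each leaf points to its right sibling.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, name, keys):
        self.name = name
        self.keys = keys
        self.rows = ["row-%d" % k for k in keys]   # data lives only in leaves
        self.next = None                           # pointer to the right sibling


class Inner:
    def __init__(self, name, keys, children):
        self.name = name
        self.keys = keys                           # separator keys, no row data
        self.children = children


def build_tree():
    leaves = [Leaf("disk block 4", [1, 2]), Leaf("disk block 5", [3, 5]),
              Leaf("disk block 6", [7, 8, 9]), Leaf("disk block 7", [15, 25, 26]),
              Leaf("disk block 8", [30, 40])]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    block2 = Inner("disk block 2", [3, 7], leaves[0:3])
    block3 = Inner("disk block 3", [30], leaves[3:5])
    return Inner("disk block 1", [15], [block2, block3])


def find_leaf(root, key, loaded):
    node = root
    while isinstance(node, Inner):
        loaded.append(node.name)
        node = node.children[bisect_right(node.keys, key)]
    loaded.append(node.name)
    return node


def lookup(root, key):
    loaded = []
    leaf = find_leaf(root, key, loaded)
    row = leaf.rows[leaf.keys.index(key)] if key in leaf.keys else None
    return row, loaded


def range_query(root, low, high):
    """Collect rows with low <= key <= high, assuming unique keys."""
    loaded, result = [], []
    leaf = find_leaf(root, low, loaded)
    while leaf is not None:
        for key, row in zip(leaf.keys, leaf.rows):
            if key > high:
                return result, loaded
            if key >= low:
                result.append(row)
            if key == high:                        # unique keys: nothing later can match
                return result, loaded
        leaf = leaf.next                           # follow the sibling pointer
        if leaf is not None:
            loaded.append(leaf.name)
    return result, loaded


ROOT = build_tree()
print(lookup(ROOT, 9))           # ('row-9', ['disk block 1', 'disk block 2', 'disk block 6'])
print(range_query(ROOT, 9, 26))  # four rows, four blocks loaded (1, 2, 6, 7)
```

The range query never returns to the root: after the first three loads it only follows the leaf-level sibling pointer, which is the advantage the B+Tree has over the B-tree here.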


Origin blog.csdn.net/qq_43565087/article/details/109200610