MySQL Advanced Features_06_Data Structure of Index_Shang Silicon Valley_Song Hongkang

Foreword:
These notes follow episodes P115~P120 of the Shang Silicon Valley MySQL video course on Bilibili.

1. Why use an index

An index is a data structure used by a storage engine to quickly find data records. It is like the table of contents of a textbook: by finding the page number of an article in the table of contents, you can quickly locate the article you want. The same is true in MySQL. When searching for data, MySQL first checks whether the query condition hits an index. If it does, the data is looked up through the index; if not, a full table scan is needed.

image-20230301233743077

As shown in the figure, when the database has no index, the data is scattered across different locations on the hard disk. Reading it requires the disk arm to swing back and forth to find the data, which is very time-consuming. Even if the data is laid out sequentially, the rows still have to be read in order from 1 to 6, which amounts to 6 I/O operations and is still very time-consuming. If we do not use any index structure to help locate the data quickly, then to find the record with Col 2 = 89 we have to search and compare row by row: start from Col 2 = 34, compare, find that it does not match, and continue to the next row. Our current table has fewer than 10 rows, but if the table is large, with more than ten million rows, finding a record means a great many disk I/Os. To find the record with Col 2 = 89, the CPU must first locate the record on disk, load it into memory, and then process the data. The most time-consuming part of this process is disk I/O, which involves disk rotation time (relatively fast) and head seek time (slow and costly).

Now suppose the data is stored in a data structure such as a binary tree, as shown in the following figure:

image-20230302220121865

Adding an index on the field Col 2 is equivalent to maintaining an index data structure for Col 2 on the hard disk, namely this binary search tree. Each node of the binary search tree stores a (K, V) pair: the key is the Col 2 value, and the value is the file pointer (address) of the row that contains the key. For example, the root node of the binary search tree is (34, 0x07). Now that Col 2 is indexed, a search for the record with Col 2 = 89 first searches the binary search tree: read 34 into memory, 89 > 34, so continue to the right child; read 89 into memory, 89 == 89, so the node is found. The value stored in that node then gives the address of the record we are looking for. Only two lookups are needed to locate the address of the record, so the search is faster.
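The lookup just described can be sketched in a few lines of Python. This is only an illustration of a binary search tree holding (key, address) pairs, not how MySQL stores indexes. The root (34, 0x07) comes from the text; the other node, its address 0x77, and the left child are made-up values, and `Node` and `bst_search` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int                         # the Col 2 value
    addr: int                        # file pointer (address) of the row
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bst_search(root, key):
    """Return (addr, nodes_read); addr is None if the key is absent."""
    node, reads = root, 0
    while node is not None:
        reads += 1                   # one node is read into memory
        if key == node.key:
            return node.addr, reads
        node = node.left if key < node.key else node.right
    return None, reads


# Root (34, 0x07) is from the text; every other key/address is an assumption.
root = Node(34, 0x07, left=Node(5, 0x56), right=Node(89, 0x77))

addr, reads = bst_search(root, 89)
print(hex(addr), reads)              # the record's address and the number of node reads
```

With this toy tree, finding 89 reads two nodes (34, then 89), matching the walk-through above, whereas a row-by-row scan would compare every row before it.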

This is why we build indexes: the purpose is to reduce the number of disk I/Os and speed up queries.

2. Index and its advantages and disadvantages

2.1 Index overview

MySQL's official definition of an index is: an index is a data structure that helps MySQL obtain data efficiently.

The nature of an index: an index is a data structure. You can think of it simply as a "sorted data structure for fast lookup" that satisfies a specific search algorithm. These data structures point to the data in some way, so that advanced search algorithms can be implemented on top of them.

Indexes are implemented in the storage engine, so the indexes of different storage engines are not necessarily identical, and not every storage engine supports every index type. Each storage engine also defines the maximum number of indexes per table and the maximum index length. All storage engines support at least 16 indexes per table and a total index length of at least 256 bytes. Some storage engines support more indexes and larger index lengths.

2.2 Advantages

(1) Similar to the bibliographic index built by a university library, an index improves the efficiency of data retrieval and reduces the database's I/O cost. This is also the main reason for creating an index.

(2) By creating a unique index, the uniqueness of each row of data in the table can be guaranteed.

(3) In terms of enforcing referential integrity, an index can speed up joins between tables. In other words, queries that join a dependent child table with its parent table run faster.

(4) When a query uses grouping and sorting clauses, an index can significantly reduce the time spent on grouping and sorting, and lower CPU consumption.

2.3 Disadvantages

Adding indexes also has many disadvantages, mainly in the following aspects:

(1) Creating and maintaining indexes takes time, and the time spent grows as the amount of data increases.

(2) Indexes occupy disk space. In addition to the data space occupied by the table, each index occupies a certain amount of physical space on disk. If there are a large number of indexes, the index file may reach the maximum file size faster than the data file.

(3) Although indexes greatly improve query speed, they also slow down updates to the table. When data in the table is inserted, deleted, or modified, the indexes must be maintained dynamically, which reduces the speed of data maintenance.

Therefore, when deciding whether to use an index, its advantages and disadvantages need to be weighed together.

Hint:

Indexes improve query speed but slow down record insertion. When a large amount of data has to be inserted, the best approach is to drop the indexes on the table first, insert the data, and then recreate the indexes after the insertion is complete.

3. Deduction of indexes in InnoDB

3.1 Lookup before indexing

Let's look at an example of an exact match:

SELECT [column_list] FROM table_name WHERE column_name = xxx;

1. Search in a page

Assuming that there are relatively few records in the current table, all records can be stored in one page. When searching for records, it can be divided into two situations according to different search conditions:

  • Use the primary key as the search condition

    Binary search can quickly locate the corresponding slot in the page directory; we then traverse the records in that slot's group to quickly find the specified record.

  • Use other columns as search criteria

    Because the data page has no page directory for non-primary-key columns, we cannot use binary search to quickly locate a slot. In this case we can only start from the minimum record and traverse each record in the singly linked list in turn, checking whether each record meets the search condition. Obviously, this kind of search is very inefficient.
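The two cases can be contrasted with a small sketch. It models one page as a sorted list of records plus a page directory of slots, and is only an approximation of the idea. The record values, the fixed `GROUP_SIZE` of 4, and the function names are assumptions made for the example, not InnoDB's actual layout.

```python
from bisect import bisect_left

# One page: records sorted by primary key (standing in for the singly linked list).
records = [(pk, f"row-{pk}") for pk in (1, 3, 4, 5, 8, 10, 12, 20, 100)]
GROUP_SIZE = 4  # hypothetical number of records per slot group

# Page directory: one slot per group = (largest pk in the group, index of its first record).
slots = [
    (records[min(i + GROUP_SIZE, len(records)) - 1][0], i)
    for i in range(0, len(records), GROUP_SIZE)
]


def find_by_pk(pk):
    """Binary search over the slots, then a short scan inside one group."""
    slot_keys = [k for k, _ in slots]
    s = bisect_left(slot_keys, pk)        # first slot whose largest key >= pk
    if s == len(slots):
        return None
    start = slots[s][1]
    for rec in records[start:start + GROUP_SIZE]:
        if rec[0] == pk:
            return rec
    return None


def find_by_other_column(value):
    """No directory for this column: walk every record from the minimum record."""
    for rec in records:
        if rec[1] == value:
            return rec
    return None


print(find_by_pk(20), find_by_other_column("row-8"))
```

The primary key search touches a handful of slots and at most one group, while the search on the other column may have to visit every record in the page.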

2. Find in many pages

In most cases a table stores a lot of records, and many data pages are needed to hold them. Finding a record among many pages takes two steps:

  1. Navigate to the page where the record is located.

  2. Find the corresponding record from the page where it is located.

Without an index, whether we search by the primary key column or by another column, we cannot quickly locate the page that holds the record. We can only start from the first page and follow the doubly linked list, searching each page with the in-page method described above. Because every data page has to be traversed, this approach is extremely time-consuming. What if a table has 100 million records? This is where the index comes in.

3.2 Design index

Create a table:

create table index_demo
(
    c1 int,
    c2 int,
    c3 char(1),
    primary key (c1)
) row_format = Compact;

The newly created index_demo table has 2 columns of type INT and 1 column of type CHAR(1). We make column c1 the primary key, and the table uses the Compact row format to actually store records. Here is a simplified row format diagram for the index_demo table:

image-20230302234141729

We only show these parts of the record in the diagram:

  • record_type: an attribute of the record header information that indicates the type of the record. 0 means an ordinary record, 2 means the minimum record, 3 means the maximum record, and 1 is not used yet (it is covered below).
  • next_record: an attribute of the record header information that indicates the address offset of the next record relative to this record. We use arrows to show which record is next.
  • Column values: only the three columns of the index_demo table are recorded here, namely c1, c2, and c3.
  • Other information: everything except the three kinds of information above, including the values of other hidden columns and the record's additional information.

With the other-information item temporarily removed from the record format diagram and the record drawn vertically, it looks like this:

image-20230302235732355

The schematic for putting some records into a page is:

image-20230302235918106

1. A simple index design scheme

Why do we have to traverse all the data pages when looking for records by some search condition? Because the records in the pages follow no rule, we do not know which records in which pages match the search condition, so we have to traverse all the data pages in turn. So what if we want to quickly locate which data pages hold the records we are looking for? We can build a directory for quickly locating the data page where a record lives. To build this directory, the following things must be done:

  • The primary key value of the user record in the next data page must be greater than the primary key value of the user record in the previous page.

    Assumption: each data page can store at most 3 records (in reality a data page is large and can store many records). With this assumption, we insert 3 records into the index_demo table:

    INSERT INTO index_demo VALUES (1,4,'u'),(3,9,'d'),(5,3,'y');
    

These records are then linked into a singly linked list in order of primary key value, as shown in the figure:

image-20230305120621622

As the figure shows, all three records of the index_demo table have been inserted into the data page numbered 10. Now let's insert another record:

INSERT INTO index_demo VALUES (4,4,'a');

Since page 10 can hold at most 3 records, we have to allocate a new page:

image-20230305125541179

Note that the numbers of newly allocated data pages may not be consecutive. The pages are linked into a list only by maintaining the numbers of the previous and next pages. In addition, the largest primary key value of the records in page 10 is 5, while the record in page 28 has primary key value 4. Because 5 > 4, this violates the requirement that the primary key values of user records in the next data page must be greater than those in the previous page. So inserting the record with primary key value 4 has to be accompanied by a record move: the record with primary key value 5 is moved to page 28, and then the record with primary key value 4 is inserted into page 10. The process is shown in the following diagrams:

image-20230305130457417

image-20230305130511187

This process shows that when records in pages are added, deleted, or modified, we must use operations such as record moves to keep the following state true at all times: the primary key values of user records in the next data page must be greater than those of user records in the previous page. We call this process a page split.
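The record move in this example can be replayed with a toy model. It uses the text's assumption of at most 3 records per page, and it is a simplified sketch that keeps the smallest keys in the old page, not InnoDB's real split algorithm. The `insert` helper and the way the caller supplies the new page number are hypothetical.

```python
PAGE_CAPACITY = 3  # the text's assumption: at most 3 records per page

# page_no -> sorted primary keys; page_order follows the linked list of pages.
pages = {10: [1, 3, 5]}
page_order = [10]


def insert(pk, new_page_no):
    """Insert pk; on overflow, allocate new_page_no and move the largest keys there."""
    # Choose the last page whose smallest key is <= pk (or the first page).
    target = page_order[0]
    for no in page_order:
        if pages[no] and pages[no][0] <= pk:
            target = no
    keys = sorted(pages[target] + [pk])
    if len(keys) <= PAGE_CAPACITY:
        pages[target] = keys
        return
    # Page split: the old page keeps the smallest keys, the rest move to the new page,
    # so every key in the next page stays greater than every key in the previous page.
    pages[target] = keys[:PAGE_CAPACITY]
    pages[new_page_no] = keys[PAGE_CAPACITY:]
    page_order.insert(page_order.index(target) + 1, new_page_no)


insert(4, new_page_no=28)
print(pages, page_order)
```

After the insert, page 10 holds the keys 1, 3, 4 and page 28 holds 5, which is the state shown in the diagrams above.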

  • Create a directory entry for all pages

Because data page numbers may not be consecutive, after many records have been inserted into the index_demo table, the result may look like this:

image-20230305132219697

Because these 16KB pages are not contiguous in physical storage, if we want to quickly locate the page holding certain records by primary key value among so many pages, we need to build a directory for them. Each page corresponds to one directory entry, and each directory entry contains the following two parts:

  • The smallest primary key value among the user records in the page, which we denote by key.
  • The page number, which we denote by page_no.

So the directory for the pages above looks like this:

image-20230305132610770

Take page 28 as an example: it corresponds to directory entry 2, which contains the page number 28 and the minimum primary key value 5 of the user records in that page. We only need to store the directory entries contiguously in physical storage (for example, in an array) to be able to quickly find a record by primary key value. For example, finding the record with primary key value 20 takes two steps:

  1. First, use binary search on the directory to quickly determine that the record with primary key value 20 is in directory entry 3 (because 12 < 20 < 209), whose page is page 9.
  2. Then locate the specific record in page 9 using the in-page search method described earlier.

With that, a simple directory for the data pages is complete. This directory has another name: index.
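A minimal sketch of this simple directory, assuming the entries are kept in a sorted Python list. The keys 1, 5, 12, 209 and the page numbers 10, 28, 9 follow the text; the page number 20 for the last entry is an assumption, and `find_page` is a hypothetical helper name.

```python
from bisect import bisect_right

# Directory entries (key, page_no), sorted by key; key is the smallest pk in the page.
# Page number 20 for the last entry is an assumption made for this example.
directory = [(1, 10), (5, 28), (12, 9), (209, 20)]


def find_page(pk):
    """Return the page whose key range contains pk, using binary search."""
    keys = [k for k, _ in directory]
    i = bisect_right(keys, pk) - 1    # last entry with key <= pk
    if i < 0:
        return None                   # pk is smaller than every key in the directory
    return directory[i][1]


print(find_page(20))
```

Searching for primary key 20 lands on the third entry, because 12 <= 20 < 209, and returns page 9; the record itself is then located inside that page.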

2. Index scheme in InnoDB

① Iteration 1: pages of directory entry records

The scheme above is called a simple index scheme because, in order to use binary search to quickly locate a directory entry when searching by primary key value, we assumed that all directory entries can be stored contiguously in physical storage. This causes several problems:

  • InnoDB uses pages as the basic unit for managing storage space, so at most 16KB of contiguous storage can be guaranteed. As the number of records in the table grows, a very large contiguous storage space would be needed to hold all the directory entries, which is unrealistic.
  • We often insert and delete records. Suppose we delete all the records in page 28: then directory entry 2 no longer needs to exist, so all the directory entries after it have to be moved forward. A small change affects everything, and this is inefficient.

So we need a way to manage all the directory entries flexibly. We notice that a directory entry actually looks much like a user record, except that a directory entry has only two columns, the primary key and the page number. To distinguish them from user records, we call the records that represent directory entries directory entry records. So how does InnoDB tell whether a record is an ordinary user record or a directory entry record? It uses the record_type attribute in the record header information, whose values mean the following:

  • 0: normal user record
  • 1: directory entry record
  • 2: minimum record
  • 3: maximum record

This is how we put the directory items we used earlier into the data page:

image-20230305135352189

As the figure shows, we allocated a new page numbered 30 specifically to store directory entry records. Here we stress again the differences between directory entry records and ordinary user records:

  • The record_type value of a directory entry record is 1, while the record_type value of an ordinary user record is 0.
  • A directory entry record has only two columns, the primary key value and the page number, while the columns of an ordinary user record are defined by the user and may be many, plus the hidden columns added by InnoDB itself.
  • For reference: the record header information also has an attribute called min_rec_mask. Only the directory entry record with the smallest primary key value in a page that stores directory entry records has min_rec_mask set to 1; for all other records min_rec_mask is 0.

What they have in common: both use the same kind of data page, and both generate a Page Directory for the primary key values, so binary search can be used to speed up lookups by primary key value.

Now take the search for the record with primary key value 20 as an example. Finding a record by primary key value takes roughly the following two steps:

1. First go to the page that stores directory entry records, page 30, and use binary search to quickly locate the matching directory entry record. Because 12 < 20 < 209, the record is in page 9.

2. Then go to page 9, which stores user records, and use binary search to quickly locate the user record whose primary key value is 20.

② Iteration 2: multiple pages of directory entry records

Although a directory entry record stores only a primary key value and the corresponding page number, and needs much less storage than a user record, a page is still only 16KB and can hold a limited number of directory entry records. What if the table holds so much data that one data page is not enough to store all the directory entry records?

Here we assume that a page storing directory entry records can hold at most 4 directory entry records. So if we now insert a user record with primary key value 320 into the figure above, we need to allocate a new page to store directory entry records:

image-20230305142916008

As can be seen from the figure, we need two new data pages after inserting a user record with a primary key value of 320:

  • Page 31, newly created to store this user record.
  • Because page 30, which originally stored the directory entry records, is full (we assumed it can hold only 4 directory entry records), a new page 32 is needed to store the directory entry for page 31.

Now that more than one page stores directory entry records, finding a user record by primary key value takes roughly three steps. Take the search for the record with primary key value 20 as an example:

  1. Determine the directory entry record page

    We now have two pages that store directory entry records, page 30 and page 32. The directory entries in page 30 cover primary key values in [1, 320), and those in page 32 cover primary key values not less than 320, so the directory entry for the record with primary key value 20 is in page 30.

  2. Use the directory entry record page to determine the page where the user record actually resides.

    How to locate a directory entry record by primary key value within a page that stores directory entry records has already been described.

  3. Locate the specific record in the page that actually stores user records.

③ Iteration 3: a directory page for the directory entry record pages

Here comes the problem. The first step of this lookup needs to locate the page that stores the directory entry records, but these pages are not contiguous. If the table holds a lot of data, there will be many pages storing directory entry records. How can we quickly locate the right one by primary key value? By generating a higher-level directory for the pages that store directory entry records, just like a multi-level directory: the large directory nests small directories, and the small directories point to the actual data. So the schematic of the pages now looks like this:

image-20230305152552270

As shown in the figure, we generated page 33 to store the higher-level directory entries. The two records in this page represent page 30 and page 32 respectively. If the primary key value of the user record is in [1, 320), go to page 30 to find the more detailed directory entry record; if the primary key value is not less than 320, go to page 32 to find the more detailed directory entry record.

As the number of records in the table increases, the level of this directory will continue to increase. If we simplify it, we can use the following diagram to describe it:

image-20230305153101169

This data structure is called a B+ tree.
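The descent through the multi-level directory can be sketched as follows. The page numbers 33, 30, 32, 31, 10, 28, and 9 follow the text; page 20 and most of the leaf contents are assumptions filled in so that the example runs, and the dictionary layout and `search` function are hypothetical.

```python
from bisect import bisect_right

# page_no -> ("internal", [(min_key, child_page_no), ...]) or ("leaf", [pk, ...]).
# Leaf contents other than the keys named in the text are illustrative assumptions.
pages = {
    33: ("internal", [(1, 30), (320, 32)]),
    30: ("internal", [(1, 10), (5, 28), (12, 9), (209, 20)]),
    32: ("internal", [(320, 31)]),
    10: ("leaf", [1, 3, 4]),
    28: ("leaf", [5, 8, 10]),
    9: ("leaf", [12, 20, 100]),
    20: ("leaf", [209, 220, 300]),
    31: ("leaf", [320]),
}
ROOT = 33


def search(pk):
    """Walk from the root down to a leaf; return (found, pages_visited)."""
    page_no, path = ROOT, []
    while True:
        path.append(page_no)
        kind, entries = pages[page_no]
        if kind == "leaf":
            return pk in entries, path
        keys = [k for k, _ in entries]
        i = max(bisect_right(keys, pk) - 1, 0)   # last directory entry with key <= pk
        page_no = entries[i][1]


print(search(20))
print(search(320))
```

In this toy layout, the search for primary key 20 visits pages 33, 30, and 9, one page per level, which is the three-step lookup described above.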

④ B+ tree

Whether a data page stores user records or directory entry records, we have put all of them into the B+ tree data structure, so we also call these data pages nodes. As the figure shows, the actual user records are all stored in the bottom-level nodes of the B+ tree, which are called leaf nodes. The other nodes, which store directory entries, are called non-leaf nodes or internal nodes, and the topmost node of the B+ tree is called the root node.

The nodes of a B+ tree can be divided into layers. The bottom layer, where the user records are stored, is defined as layer 0, and the layer numbers increase upward. Earlier we made a very extreme assumption: a page storing user records holds at most 3 records, and a page storing directory entry records holds at most 4 records. In reality a page holds far more records. Assume that every leaf-node data page storing user records can hold 100 user records and every internal-node data page storing directory entry records can hold 1000 directory entry records. Then:

  • If the B+ tree has only 1 layer, that is, only one node storing user records, it can hold at most 100 records.
  • If the B+ tree has 2 layers, it can hold at most 1000 * 100 = 100,000 records.
  • If the B+ tree has 3 layers, it can hold at most 1000 * 1000 * 100 = 100,000,000 (100 million) records.
  • If the B+ tree has 4 layers, it can hold at most 1000 * 1000 * 1000 * 100 = 100,000,000,000 (100 billion) records.

Can your table hold 100 billion records? So in general, the B+ trees we use do not exceed 4 layers, and finding a record by primary key value takes at most 4 page lookups (3 directory entry pages and 1 user record page). And because every page has a so-called Page Directory, records within a page can also be located quickly with binary search.
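The capacity arithmetic can be checked with a short calculation. It uses the assumed figures from the text (100 user records per leaf page, 1000 directory entry records per internal page); `max_records` is a hypothetical helper name.

```python
def max_records(layers, leaf_capacity=100, internal_fanout=1000):
    """Upper bound on user records for a B+ tree with the given number of layers."""
    if layers < 1:
        raise ValueError("a B+ tree has at least one layer")
    # Each internal layer multiplies the number of reachable leaf pages by the fanout.
    return internal_fanout ** (layers - 1) * leaf_capacity


for layers in range(1, 5):
    print(layers, max_records(layers))
```

Each extra layer multiplies the capacity by the fanout of an internal page, which is why a very shallow tree is enough for very large tables.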

3.3 Common index concepts

According to their physical implementation, indexes can be divided into two types: clustered indexes and non-clustered indexes. We also call non-clustered indexes secondary indexes or auxiliary indexes.

1. Clustered index

The clustered index is not a separate index type but a way of storing data (all user records are stored in the leaf nodes). This is the so-called "the index is the data, and the data is the index".

Terminology: "Clustering" means that rows of data are stored together in clusters of adjacent key values.

Features:

  1. Records and pages are sorted by primary key value, which has three implications:

    • Records within a page are arranged into a singly linked list in order of primary key.
    • The pages that store user records are arranged into a doubly linked list in order of the primary keys of the user records in them.
    • The pages that store directory entry records are divided into levels, and the pages within the same level are arranged into a doubly linked list in order of the primary keys of the directory entry records in them.
  2. The leaf nodes of the B+ tree store complete user records.

    A complete user record is one that stores the values of all columns (including hidden columns).

We call a B+ tree with these two characteristics a clustered index; all complete user records are stored in the leaf nodes of the clustered index. A clustered index does not need to be created explicitly with an INDEX statement in MySQL; the InnoDB storage engine automatically creates the clustered index for us.

Advantages:

  • Data access is faster: because the clustered index keeps the index and the data in the same B+ tree, fetching data through the clustered index is faster than through a non-clustered index.
  • Sorted lookups and range lookups on the primary key are very fast with the clustered index.
  • When a range of data is looked up and returned in clustered index order, the rows are stored close together, so the database does not have to fetch data from multiple data blocks, which saves a great deal of I/O.

Disadvantages:

  • Insert speed depends heavily on insert order. Inserting in primary key order is the fastest; otherwise page splits occur and seriously affect performance. Therefore, for InnoDB tables we generally define an auto-increment ID column as the primary key.
  • Updating the primary key is expensive, because it causes the updated row to move. Therefore, for InnoDB tables we generally define the primary key as non-updatable.
  • Access through a secondary index requires two index lookups: the first finds the primary key value, and the second finds the row data by that primary key value.

Limitations:

  • In MySQL, only the InnoDB storage engine currently supports clustered indexes; MyISAM does not.
  • Because data can be physically stored in only one sort order, each MySQL table can have only one clustered index. Usually it is the table's primary key.
  • If no primary key is defined, InnoDB chooses a non-null unique index instead. If there is no such index, InnoDB implicitly defines a primary key to serve as the clustered index.
  • To make full use of the clustering property of the clustered index, choose ordered, sequential ids for the primary key column of an InnoDB table whenever possible. Unordered ids such as UUID, MD5, HASH, or string columns are not recommended as primary keys, because they cannot guarantee that the data grows sequentially.

2. Secondary index (auxiliary index, non-clustered index)

The clustered index described above only works when the search condition is the primary key value, because the data in the B+ tree is sorted by primary key. So what if we want to use other columns as the search condition? Surely we should not have to traverse the linked list of records from beginning to end.

Answer: we can build several more B+ trees, with the data in different B+ trees sorted by different rules. For example, using the value of column c2 as the sorting rule for data pages and the records in them, we build another B+ tree, as shown in the following figure:

image-20230305165034185

This B+ tree has several differences from the clustered index introduced above:

  • Records and pages are sorted by the value of column c2, which has three implications:
    • Records within a page are arranged into a singly linked list in order of column c2.
    • The pages that store user records are arranged into a doubly linked list in order of the c2 values of the records in them.
    • The pages that store directory entry records are divided into levels, and the pages within the same level are arranged into a doubly linked list in order of the c2 values of the directory entry records in them.
  • The leaf nodes of the B+ tree do not store complete user records, but only the values of column c2 and the primary key.
  • The directory entry records no longer hold primary key + page number pairs, but c2 column + page number pairs.

So if we now want to find records by the value of column c2, we can use the B+ tree we just built. Take the search for records whose c2 value is 4 as an example; the search proceeds as follows:

  1. Determine the directory entry record page

    From the root page, page 44, we can quickly locate the directory entry record page as page 42 (because 2 < 4 < 9).

  2. Use the directory entry record page to determine the page where the user records actually reside.

    In page 42 we can quickly locate the pages that actually store the user records. However, because column c2 has no unique constraint, records with c2 = 4 may be spread over multiple data pages. Since 2 < 4 <= 4, the pages that actually store the user records are page 34 and page 35.

  3. Locate the specific records in the pages that actually store user records.

    Locate the specific records in page 34 and page 35.

  4. However, the leaf-node records of this B+ tree store only the two columns c2 and c1 (the primary key), so we have to use the primary key value to look up the complete user record in the clustered index.

Concept: going back to the table

From the B+ tree sorted by column c2 we can only determine the primary key value of the record we are looking for. So to get the complete user record for a c2 value, we still have to look it up again in the clustered index. This process is called going back to the table. In other words, fetching a complete user record by the value of column c2 requires 2 B+ trees!

Question: Why do we need the extra back-to-table operation? Wouldn't it be fine to put the complete user record directly in the leaf nodes?

Answer:

If the complete user record were stored in the leaf nodes, there would be no need to go back to the table. However, that takes too much space: it would amount to copying all user records every time a B+ tree is built, which wastes far too much storage.

Because a B+ tree built on a non-primary-key column needs a back-to-table operation to locate the complete user record, this kind of B+ tree is also called a secondary index or auxiliary index. Since we use the value of column c2 as the sorting rule of the B+ tree, we also say that this B+ tree is the index created for column c2.

The existence of a non-clustered index does not affect how data is organized in the clustered index, so a table can have multiple non-clustered indexes.
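The two-tree lookup can be sketched with the four rows inserted earlier. A Python dict stands in for the clustered index and a sorted list of (c2, c1) pairs stands in for the leaf level of the secondary index; both stand-ins and the `find_by_c2` helper are assumptions made for illustration, not InnoDB structures.

```python
from bisect import bisect_left

# Clustered index stand-in: primary key c1 -> complete row (c1, c2, c3).
clustered = {1: (1, 4, "u"), 3: (3, 9, "d"), 4: (4, 4, "a"), 5: (5, 3, "y")}

# Secondary index on c2: leaf entries hold only (c2, c1), sorted by c2 and then c1.
secondary_c2 = sorted((row[1], row[0]) for row in clustered.values())


def find_by_c2(value):
    """Find primary keys in the secondary index, then go back to the table for each."""
    rows = []
    i = bisect_left(secondary_c2, (value,))      # first entry with c2 >= value
    while i < len(secondary_c2) and secondary_c2[i][0] == value:
        pk = secondary_c2[i][1]
        rows.append(clustered[pk])               # the back-to-table lookup
        i += 1
    return rows


print(find_by_c2(4))
```

Searching for c2 = 4 first yields the primary keys 1 and 4 from the secondary index, and only the second lookup in the clustered index returns the complete rows, including c3.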

image-20230305171431769

Summary: The principles of clustered index and non-clustered index are different, and there are some differences in use:

  1. The leaf nodes of the clustered index store the data records themselves, while the leaf nodes of a non-clustered index store the location of the data. A non-clustered index does not affect the physical storage order of the table.
  2. A table can have only one clustered index, because there can be only one sorted storage order, but it can have multiple non-clustered indexes, that is, multiple index directories for data retrieval.
  3. With a clustered index, queries are efficient, but inserts, deletes, and updates are less efficient than with a non-clustered index.

3. Joint index

We can also use the values of multiple columns as the sorting rule at the same time, that is, create one index on multiple columns. For example, suppose we want the B+ tree to be sorted by the values of columns c2 and c3. This has two implications:

  • First sort the records and pages by column c2.
  • When the c2 values of records are equal, sort by column c3.

A schematic diagram of the indexes created for columns c2 and c3 is as follows:

image-20230305173831054

As shown in the figure, we need to pay attention to the following points:

  • Each directory entry record consists of three parts: c2, c3, and the page number. The records are sorted by column c2 first; if the c2 values are equal, they are sorted by column c3.
  • The user records in the leaf nodes of the B+ tree consist of c2, c3, and the primary key column c1.

Note that a B+ tree sorted by the values of columns c2 and c3 is called a joint index, and it is essentially also a secondary index. It is not the same as creating separate indexes on columns c2 and c3. The differences are:

  • Creating a joint index builds only one B+ tree, as shown in the figure above.
  • Creating separate indexes on c2 and c3 builds two B+ trees, one sorted by column c2 and one sorted by column c3.
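The sorting rule of a joint index matches tuple comparison, which a short sketch can show. It reuses the four rows inserted earlier and models the leaf level as a sorted list of (c2, c3, c1) tuples; this list and the `find` helper are illustrative assumptions.

```python
from bisect import bisect_left

# The four rows inserted earlier, as (c1, c2, c3).
rows = [(1, 4, "u"), (3, 9, "d"), (5, 3, "y"), (4, 4, "a")]

# Joint index on (c2, c3): leaf entries are (c2, c3, c1).
# Tuples compare by c2 first, then by c3 when the c2 values are equal.
joint = sorted((c2, c3, c1) for c1, c2, c3 in rows)


def find(c2, c3):
    """Return the primary keys of all entries matching both c2 and c3."""
    i = bisect_left(joint, (c2, c3))             # first entry with (c2, c3) >= target
    pks = []
    while i < len(joint) and joint[i][:2] == (c2, c3):
        pks.append(joint[i][2])
        i += 1
    return pks


print(joint)
print(find(4, "u"))
```

The two rows with c2 = 4 end up adjacent and ordered by c3 ('a' before 'u'), all inside a single sorted structure, whereas separate indexes on c2 and c3 would need two such structures.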

3.4 Precautions for InnoDB's B+ tree index

1. The location of the root page never changes

When the B+ tree index was introduced earlier, to make it easier to understand, we first drew the leaf nodes that store user records and then the internal nodes that store directory entry records. In fact, a B+ tree is formed as follows:

  • Whenever a B+ tree index is created for a table (the clustered index is not created manually; it exists by default), a root node page is created for the index. When the table has no data, the root node of each B+ tree index contains neither user records nor directory entry records.
  • When user records are then inserted into the table, they are first stored in this root node.
  • When the free space in the root node is used up and more records are inserted, all the records in the root node are copied to a newly allocated page, say page a, and a page split is then performed on that new page to obtain another new page, say page b. The newly inserted record is placed in page a or page b according to its key value (the primary key value in the clustered index, or the value of the indexed column in a secondary index), and the root node is upgraded to a page that stores directory entry records.

Pay special attention to one point in this process: the root node of a B+ tree index never moves once it is created. So whenever we create an index on a table, the page number of its root node is recorded somewhere, and whenever the InnoDB storage engine needs to use the index, it fetches the root node's page number from that fixed place to access the index.
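The first split of a root page can be sketched as below. It uses the earlier toy capacity of 3 records per page and covers only the first split. The root page number 1, the allocator starting at 100, and the split point in the middle are all assumptions for this example rather than InnoDB's behaviour.

```python
ROOT_NO = 1          # hypothetical root page number, fixed when the index is created
CAPACITY = 3         # toy capacity: at most 3 records per page

pages = {ROOT_NO: {"kind": "leaf", "items": []}}
next_free = [100]    # hypothetical page allocator state


def alloc():
    """Hand out a new page number."""
    no = next_free[0]
    next_free[0] += 1
    return no


def insert_into_root(key):
    """Insert while the tree is still a single root page; split on overflow."""
    root = pages[ROOT_NO]
    assert root["kind"] == "leaf", "this sketch only covers the first split"
    keys = sorted(root["items"] + [key])
    if len(keys) <= CAPACITY:
        root["items"] = keys
        return
    # Copy the records out of the root into page a, then split off page b.
    page_a, page_b = alloc(), alloc()
    mid = len(keys) // 2
    pages[page_a] = {"kind": "leaf", "items": keys[:mid]}
    pages[page_b] = {"kind": "leaf", "items": keys[mid:]}
    # The root keeps its page number and now stores directory entries (min key, page_no).
    root["kind"] = "internal"
    root["items"] = [(keys[0], page_a), (keys[mid], page_b)]


for k in (1, 3, 5, 4):
    insert_into_root(k)
print(ROOT_NO, pages[ROOT_NO])
```

After the fourth insert the user records live in two new pages, while page 1 is still the root and now holds only directory entries, so anything that recorded the root's page number stays valid.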

2. Uniqueness of directory entry records in internal nodes

We know that the contents of the directory entry records in the internal nodes of the B+ tree index are 索引列 + 页号the best collocation, but this collocation is a bit imprecise for the secondary index. Also take index_demothe table as an example, assuming that the data in this table is like this:

c1    c2    c3
1     1     'u'
3     1     'd'
5     1     'y'
7     1     'a'

If a directory entry record in the secondary index contained only indexed column + page number, the B+ tree built after indexing column c2 would look like this:

image-20230305180013783

Suppose we want to insert a new row whose c1, c2, and c3 values are 9, 1, and 'c' respectively. Updating the B+ tree of the secondary index on column c2 then runs into a big problem: the directory entry records stored in page 3 consist of c2 column + page number, both directory entry records in page 3 have the c2 value 1, and the newly inserted record also has the c2 value 1. So should the new record go into page 4 or page 5? The answer: sorry, there is no way to tell.

For a newly inserted record to find the page it belongs to, the directory entry records of the nodes in the same layer of the B+ tree must be unique apart from the page number field. Therefore, a directory entry record in an internal node of a secondary index actually consists of three parts:

  • the value of the indexed column
  • primary key value
  • page number

That is, the primary key value is also added to the directory entry records of the secondary index's internal nodes. This guarantees that every directory entry record in each layer of the B+ tree is unique apart from the page number field, so the secondary index built on column c2 should actually look like this:

image-20230305183742808

Now when we insert the record (9, 1, 'c'), the directory entry records in page 3 consist of c2 column + primary key + page number. We first compare the new record's c2 value with the c2 value of each directory entry record in page 3. If the c2 values are equal, we go on to compare the primary key values. Because the c2 column + primary key values of different directory entry records in the same layer of the B+ tree are guaranteed to differ, we can always locate exactly one directory entry record. In this example, the new record is finally determined to belong in page 5.
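The two-step comparison can be sketched as a tuple comparison. The figure is not reproduced here, so the concrete directory entries below are an assumption: each entry is modeled as (c2 value, primary key, page number) and describes the smallest record of the page it points to.

```python
from bisect import bisect_right

# Hypothetical directory entries in page 3: (c2 value, primary key, page number).
directory = [(1, 1, 4), (1, 5, 5)]

def locate_page(entries, c2, pk):
    """Return the page number whose key range covers (c2, pk)."""
    keys = [(e[0], e[1]) for e in entries]    # compare c2 first, then the primary key
    pos = bisect_right(keys, (c2, pk)) - 1    # last entry whose key is <= (c2, pk)
    return entries[max(pos, 0)][2]

print(locate_page(directory, 1, 9))   # the new record (9, 1, 'c') goes to page 5
```

Without the primary key in the entries, both keys would be just 1 and the comparison could not separate page 4 from page 5.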

3. A page stores at least 2 records

A B+ tree needs only a few levels to store hundreds of millions of records, and lookups are fast. This is because a B+ tree is essentially a large multi-level directory: each directory passed filters out many irrelevant subdirectories, until the directory holding the real data is reached. What happens if a directory holds only one subdirectory? The directory hierarchy becomes extremely deep, and the last directory, the one holding real data, can store only one record. So much effort for just one real user record? For this reason, an InnoDB data page must be able to hold at least two records.

4. Index scheme in MyISAM

4.1 Storage engines that support the B-Tree index are shown in the table:

Index / storage engine    MyISAM       InnoDB       Memory
B-Tree index              supported    supported    supported

Even when several storage engines support the same type of index, their implementations differ. The default index of InnoDB and MyISAM is the B-Tree index, while the default index of Memory is the Hash index.

The MyISAM engine uses a B+Tree as its index structure, and the data field of a leaf node stores the address of the data record.

4.2 The principle of MyISAM index

The figure below is a schematic diagram of a MyISAM index.

We know that in InnoDB the index is the data: the leaf nodes of the clustered index's B+ tree already contain all complete user records. The MyISAM index scheme also uses a tree structure, but it stores the index and the data separately:

  • The records of the table are stored, in insertion order, in a separate file called the data file. This file is not divided into data pages; however many records there are, they are simply appended to it. Because inserted data is not deliberately sorted by primary key, binary search cannot be used on this data.
  • A table using the MyISAM storage engine stores its index information in another file, called the index file. MyISAM creates a separate index for the table's primary key, but the leaf nodes of that index store not complete user records but the combination primary key value + data record address.

The MyISAM index file stores only the addresses of data records. In MyISAM there is no structural difference between the primary key index and a secondary index (secondary key), except that the primary key index requires unique keys while the keys of a secondary index may repeat. If we build a secondary index on Col 2, the structure of this index is as follows:

image-20230305223148564

It is also a B+Tree, and the data field saves the address of the data record. Therefore, the index retrieval algorithm in MyISAM is: first search the index according to the B+Tree search algorithm, if the specified key exists, take out the value of the data field, and then use the value of the data field as the address to read the corresponding data record.
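The retrieval algorithm can be sketched as follows. This is a simplification under stated assumptions: a sorted list of (key, address) pairs stands in for the B+Tree, a Python list stands in for the data file, and the rows are made up for illustration.

```python
from bisect import bisect_left

# Data "file": records appended in insertion order, not sorted by key (hypothetical rows).
data_file = [(15, "row-15"), (3, "row-3"), (9, "row-9")]

# Index "file": sorted (key, address) pairs; the address is a position in data_file.
index_file = sorted((key, addr) for addr, (key, _) in enumerate(data_file))

def lookup(key):
    keys = [k for k, _ in index_file]
    pos = bisect_left(keys, key)           # step 1: search the index
    if pos == len(keys) or keys[pos] != key:
        return None                        # key is not in the index
    addr = index_file[pos][1]              # step 2: take the address from the data field
    return data_file[addr]                 # step 3: read the record at that address

print(lookup(9))   # (9, 'row-9')
```

Every lookup takes the extra step from the index to the data file, which is why all MyISAM indexes behave like secondary indexes.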

4.3 Comparison between MyISAM and InnoDB

The indexes of MyISAM are all "non-clustered", unlike InnoDB, which contains one clustered index.

To summarize the differences between indexes in the two engines:

① In the InnoDB storage engine, a single search of the clustered index by primary key value is enough to find the corresponding record, whereas in MyISAM a table lookup (going back to the data file) is always required, which means that every index built in MyISAM is effectively a secondary index.

② InnoDB's data file is itself an index file, while MyISAM's index file and data file are separate; the index file stores only the addresses of data records.

③ In InnoDB, the data field of a non-clustered index stores the primary key value of the corresponding record, whereas a MyISAM index stores the address. In other words, all non-clustered indexes of InnoDB use the primary key as their data field.

④ MyISAM's table lookup is very fast, because it takes the address offset and fetches the data directly from the file. InnoDB, by contrast, obtains the primary key and then searches the clustered index for the record. Although this is not slow, it is still slower than accessing the record directly by address.

⑤ InnoDB requires a table to have a primary key (a MyISAM table may have none). If none is specified explicitly, MySQL automatically selects a non-null column that uniquely identifies each record as the primary key. If no such column exists, MySQL automatically generates an implicit field for the InnoDB table to serve as the primary key; this field is 6 bytes long and of a long integer type.

Summary

Knowing how different storage engines implement indexes is very helpful for using and optimizing indexes correctly. For example:

Example 1: Once you know how InnoDB implements indexes, it is easy to understand why using an overly long field as the primary key is not recommended: all secondary indexes refer to the primary key, so an overly long primary key makes every secondary index too large.

Example 2: Using a non-monotonic field as the primary key in InnoDB is not a good idea. The InnoDB data file is itself a B+Tree, and a non-monotonic primary key forces the data file to split and adjust frequently to maintain the B+Tree properties when new records are inserted, which is very inefficient. Using an auto-increment field as the primary key is a good choice.

image-20230305225206306

5. The cost of indexing

Indexes are useful, but they should not be created indiscriminately, because they cost both space and time:

  • Space cost

    Every time an index is created, a B+ tree must be built for it. Each node of each B+ tree is a data page, and a page occupies 16KB of storage space by default. A large B+ tree consists of many data pages, which adds up to a large amount of storage space.

  • Time cost

    Every time the data in the table is inserted, deleted, or updated, every B+ tree index must be modified. As mentioned earlier, the nodes in each layer of a B+ tree are sorted by index column value in ascending order and form a doubly linked list. The records inside a node, whether leaf or internal (that is, whether user records or directory entry records), form a singly linked list in ascending order of index column value. Insert, delete, and update operations may break the ordering of nodes and records, so the storage engine needs extra time for operations such as record shifting, page splits, and page reclamation to maintain that ordering. If many indexes are built, the B+ tree of each one must perform this maintenance, which degrades performance.

The more indexes a table has, the more storage space they occupy and the worse the performance of inserts, deletes, and updates. To build few but effective indexes, we need to learn the conditions under which indexes are actually used.

6. Rationality of MySQL data structure selection

From MySQL's perspective, one practical problem that cannot be ignored is disk I/O. If the index data structure keeps the number of disk I/O operations as low as possible, the time consumed is smaller. It can be said that the number of disk I/O operations is crucial to the efficiency of index usage.

Lookups go through the index. Generally the index is very large, especially in relational databases; when the data volume is large, the index may be several gigabytes or more. To reduce the memory the index occupies, the database index is stored on external disk. When we query through the index, the whole index cannot be loaded into memory; its pages can only be loaded one by one, so the measure of query efficiency for MySQL is the number of disk I/Os.

6.1 Full table traversal

6.2 Hash structure

Hash itself is a function, also known as a hash function, which can greatly improve the efficiency of retrieving data.

A Hash algorithm converts an input into an output through a fixed algorithm (such as MD5, SHA1, SHA2, SHA3). The same input always yields the same output; if the input differs even slightly, the output will usually be different.

Example: to verify whether two files are the same, you do not need to compare the files directly. Ask the other party for the result computed by the Hash function, apply the same Hash function to the file locally, and compare the two results to know whether the two files are the same.
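A minimal sketch of this comparison, using SHA-256 from Python's standard hashlib module. The byte strings stand in for file contents and are made up for illustration.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

local_file = b"index notes, version 1"            # hypothetical local content
remote_digest = digest(b"index notes, version 1") # digest reported by the other party

print(digest(local_file) == remote_digest)                  # same content, same digest
print(digest(b"index notes, version 2") == remote_digest)   # a small change alters the digest
```

Only the short digests travel between the two parties, not the files themselves.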

There are two common types of data structures that speed up lookups:

(1) Trees, such as a balanced binary search tree, where the average time complexity of query/insert/modify/delete is O(log2N);

(2) Hashing, such as HashMap, the average time complexity of query/insert/modify/delete is O(1);

image-20230305231250478

Using a Hash for lookups is very efficient: the data can basically be found in a single lookup, whereas a B+ tree must be searched from top to bottom, visiting several nodes and performing multiple I/O operations along the way. So in terms of efficiency, Hash is faster than the B+ tree.

With hashing, an element with key k is stored in slot h(k); that is, the hash function h computes the slot position from the key k. The function h maps the key space onto the slots [0...m-1] of the hash table.

image-20230305231726884

The hash function h in the figure above may map two different keys to the same position, which is called a collision; databases generally resolve it with the chaining method. In chaining, the elements hashed to the same slot are placed in a linked list, as shown in the following figure:

image-20230305231919501
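Chaining can be sketched in a few lines. The class and method names are illustrative, a Python list stands in for each linked list, and Python's built-in hash() plays the role of h.

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.slots = [[] for _ in range(m)]    # m slots, each holding a chain

    def _h(self, key):
        return hash(key) % len(self.slots)     # maps a key to a slot in [0, m-1]

    def put(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                       # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))             # collision or empty slot: append to chain

    def get(self, key):
        for k, v in self.slots[self._h(key)]:  # walk the chain of that slot
            if k == key:
                return v
        return None

table = ChainedHashTable(m=4)
table.put(1, "row-1")
table.put(5, "row-5")    # 1 and 5 land in the same slot when m = 4
print(table.get(5))
```

The longer a chain grows, the more comparisons a lookup needs, which matters for columns with many duplicate or colliding values.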

The Hash structure is efficient, so why should the index structure be designed as a tree?

Reason 1: A Hash index only serves (=), (<>), and IN queries. For a range query, the time complexity of a Hash index degenerates to O(n), whereas the ordered nature of a tree still maintains the efficiency of O(log2N).

Reason 2: A Hash index has another drawback: the data is stored without order, so for ORDER BY, data retrieved through a Hash index must be sorted again.

Reason 3: For a composite (joint) index, the Hash value is computed over the combined index keys, so a query on just one or a few of the index keys cannot use it.

Reason 4: For equality queries a Hash index is usually more efficient, but there is one exception: if the indexed column has many duplicate values, efficiency drops. On a Hash collision, the row pointers in the bucket must be traversed and compared to find the queried key, which is very time-consuming. Therefore, Hash indexes are usually not used on columns with many repeated values, such as gender and age.
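Reason 1 can be illustrated with a small comparison. A dict stands in for the Hash index and a sorted list for the ordered leaves of a tree; the keys and function names are illustrative.

```python
from bisect import bisect_left, bisect_right

keys = [5, 22, 23, 34, 77, 89, 91]

hash_index = {k: f"row-{k}" for k in keys}   # supports exact match only
sorted_index = sorted(keys)                  # ordered, like the leaves of a tree

def range_via_hash(lo, hi):
    # There is no order to exploit, so every key must be examined: O(n).
    return sorted(k for k in hash_index if lo <= k <= hi)

def range_via_sorted(lo, hi):
    # Binary search for both ends, then take the slice in between.
    start = bisect_left(sorted_index, lo)
    end = bisect_right(sorted_index, hi)
    return sorted_index[start:end]

print(range_via_hash(20, 40))     # [22, 23, 34]
print(range_via_sorted(20, 40))   # [22, 23, 34]
```

Both calls return the same rows, but the first one touches all seven keys while the second one locates the two boundaries and reads only what lies between them.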

Storage engines that support the Hash index are shown in the table:

Index / storage engine    MyISAM           InnoDB           Memory
HASH index                not supported    not supported    supported

Applicability of Hash index:

Hash indexes have many limitations, and B+ tree indexes are far more widely used in databases. However, in some scenarios a Hash index is more efficient, for example in key-value databases: the core of Redis storage is a Hash table.

The Memory storage engine in MySQL supports Hash indexes. If we need a temporary table for a query, we can choose the Memory storage engine and put a Hash index on a field. For a string field, for example, the Hash computation shortens the length to a few bytes. When the field has few duplicate values and equality queries are frequent, a Hash index is a good choice.

In addition, InnoDB itself does not support Hash indexes, but it provides the Adaptive Hash Index. When is the adaptive Hash index used? If some data is accessed frequently and certain conditions are met, the address of its data page is stored in a Hash table, so the next query can find the page directly. In this way, the B+ tree also gains the advantages of a Hash index.

image-20230305234358153

The adaptive Hash index exists to locate leaf nodes faster based on the SQL query conditions. Especially when the B+ tree is relatively deep, it can significantly improve the efficiency of data retrieval.

We can use the innodb_adaptive_hash_index variable to check whether the adaptive Hash index is enabled, for example:

show variables like 'innodb_adaptive_hash_index';

image-20230305234654287

6.3 Binary Search Tree

If we use a binary tree as an index structure, the number of disk IOs is related to the height of the index tree.

1. The characteristics of the binary search tree

  • A node can have at most two child nodes, that is, the degree of a node cannot exceed 2
  • Left child node < this node; right child node >= this node. Larger values go to the right, smaller values go to the left

2. Search rules

Let's first look at the most basic binary search tree (Binary Search Tree). Searching for a node works the same way as inserting one. Assume the value to be searched for or inserted is key:

  1. If the key is greater than the root node, search in the right subtree;
  2. If the key is smaller than the root node, search in the left subtree;
  3. If the key is equal to the root node, the node has been found; just return the root node.

For example, the binary search tree created from the number sequence (34, 22, 89, 5, 23, 77, 91) is shown in the following figure:

image-20230311181245933

But there are special cases in which the depth of the binary tree becomes very large. For example, if the data is given in the order (5, 22, 23, 34, 77, 89, 91), the resulting binary search tree is shown in the figure below:

image-20230311181508170

The second tree above is also a binary search tree, but it has degenerated into a linked list, and the time complexity of finding data has become O(n). The depth of the first tree is 3, so finding a node takes at most 3 comparisons, while the depth of the second tree is 7, so finding a node takes at most 7 comparisons.
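The two depths can be checked with a short sketch. Nodes are modeled as [key, left, right] lists, and the function names are illustrative.

```python
def insert(node, key):
    """Insert key into a BST made of [key, left, right] lists; return the root."""
    if node is None:
        return [key, None, None]
    if key < node[0]:
        node[1] = insert(node[1], key)   # smaller keys go left
    else:
        node[2] = insert(node[2], key)   # larger or equal keys go right
    return node

def height(node):
    """Number of levels in the tree rooted at node."""
    return 0 if node is None else 1 + max(height(node[1]), height(node[2]))

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

print(height(build([34, 22, 89, 5, 23, 77, 91])))   # 3
print(height(build([5, 22, 23, 34, 77, 89, 91])))   # 7
```

The same seven keys give a tree of depth 3 or depth 7 depending only on the insertion order.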

To improve query efficiency, we must reduce the number of disk I/Os. To reduce the number of disk I/Os, we need to lower the height of the tree as much as possible, turning the original "thin and tall" tree into a "short and fat" one: the more branches each layer has, the better.

6.4 AVL tree

To solve the problem of the binary search tree degenerating into a linked list, the balanced binary search tree (Balanced Binary Tree) was proposed, also known as the AVL tree (not to be confused with the AVL algorithm). It adds constraints to the binary search tree and has the following property:

It is an empty tree or the absolute value of the height difference between its left and right subtrees does not exceed 1, and both left and right subtrees are a balanced binary tree.

There are many kinds of balanced binary trees, including the balanced binary search tree, red-black tree, treap, and splay tree. The balanced binary search tree was the first self-balancing binary search tree proposed. When we mention a balanced binary tree, we generally mean the balanced binary search tree. In fact, the first tree above is a balanced binary search tree, and its search time complexity is O(log2n).

The time of a data query depends mainly on the number of disk I/Os. If we use a binary tree, even one improved into a balanced binary search tree, the depth of the tree is O(log2n). When n is large, the depth is also large, as in the figure below:

image-20230311233710010

Each node visit requires one disk I/O operation; for the tree above, we need 5 I/O operations. Although a balanced binary tree is efficient, its depth is still large, which means many disk I/Os and hurts the overall efficiency of data queries.

For the same data, what if we change the binary tree into an M-way tree (M > 2)? When M = 3, the same 31 nodes can be stored by the following ternary tree:

image-20230311234257778

You can see that the height of the tree is now lower. When the amount of data N is large and the branching factor M is large, the height of an M-way tree (M > 2) is much smaller than that of a binary tree. So we need to turn the tree from "thin and tall" into "short and fat".
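The effect of the branching factor on height can be checked with a small calculation. The function name is illustrative, and it assumes every level is filled completely before the next one starts.

```python
def min_height(n, m):
    """Smallest number of levels of an m-way tree that can hold n nodes."""
    levels, capacity, width = 0, 0, 1
    while capacity < n:
        capacity += width   # a full level adds `width` nodes
        width *= m          # the next level is m times wider
        levels += 1
    return levels

print(min_height(31, 2))    # 5 levels for a binary tree
print(min_height(31, 3))    # 4 levels for a ternary tree
print(min_height(31, 100))  # 2 levels when each node has up to 100 children
```

For 31 nodes the gain is one level, but the gap widens quickly as N and M grow.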

6.5 B-Tree

B-tree stands for Balance Tree, that is, a multi-way balanced search tree. It is abbreviated B-Tree (note that the hyphen joins the two words; it is not a minus sign). Its height is much smaller than that of a balanced binary tree.

The structure of the B-tree is shown in the figure below:

image-20230311235901135

A B-tree is a multi-way balanced search tree in which each node can contain at most M child nodes; M is called the order of the B-tree. Each disk block contains keywords and pointers to child nodes. If a disk block contains x keywords, it contains x + 1 pointers. A B-tree of order 100 with 3 levels can store up to about 1 million index entries. For large amounts of index data the B-tree structure is very suitable, because its height is much smaller than that of a binary tree.
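The "about 1 million" figure can be reproduced with a quick calculation, assuming every node is completely full (order - 1 keywords and order children per node). The function name is illustrative.

```python
def max_keys(order, levels):
    """Maximum number of keywords in a B-tree of the given order and number of levels."""
    nodes_per_level = [order ** i for i in range(levels)]   # 1, order, order^2, ...
    return sum(n * (order - 1) for n in nodes_per_level)    # each node holds order - 1 keywords

print(max_keys(100, 3))   # 999999, i.e. about 1 million
```

With order 100, the three levels hold 1, 100, and 10,000 nodes, each with 99 keywords.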

An M-order B-tree (M > 2) has the following characteristics:

  1. The range of the number of children of the root node is [2,M].
  2. Each intermediate node contains k - 1 keywords and k children (number of children = number of keywords + 1), where k ranges over [ceil(M/2), M].
  3. A leaf node contains k - 1 keywords (a leaf node has no children), where k ranges over [ceil(M/2), M].
  4. Suppose the keywords of an intermediate node are Key[1], Key[2], ..., Key[k-1], sorted in ascending order, that is, Key[i] < Key[i+1]. The k - 1 keywords then divide the key space into k ranges, corresponding to k pointers P[1], P[2], ..., P[k]: P[1] points to the subtree whose keywords are less than Key[1], P[i] points to the subtree whose keywords lie in (Key[i-1], Key[i]), and P[k] points to the subtree whose keywords are greater than Key[k-1].
  5. All leaf nodes are at the same level.

The B-tree shown in the figure above is a B-tree of order 3. Look at disk block 2: its keywords are (8,12) and it has 3 children, (3,5), (9,10), and (13,15). You can see that (3,5) is less than 8, (9,10) lies between 8 and 12, and (13,15) is greater than 12, which matches the characteristics just listed.

Now let's look at how to search with a B-tree. Suppose the keyword we want to find is 9. The search proceeds as follows:

  1. We compare with the keywords (17,35) of the root node; since 9 is less than 17, we get the pointer P1;
  2. Find the disk block 2 according to the pointer P1, the key is (8,12), because 9 is between 8 and 12, so we get the pointer P2;
  3. Find the disk block 6 according to the pointer P2, the key is (9,10), and then we find the key 9.
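The three steps can be traced with a minimal sketch. Only the subtree described in the text is built; the other two children of the root are not given in the text, so they are left as None here, and the class and function names are illustrative.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty for leaf nodes

def search(node, key):
    """Return (found, number of nodes visited); one visit models one disk I/O."""
    visited = 0
    while node is not None:
        visited += 1
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1                                   # find the first keyword >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        node = node.children[i] if node.children else None
    return False, visited

block2 = Node([8, 12], [Node([3, 5]), Node([9, 10]), Node([13, 15])])
root = Node([17, 35], [block2, None, None])   # P2 and P3 of the root are not modeled

print(search(root, 9))    # (True, 3): root, disk block 2, disk block 6
```

The comparisons inside each node happen in memory; the count that matters is the three node visits.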

You can see that many comparisons are made during a B-tree search, but since the data is read out and compared in memory, that time is negligible. Reading a disk block, however, requires an I/O operation, which takes far longer than an in-memory comparison and is the dominant factor in search time. Compared with a balanced binary tree, a B-tree needs fewer disk I/O operations and is therefore more efficient for data queries. So as long as the tree is low enough and the number of I/Os is small enough, query performance improves.

Summary:

  1. If inserting or deleting nodes makes the B-tree unbalanced, the tree keeps itself balanced by automatically adjusting the positions of nodes.
  2. The set of keywords is distributed throughout the tree, that is, both leaf nodes and non-leaf nodes store data. A search may end at a non-leaf node.
  3. Its search performance is equivalent to a binary search over the full set of keywords.

Another example:

image-20230312003539203

6.6 B+ tree

The B+ tree is also a multi-way search tree, an improvement based on the B-tree. Mainstream DBMSs, MySQL among them, support B+ tree indexes. Compared with the B-Tree, the B+Tree is better suited to file indexing systems.

  • MySQL official website description

image-20230312004129925

The difference between B+ tree and B tree lies in the following points:

  1. If there are k child nodes, there are k keywords. That is, the number of children = the number of keywords, and in the B tree, the number of children = the number of keywords + 1.
  2. The keywords of non-leaf nodes also exist in child nodes at the same time, and are the maximum (or minimum) of all keywords in child nodes.
  3. Non-leaf nodes are used only for indexing and do not store data records; all record-related information is placed in leaf nodes. In a B-tree, by contrast, non-leaf nodes store both index entries and data records.
  4. All keywords appear in the leaf nodes, and the leaf nodes form an ordered linked list, and the leaf nodes themselves are linked in ascending order according to the size of the keywords.

The figure below shows a B+ tree of order 3. The keywords 1, 18, and 35 in the root node are the smallest keywords of the child nodes (1,8,14), (18,24,31), and (35,41,53) respectively. The keywords of each parent node also appear among the keywords of its child nodes in the next layer, so the leaf nodes contain all keyword information, and each leaf node has a pointer to the next leaf node, forming a linked list.

The whole lookup takes 3 I/O operations in total. The query processes of the B+ tree and the B-tree look similar, but the fundamental difference is that the intermediate nodes of a B+ tree do not directly store data. What are the benefits?

First, B+ tree query efficiency is more stable. In a B+ tree the data is found only when a leaf node is reached, whereas in a B-tree non-leaf nodes also store data, which makes query efficiency unstable: sometimes the keyword is found in a non-leaf node, and sometimes the search has to go all the way to a leaf node.

Second, B+ tree query efficiency is higher. A B+ tree is usually shorter and fatter than a B-tree (larger order, lower depth), so a query needs fewer disk I/Os. With the same disk page size, a B+ tree node can hold more keywords.

Not only for single-keyword queries but also for range queries, the B+ tree is more efficient than the B-tree. All keywords appear in the leaf nodes of the B+ tree, the leaf nodes are connected by pointers, and the data is in ascending order, so a range can be scanned by following the pointers. In a B-tree, a range query has to be completed through an in-order traversal, which is much less efficient.
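A range scan over linked leaves can be sketched as follows. This is a two-level simplification: the leaf contents and root keywords are the ones listed above, while the class and function names are illustrative.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # pointer to the next leaf node

leaves = [Leaf([1, 8, 14]), Leaf([18, 24, 31]), Leaf([35, 41, 53])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right             # link the leaves into an ordered list

root_keys = [1, 18, 35]           # smallest keyword of each child

def range_scan(lo, hi):
    # Descend once: pick the last child whose smallest keyword is <= lo.
    pos = max(bisect_right(root_keys, lo) - 1, 0)
    leaf, result = leaves[pos], []
    # Then follow the leaf pointers until a keyword exceeds hi.
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return result
            if k >= lo:
                result.append(k)
        leaf = leaf.next
    return result

print(range_scan(10, 36))   # [14, 18, 24, 31, 35]
```

After the single descent from the root, the scan never goes back up the tree; it only moves sideways along the leaf list.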

Both B-tree and B+ tree can be used as the data structure of the index, and B+ tree is used in MySQL.

However, B-tree and B+ tree have their own application scenarios. It cannot be said that B+ tree is completely better than B-tree, and vice versa.

Thinking question: In order to reduce IO, will the index tree be loaded at one time?

1. The database index is stored on the disk. If the amount of data is large, the size of the index will inevitably be large, exceeding several gigabytes.

2. When we use the index to query, it is impossible to load all the indexes of several G into the memory. What we can do is: load each disk page one by one, because the disk page corresponds to the node of the index tree.

Thinking question: What is the storage capacity of the B+ tree? Why do you say that the general search for row records only needs 1 to 3 disk IOs at most?

The page size in the InnoDB storage engine is 16KB. The primary key of a typical table is INT (4 bytes) or BIGINT (8 bytes), and a pointer is generally 4 or 8 bytes, so one page (one node of the B+Tree) stores about 16KB/(8B + 8B) = 1K key values (since this is an estimate, K is taken as 10^3 for ease of calculation). That is, a B+Tree index of depth 3 can maintain 10^3 * 10^3 * 10^3 = 1 billion records. (Here it is assumed that a data page also stores 10^3 rows of record data.)

In practice nodes are not always full, so in a database the height of a B+Tree is generally 2 to 4 levels. MySQL's InnoDB storage engine is designed to keep the root node resident in memory, so finding the row record for a given key value takes at most 1 to 3 disk I/O operations.
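The estimate can be written out as a few lines of arithmetic. The variable names are illustrative, and the sizes are the ones assumed in the text (BIGINT key, 8-byte pointer).

```python
PAGE_SIZE = 16 * 1024    # InnoDB page size in bytes
KEY_SIZE = 8             # BIGINT primary key
POINTER_SIZE = 8         # pointer size assumed in the text

entries_per_page = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
print(entries_per_page)               # 1024, rounded to 10^3 in the text

approx_fanout = 10 ** 3
records_depth_3 = approx_fanout ** 3  # also assumes 10^3 rows per leaf page
print(records_depth_3)                # 1000000000, i.e. 1 billion
```

Real rows are usually much larger than 16 bytes, so a leaf page holds far fewer than 10^3 rows; the text's figure is an upper-bound style estimate.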

Thinking question: Why is the B+ tree more suitable for the file index and database index of the operating system in practical applications than the B-tree?

1. The disk read and write cost of B+ tree is lower

The internal nodes of a B+ tree hold no pointers to the actual record data of their keywords, so they are smaller than the internal nodes of a B-tree. If all the keywords of one internal node are stored in the same disk block, that disk block can hold more keywords, more of the keywords to be searched are read into memory in one read, and the number of I/O reads and writes drops accordingly.

2. B+ tree query efficiency is more stable

Non-leaf nodes do not ultimately point to the file content; they only index the keywords in the leaf nodes. So every keyword search must follow a path from the root node to a leaf node. All keyword queries have the same path length, which gives every lookup comparable query efficiency.

Thinking question: The difference between Hash index and B+ tree index

1. A Hash index cannot perform range queries, while a B+ tree can. This is because the data a Hash index points to is unordered, while the leaf nodes of a B+ tree form an ordered linked list.

2. A Hash index does not support the leftmost-prefix rule of a composite (joint) index (that is, a subset of the columns of a joint index cannot use it), while a B+ tree does. For a joint index, the Hash value is computed over the combined index keys rather than separately for each one. Therefore, a query that uses only one or a few columns of the joint index cannot use the index.

3. A Hash index does not support ORDER BY sorting, because the data it points to is unordered and cannot help with sort optimization, whereas B+ tree index data is ordered and can optimize ORDER BY on the indexed field. Likewise, a Hash index cannot serve fuzzy queries, whereas a B+ tree can optimize a LIKE query when the wildcard comes at the end (a pattern ending in %).

4. InnoDB does not support Hash indexes.

Thinking question: Are the Hash index and B+ tree index manually specified when building the index?

If you are using MySQL, you need to know which index structures each MySQL storage engine supports, as shown in the figure below (reference source: https://dev.mysql.com/doc/refman/8.0/en/create-index.html). For another DBMS, refer to that DBMS's documentation.

image-20230312144924411

You can see that the InnoDB and MyISAM storage engines use the B+ tree index by default and cannot use a Hash index. The adaptive Hash index provided by InnoDB does not need to be specified manually. With the Memory/Heap and NDB storage engines, a Hash index can be chosen.

6.7 R-trees

The R-Tree is rarely used in MySQL and supports only the geometry data type. The only storage engines that support this type are MyISAM, bdb, InnoDB, ndb, and archive. Here is an example of what an R-tree can solve in the real world: find all restaurants within 20 miles. Without an R-tree, how would you solve it? Typically we would split a restaurant's coordinates (x, y) into two fields in the database, one for longitude and one for latitude. We would then have to traverse all restaurants, fetch their locations, and compute whether each one meets the requirement. If an area has 100 restaurants, that means 100 location calculations. For a database as large as Google Maps or Baidu Maps, this approach is clearly not feasible. The R-tree solves this kind of high-dimensional spatial search problem well. It extends the idea of the B-tree to multi-dimensional space, uses the B-tree's way of partitioning space, and merges and splits nodes on insert and delete to keep the tree balanced. The R-tree is therefore a balanced tree for storing high-dimensional data. Compared with the B-Tree, the advantage of the R-Tree lies in range search.

Index / storage engine    MyISAM       InnoDB       Memory
R-Tree index              supported    supported    not supported

6.8 Summary

Using an index helps us quickly locate the data we want among a large amount of data, but indexes also have drawbacks, such as occupying storage space and reducing the performance of database write operations. If there are multiple indexes, the time spent choosing an index also increases. When we use an index, we need to balance its advantage (improved query efficiency) against its disadvantage (the cost of maintaining the index).

In actual work, we also need to decide whether to use an index based on the requirements and the distribution of the data itself. Although indexes are not a cure-all, not using an index when the data volume is large is unthinkable; after all, the essence of an index is to help us improve the efficiency of data retrieval.

Appendix: Time Complexity of Algorithms

The same problem can be solved by different algorithms, and the quality of an algorithm will affect the efficiency of the algorithm and even the program. The purpose of algorithm analysis is to select the appropriate algorithm and improve the algorithm.

Origin blog.csdn.net/weixin_43811294/article/details/129476125