MySQL indexes explained in detail, so you can hold your own with the interviewer

Preface

  • What is an index? What are its pros and cons? Asked in an interview, these can be tricky questions for beginners.

  • This article describes in detail what an index is, its advantages and disadvantages, the data structures behind it, and other common knowledge.

What is an index

  • An index is a data structure that stores the values of specific columns of a table in sorted order, so it is created on columns of the table. An index speeds up searches by reducing the number of records that must be examined. Without an index, the database has to perform a full table scan. An index is like the table of contents of a book: it lets you find the content faster.
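To make the contrast concrete, here is a minimal Python sketch, not MySQL's actual implementation. The `rows` table, the `age` column, and both function names are hypothetical, and a sorted list searched with `bisect` stands in for the index.

```python
import bisect

# Hypothetical table: rows of (id, age), stored in insertion order.
rows = [(1, 42), (2, 18), (3, 35), (4, 18), (5, 60), (6, 27)]

def full_scan(rows, age):
    """Without an index: examine every row and count how many were read."""
    examined = 0
    hits = []
    for row in rows:
        examined += 1
        if row[1] == age:
            hits.append(row)
    return hits, examined

# The "index": (age, row position) pairs kept sorted by age.
age_index = sorted((row[1], pos) for pos, row in enumerate(rows))

def index_lookup(rows, index, age):
    """With an index: binary-search the sorted entries, then fetch only matching rows."""
    i = bisect.bisect_left(index, (age, -1))
    hits = []
    while i < len(index) and index[i][0] == age:
        hits.append(rows[index[i][1]])
        i += 1
    return hits

print(full_scan(rows, 18))                 # ([(2, 18), (4, 18)], 6)
print(index_lookup(rows, age_index, 18))   # [(2, 18), (4, 18)]
```

The full scan reads all six rows, while the indexed lookup touches only the matching entries after a binary search.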

Advantages of indexing

  1. Creating a unique index guarantees the uniqueness of each row of data in the table.

  2. An index can greatly speed up data retrieval: it avoids a full table scan and greatly reduces the number of rows that must be examined. This is the main reason for creating an index.

  3. It can speed up joins between tables, which is especially useful for enforcing referential integrity.

  4. When a query uses grouping and sorting clauses, an index can significantly reduce the time spent on grouping and sorting.

  5. With indexes available, the query optimizer can choose better execution plans, which improves the performance of the system.

Disadvantages of indexes

  1. Creating and maintaining indexes takes time, and this time grows with the amount of data.

  2. Indexes occupy physical space. In addition to the space taken by the table data, each index takes a certain amount of physical space, and a clustered index requires even more.

  3. When rows in the table are inserted, deleted, or updated, the indexes must be maintained as well, which slows down data modification.
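The maintenance cost in point 3 can be illustrated with a small sketch. The `Table` class below is a hypothetical in-memory stand-in, not a MySQL structure. It keeps one sorted list per indexed column and counts how many index writes each insert causes.

```python
import bisect

class Table:
    """Hypothetical in-memory table with one sorted index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.index_writes = 0

    def insert(self, row):
        position = len(self.rows)
        self.rows.append(row)
        # Every index must be updated for every inserted row.
        for col, index in self.indexes.items():
            bisect.insort(index, (row[col], position))
            self.index_writes += 1

no_index = Table(indexed_columns=[])
three_indexes = Table(indexed_columns=[0, 1, 2])
for row in [(3, "c", 30), (1, "a", 10), (2, "b", 20)]:
    no_index.insert(row)
    three_indexes.insert(row)

print(no_index.index_writes, three_indexes.index_writes)  # 0 9
```

Three rows inserted into a table with three indexes cause nine extra index writes, which is the overhead the list above describes.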

On which columns to index

  1. On columns that are searched frequently, to speed up the search;

  2. On the primary key column, to enforce the uniqueness of the column and to organize the arrangement of the data in the table;

  3. On columns that are often used in joins, which are mainly foreign keys, to speed up the join;

  4. On columns that are often searched by range, because the index is sorted and the specified range is therefore contiguous;

  5. On columns that often need to be sorted, because the index is already sorted and the query can use the index order to reduce sorting time;

  6. On columns frequently used in WHERE clauses, to speed up the evaluation of conditions.

Which columns are not to be indexed?

  1. Indexes should not be created on columns that are rarely used or referenced in queries. Because these columns are rarely used, an index on them does not improve query speed. On the contrary, the extra index slows down maintenance of the system and increases the space requirements.

  2. Indexes should not be added to columns with very few distinct values, such as the gender column of a personnel table. With so few values, the rows in the result set make up a large proportion of the rows in the table, so most of the table has to be read anyway. Adding an index does not significantly speed up retrieval.

  3. Indexes should not be added to columns defined with the text, image, and bit data types, because the data in these columns is either very large or has very few distinct values.

  4. When modification performance matters far more than retrieval performance, an index should not be created. Modification performance and retrieval performance pull in opposite directions: adding indexes improves retrieval but slows modification, and removing indexes improves modification but slows retrieval.
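Point 2 above is often checked by estimating a column's selectivity, the ratio of distinct values to total rows. The sketch below is illustrative only. The `selectivity` helper and the two sample columns are hypothetical.

```python
def selectivity(values):
    """Fraction of distinct values: close to 1.0 is highly selective, close to 0 is not."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)

# Hypothetical column samples from a 1,000-row personnel table.
gender = ["M", "F"] * 500
employee_no = list(range(1000))

print(selectivity(gender))       # 0.002
print(selectivity(employee_no))  # 1.0
```

A lookup on `gender` still returns about half of the table, so an index on it saves little, whereas an index on `employee_no` narrows the search to a single row.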

Index data structure

Common index data structures are B+Tree indexes, Hash indexes, FullText indexes, and R-Tree indexes.

Hash index

1. Overview:

Among the common MySQL storage engines, only Memory supports explicit Hash indexes, and Hash is the default index type for Memory tables. (InnoDB maintains an adaptive hash index internally, but users cannot create it themselves.) A hash index organizes the index data by hash value, so retrieval is very efficient and an entry can be located in a single step, unlike a B-/+Tree index, which requires several I/O operations from the root node down to a leaf node.

2. Disadvantages of Hash index:

① A Hash index supports only equality queries, not range queries, because hash values do not preserve the ordering of the original values.

② A Hash index cannot be used for sorting, for the same reason: the order of the hash values says nothing about the order of the original values.

③ A Hash index cannot avoid reading the table data. When a hash collision occurs, comparing hash values is not enough, and the actual values must be compared to determine whether a row matches.

④ A Hash index is not necessarily faster than a B-Tree index when many hash values are identical. Collisions cause repeated reads of the table data and lower overall performance. A suitable hash algorithm mitigates this problem to some extent.

⑤ A Hash index cannot be queried with only part of the index key. For a composite index, the hash value is computed over the combined values of all indexed columns, so a hash of a single column's value is meaningless.
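Several of these limitations can be seen in a toy model. The sketch below is a simplified illustration, not how the Memory engine is implemented. The `rows` data and the helper names are hypothetical, and Python's built-in `hash` stands in for the engine's hash function.

```python
from collections import defaultdict

# Hypothetical rows: (id, last_name, first_name).
rows = [
    (1, "Smith", "Ann"),
    (2, "Jones", "Bob"),
    (3, "Smith", "Carl"),
]

def build_hash_index(rows, key_columns):
    """Map hash(key tuple) -> list of row positions (a bucket)."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        key = tuple(row[c] for c in key_columns)
        index[hash(key)].append(pos)
    return index

def hash_lookup(rows, index, key_columns, key):
    """Equality lookup: jump to the bucket, then re-check the real values,
    because different keys can share a hash value (limitation 3)."""
    matches = []
    for pos in index.get(hash(key), []):
        if tuple(rows[pos][c] for c in key_columns) == key:
            matches.append(rows[pos])
    return matches

# Composite hash index on (last_name, first_name).
name_index = build_hash_index(rows, key_columns=(1, 2))

print(hash_lookup(rows, name_index, (1, 2), ("Smith", "Ann")))  # [(1, 'Smith', 'Ann')]
# Only part of the key: the hash of ("Smith",) matches no stored entry (limitation 5).
print(hash_lookup(rows, name_index, (1, 2), ("Smith",)))        # []
```

Because the buckets are keyed by hash value, there is no way to walk them in key order, which is why range queries and sorting (limitations 1 and 2) cannot use this structure.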

FullText index

1. Overview:

Full-text indexes were originally supported only by the MyISAM storage engine. Since MySQL 5.6, InnoDB supports them as well. They can be created only on char, varchar, and text columns. A full-text index replaces the less efficient LIKE fuzzy matching, and a full-text index over a combination of fields can fuzzy-match several fields at once.

2. Storage structure:

Full-text index data is also stored in a B-Tree, but a specific algorithm is used: the field data is split into words and then indexed (generally every 4 bytes). The index file stores the set of index strings before the split and the index information after the split. Each B-Tree node stores a word produced by the split and its position in the set of index strings before the split.
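The idea of splitting field data into words and recording where each word occurs can be sketched as a small inverted index. This is a simplified illustration and not MySQL's on-disk format. The function names and sample documents are hypothetical, and the `min_word_len=4` default is an assumption that loosely mirrors the "4 bytes" mentioned above.

```python
import re
from collections import defaultdict

def build_fulltext_index(documents, min_word_len=4):
    """Map each word to the (doc_id, word_position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"\w+", text.lower())
        for position, word in enumerate(words):
            if len(word) >= min_word_len:  # short words are not indexed
                index[word].append((doc_id, position))
    return index

def search(index, word):
    """Return the sorted ids of documents that contain the word."""
    return sorted({doc_id for doc_id, _ in index.get(word.lower(), [])})

docs = {
    1: "MySQL index tuning guide",
    2: "Full-text index search in MySQL",
    3: "Cooking with garlic",
}
ft_index = build_fulltext_index(docs)

print(search(ft_index, "index"))   # [1, 2]
print(search(ft_index, "garlic"))  # [3]
print(search(ft_index, "in"))      # []
```

A word lookup goes straight to the matching documents, whereas a LIKE pattern with a leading wildcard would have to read every row.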

B-/+Tree index

  • B+Tree is the index data structure MySQL uses most often, and it is the index type of the InnoDB and MyISAM storage engines. Looking up a single record in a B+Tree is not as fast as in a Hash index, but a B+Tree is much better suited to sorting and similar operations.

1. Advantages of B+Tree index:

  • B+Tree with sequential access pointers: all index data of a B+Tree is in the leaf nodes, and each leaf node has a pointer to the adjacent leaf node. This improves the efficiency of range queries. For example, to query all records with keys from 18 to 49, once 18 is found, all matching data nodes can be reached in one pass by following the nodes and pointers in order, which greatly improves range query efficiency.

  • The number of disk I/O reads is greatly reduced.
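The range query from 18 to 49 described above can be sketched with the leaf level alone. This is a minimal illustration. The `Leaf` class and helper names are hypothetical, and the search from the root down to the first leaf is replaced by passing that leaf in directly.

```python
import bisect

class Leaf:
    """A B+Tree leaf: sorted keys plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None

def link_leaves(key_groups):
    """Build the leaf level only and chain the leaves from left to right."""
    leaves = [Leaf(keys) for keys in key_groups]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_scan(start_leaf, low, high):
    """Collect keys in [low, high] by following the next pointers.
    start_leaf stands in for the leaf a root-to-leaf search for `low` would reach."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        i = bisect.bisect_left(leaf.keys, low)
        for key in leaf.keys[i:]:
            if key > high:
                return result
            result.append(key)
        leaf = leaf.next
    return result

leaves = link_leaves([[5, 10, 15], [18, 20, 30], [35, 49, 50], [60, 70]])
print(range_scan(leaves[1], 18, 49))  # [18, 20, 30, 35, 49]
```

After the first leaf is located, the scan never returns to the upper levels of the tree. It only walks sideways through adjacent leaves.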

B-/+Tree index:

  • File systems and database systems generally use B-/+Tree as the index structure. The index itself is usually very large and cannot be held entirely in memory, so it is stored on disk in the form of an index file. As a result, searching the index incurs disk I/O, and compared with memory access, I/O access is several orders of magnitude more expensive. The most important criterion for judging a data structure as an index is therefore the asymptotic complexity of the number of disk I/O operations during a search. In other words, the index should be organized to minimize the number of disk I/O accesses during a search.

Locality processing and disk read ahead

  • Because of the characteristics of the storage medium, disk access is much slower than main memory access. Once the cost of mechanical motion is added, disk access speed is often only a few hundredths of that of main memory. To improve efficiency, disk I/O should therefore be minimized. For this reason the disk is usually not read strictly on demand but with read-ahead: even if only one byte is needed, the disk starts from that position and sequentially reads a certain length of data into memory.

  • Sequential disk reads are very efficient (no seek time is needed, only a small rotational delay), so for programs with locality, read-ahead improves I/O efficiency. The read-ahead length is generally an integral multiple of a page. A page is the logical block in which the computer manages memory: the hardware and the operating system divide main memory and the disk storage area into contiguous blocks of equal size, each called a page (in many operating systems a page is usually 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read request to the disk, which finds the starting position of the data, reads one or more consecutive pages, and loads them into memory. The fault handler then returns and the program continues to run.
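The page arithmetic behind read-ahead can be sketched in a few lines. The `pages_for_read` helper and its `read_ahead_pages` parameter are hypothetical, and the 4 KB page size is the typical figure mentioned above rather than a fixed rule.

```python
PAGE_SIZE = 4096  # bytes; the typical 4 KB page size mentioned above

def pages_for_read(offset, length, read_ahead_pages=0):
    """Return the page numbers fetched to satisfy a read of `length` bytes at `offset`,
    plus `read_ahead_pages` extra pages read sequentially afterwards."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return list(range(first, last + 1 + read_ahead_pages))

print(pages_for_read(5000, 1))                      # [1]
print(pages_for_read(4000, 200))                    # [0, 1]
print(pages_for_read(5000, 1, read_ahead_pages=2))  # [1, 2, 3]
```

Even a one-byte read brings in a whole page, and with read-ahead the following pages arrive in the same sequential pass, so data stored close together costs almost nothing extra to load.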

Performance analysis of B-/+Tree index

  • As mentioned above, the number of disk I/Os is generally used to evaluate an index structure. Start with the B-Tree: by its definition, a single search visits at most h nodes, where h is the height of the tree. Database designers cleverly exploit the disk read-ahead principle by setting the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, a practical B-Tree implementation uses the following technique:

  • Every time a new node is created, a full page of space is requested directly. This ensures that a node is physically stored within one page, and because storage allocation is page-aligned, reading a node takes only one I/O.

  • A search in a B-Tree needs at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h) = O(log_d N), where d is the out-degree of a node and N is the number of indexed keys. In practice the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).

  • In summary, a B-Tree is a very efficient index structure.

  • In a red-black tree, h is obviously much larger. Because logically adjacent nodes (parent and child) may be physically far apart, locality cannot be exploited. So although the I/O asymptotic complexity of a red-black tree is also O(h), it is significantly less efficient than a B-Tree.

  • In addition, B+Tree is better suited to external storage indexes, for a reason related to the out-degree d of the internal nodes. The analysis above shows that the larger d is, the better the index performs, and the upper limit of the out-degree depends on the size of the key and data stored in a node. Because the internal nodes of a B+Tree carry no data field, they can have a larger out-degree and therefore better performance. (See point 3 of this section for details)
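The relationship between out-degree and height, h ≈ log_d N, can be checked with a short calculation. All sizes below (16 KB page, 8-byte key, 6-byte pointer, 100-byte row data, 10 million keys) are illustrative assumptions rather than figures from any particular engine, and both helper functions are hypothetical.

```python
def tree_height(n_keys, out_degree):
    """Smallest h such that out_degree ** h >= n_keys (integer arithmetic only)."""
    height, capacity = 1, out_degree
    while capacity < n_keys:
        capacity *= out_degree
        height += 1
    return height

def node_out_degree(page_size, key_size, pointer_size, data_size=0):
    """Entries per node when each entry holds a key, a child pointer,
    and (for a B-Tree node) the row data itself."""
    return page_size // (key_size + pointer_size + data_size)

# Assumed sizes: 16 KB page, 8-byte key, 6-byte pointer, 100-byte row data.
d_bplus = node_out_degree(16 * 1024, 8, 6)       # B+Tree internal node: keys and pointers
d_btree = node_out_degree(16 * 1024, 8, 6, 100)  # B-Tree node: keys, pointers, and data

print(d_bplus, tree_height(10_000_000, d_bplus))  # 1170 3
print(d_btree, tree_height(10_000_000, d_btree))  # 143 4
print(tree_height(10_000_000, 2))                 # 24 (binary tree, e.g. red-black)
```

Under these assumptions, dropping the data field from internal nodes raises the out-degree roughly eightfold and removes one level from the tree, while a binary tree needs 24 levels for the same number of keys.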

Comparison of B-Tree and B+Tree

  • Comparing the structures of B-Tree and B+Tree shows that B+Tree has several advantages over B-Tree in file systems and database systems. The reasons are as follows:

1. B+ tree disk read and write costs are lower

The internal nodes of a B+ tree hold no pointers to the actual record for each key, so they are smaller than those of a B-tree. If all the keys of one internal node are stored in the same disk block, that block can hold more keys, more of the keys to be searched are read into memory at once, and the number of I/O reads and writes is correspondingly lower.

2. B+ tree query efficiency is more stable

The internal nodes do not point to the file content itself. They are only an index to the keys in the leaf nodes. Every key search must therefore follow a path from the root node to a leaf node. All key queries have the same path length, so every lookup has the same efficiency.

3. B+ tree is more conducive to scanning the database

The B-tree improves disk I/O performance but does not solve the problem of inefficient element traversal. The B+ tree only needs to traverse the leaf nodes to scan all keys. For the range queries that are so common in databases, the B+ tree therefore performs better.

Implementation of MySQL Index

  • In MySQL, indexes are a storage-engine-level concept, and different storage engines implement them in different ways. This section discusses the index implementations of the MyISAM and InnoDB storage engines.

Implementation of MyISAM index

1. Primary key index

The MyISAM engine uses B+Tree as the index structure, and the data field of a leaf node stores the address of the data record. The following figure is a schematic diagram of a MyISAM index:

  • The table has three columns. Assuming that Col1 is the primary key, the figure above shows the primary index of a MyISAM table. It can be seen that the MyISAM index file stores only the addresses of the data records.

2. Auxiliary Index

In MyISAM, the primary index and the secondary index (secondary key) have no structural difference, except that the keys of the primary index must be unique while the keys of a secondary index may repeat. If we build a secondary index on Col2, the structure of this index is shown in the figure below:

  • It is also a B+Tree, and the data field stores the address of the data record. The index retrieval algorithm in MyISAM is therefore to search the index with the B+Tree search algorithm first. If the specified key exists, the value of its data field is taken out and used as the address from which the corresponding data record is read.

  • MyISAM's indexing method is also called "non-clustered", to distinguish it from InnoDB's clustered index.
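The MyISAM retrieval steps described above (search the index, take the address from the data field, read the record at that address) can be sketched as follows. This is a simplified model. The `data_file` contents and function names are hypothetical, list positions stand in for file addresses, and a sorted list searched with `bisect` stands in for the B+Tree.

```python
import bisect

# Data file: records of (Col1, Col2, Col3), addressed by position.
data_file = [
    (15, 34, "Bob"),
    (18, 77, "Alice"),
    (20, 5, "Jim"),
    (30, 91, "Eric"),
]

def build_index(column):
    """Sorted (key, address) pairs; the data field of each entry is a record address."""
    return sorted((record[column], address) for address, record in enumerate(data_file))

def myisam_lookup(index, key):
    """Search the index for the key, then use the stored address to read the record."""
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        address = index[i][1]
        return data_file[address]
    return None

primary_index = build_index(0)    # index on Col1
secondary_index = build_index(1)  # index on Col2, same structure

print(myisam_lookup(primary_index, 20))    # (20, 5, 'Jim')
print(myisam_lookup(secondary_index, 77))  # (18, 77, 'Alice')
```

Both indexes have the same shape and both point into the separate data file, which is what "non-clustered" means here.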

InnoDB index implementation

  • Although InnoDB also uses B+Tree as the index structure, the specific implementation is different.

1. Primary key index

The first major difference from MyISAM is that InnoDB's data file is itself an index file. As described above, MyISAM keeps the index file and the data file separate, and the index file stores only the addresses of the data records. In InnoDB, the table data file is itself an index structure organized as a B+Tree, and the data field of each leaf node holds a complete data record. The key of this index is the primary key of the table, so the InnoDB table data file is the primary index.

The above figure is a schematic diagram of InnoDB's primary index (which is also the data file). You can see that the leaf nodes contain complete data records. This kind of index is called a clustered index. Because InnoDB's data file is clustered by the primary key, InnoDB requires every table to have a primary key (MyISAM does not). If none is specified explicitly, MySQL automatically selects a column that uniquely identifies the data records as the primary key. If no such column exists, MySQL automatically generates an implicit field for the InnoDB table to serve as the primary key. This field is 6 bytes long and its type is long integer.

2. Auxiliary Index

The second difference from the MyISAM index is that the data field of an InnoDB secondary index stores the primary key value of the corresponding record instead of its address. In other words, all InnoDB secondary indexes use the primary key as their data field. For example, the following figure shows a secondary index defined on Col3:

  • Here, the ASCII code of English characters is used as the comparison criterion. This clustered index implementation makes searching by primary key very efficient, but a secondary index search has to search two indexes: first the secondary index to obtain the primary key, and then the primary index to retrieve the record by that primary key.

  • InnoDB tables are built on clustered indexes, so InnoDB provides very fast primary key lookups. However, every secondary index also contains the primary key columns, so a long primary key makes all secondary indexes larger. If you plan to define many indexes on a table, keep the primary key as small as possible. InnoDB does not compress indexes.
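The two-step secondary index search can be sketched in the same style. This is a simplified model rather than InnoDB's implementation. The table contents and function names are hypothetical, and plain dictionaries stand in for the two B+Trees.

```python
# Clustered (primary) index: primary key -> the complete row, as in InnoDB leaf nodes.
clustered = {
    15: (15, 34, "Bob"),
    18: (18, 77, "Alice"),
    20: (20, 5, "Jim"),
}

# Secondary index on Col3: Col3 value -> primary key (not a record address).
secondary_col3 = {row[2]: pk for pk, row in clustered.items()}

def lookup_by_primary_key(pk):
    """One index search: the leaf already holds the row."""
    return clustered.get(pk)

def lookup_by_col3(name):
    """Two index searches: secondary index -> primary key -> clustered index."""
    pk = secondary_col3.get(name)
    if pk is None:
        return None
    return clustered[pk]

print(lookup_by_primary_key(18))  # (18, 77, 'Alice')
print(lookup_by_col3("Jim"))      # (20, 5, 'Jim')
```

Every entry in `secondary_col3` stores a copy of the primary key, which is why a long primary key enlarges every secondary index on the table.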

Origin blog.csdn.net/weixin_45784983/article/details/108365935