Detailed explanation of MySQL indexes (step-by-step tutorial)

1. Index Overview

An index is an ordered data structure that helps MySQL retrieve data efficiently. In addition to the data itself, the database system maintains data structures that satisfy specific search algorithms. These structures reference (point to) the data in some way, so that advanced search algorithms can be run on them. Such a data structure is an index.

2. Advantages and Disadvantages of Indexing

Advantages:

  • Improve data retrieval efficiency and reduce database IO costs
  • Sort data through index columns to reduce the cost of data sorting and reduce CPU consumption.

Disadvantages:

  • Indexes also take up storage space
  • Indexes greatly improve query efficiency, but they slow down writes such as INSERT, UPDATE, and DELETE, because every index on the table has to be maintained as well.

3. Index structure

3.1 Introduction to index structure

Index structure | Description
B+Tree | The most common index type; supported by most engines (the default)
Hash | Implemented with a hash table. Only exact-match queries on the index column can use it; range queries are not supported.
R-Tree (spatial index) | A special index type used mainly for geospatial data types. Supported by MyISAM, and by InnoDB since MySQL 5.7 (rarely used)
Full-Text (full-text index) | Matches documents quickly by building an inverted index, similar to Lucene, Solr, and ES (rarely used)

 3.2 Index support by different storage engines

Index | InnoDB | MyISAM | Memory
B+Tree index | Supported | Supported | Supported
Hash index | Not supported | Not supported | Supported
R-Tree index | Supported since 5.7 | Supported | Not supported
Full-text index | Supported since 5.6 | Supported | Not supported

 The indexes we usually refer to, unless otherwise specified, refer to indexes organized in a B+ tree structure.

 4. Introduction to data structures (binary tree, red-black tree, Btree, B+tree)

4.1 Binary tree

Binary Tree is a special tree data structure in which each node has at most two child nodes, called left child node and right child node. The definition of a binary tree is as follows:

A binary tree can be empty (that is, have no nodes), or it can consist of a root node and two binary trees called left subtree and right subtree.

Characteristics of binary trees:

  1. Each node has at most two child nodes, namely the left child node and the right child node.
  2. The left subtree and right subtree are also binary trees and can be empty.
  3. There is no specific order for the child nodes of a binary tree, and the positions of the left and right child nodes can be determined according to the specific application.

 4.2 Red-black tree

A Red-Black Tree is a self-balancing binary search tree that maintains balance by recoloring nodes and rotating subtrees after insertions and deletions. The name comes from the color marker on each node, which is either red or black.

Red-black trees have the following characteristics:

  1. Each node is either red or black.
  2. The root node is black.
  3. All leaf nodes (NIL nodes) are black.
  4. If a node is red, both of its child nodes are black.
  5. For any node, the simple path from that node to all its descendant leaf nodes contains the same number of black nodes.

These characteristics ensure the key property of red-black trees, that is, the longest path from the root node to any leaf node will not exceed twice the shortest path, thus maintaining the balance of the tree. This balance makes red-black trees very efficient in practical applications, and they are often used as the basis for data structures such as collections and mappings.

Disadvantage of the red-black tree: it is still a binary tree, so with a large volume of data the tree becomes deep and retrieval is slow.

4.3 B-Tree (multi-path balanced search tree)

B-Tree (B-tree) is a self-balancing search tree structure used to store and organize large amounts of data. It is widely used in fields such as databases and file systems to provide efficient data access and query performance.

B-Tree features include:

  1. Multi-path balancing: Each node can contain multiple keys and child nodes, which keeps the tree shallow and well balanced. All leaf nodes of a B-Tree are located at the same level.
  2. Orderliness: The keys in a B-Tree are kept in ascending order, which makes range queries efficient.
  3. Disk friendliness: The node size of B-Tree usually matches the size of the hard disk page, which can minimize disk I/O operations and improve read and write performance.
  4. Adaptability: B-Tree can dynamically adjust its structure to adapt to dynamic insertion and deletion operations of data, maintaining balance and stable performance.

The basic operations of B-Tree include insertion, deletion and search. During insertion and deletion operations, B-Tree maintains balance by redistributing keys and adjusting nodes. By using B-Tree indexes, data retrieval efficiency can be significantly improved, especially for large-scale data sets.

It should be noted that B-Tree is not limited to the structure of a binary tree. Each node can contain multiple child nodes, making it suitable for processing large-scale data sets.

For an animation of the B-Tree insertion process, see the B-Tree page of the Data Structure Visualization site.

4.4  B+Tree

B+Tree is a self-balancing search tree similar to B-Tree and is widely used in databases and file systems. It is a variant of B-Tree with optimizations for storage and query performance.

B+ tree is similar to B-Tree and also has the characteristics of multi-path balance, orderliness and disk friendliness. But B+ trees have a different design in some aspects:

  1. Only leaf nodes store data: The internal nodes of the B+ tree store only index information, while the actual data records are stored in the leaf nodes.
  2. Leaf nodes are connected through pointers: The leaf nodes of the B+ tree are linked with pointers to form an ordered linked list, which facilitates range queries and sequential traversal.
  3. Better performance in sequential access: Because the leaf nodes form an ordered linked list, range queries and traversal in key order can follow the list instead of going back up through the internal nodes. For these workloads B+ trees are more suitable than B-Trees.
  4. Smaller internal nodes: Because internal nodes hold no row data, each one fits more keys and child pointers. This reduces the space taken by internal nodes and keeps the tree shallow.

B+ trees are often used as index structures in database systems, and are particularly suitable for scenarios that support range queries and sequential access to data. Its balance and disk-friendliness enable good performance during storage and retrieval of large-scale data sets.

For an animation of the B+Tree insertion process, see the B+ Tree page of the Data Structure Visualization site.

MySQL's index structure is an optimized version of the classic B+Tree: each leaf node also carries a linked-list pointer to its adjacent leaf node, forming a B+Tree with sequential pointers, which improves the performance of range access.

4.5 Hash index

Hash Index is an index structure used to quickly find data in a database. It achieves efficient data search by converting the keyword (Key) into a fixed-length hash value (Hash Value) through a hash function (Hash Function), and then establishing a mapping relationship between this hash value and the storage location.

The main features of hash indexes include:

  1. Fast search: By using a hash function to map keys to storage locations, a hash index can reach the target data in roughly constant time, so lookups are very efficient.

  2. Equality query optimization: Hash indexes are suitable for equality comparison queries (such as WHERE column = value). For such queries, you only need to calculate the hash value and perform a lookup without traversing the entire index.

  3. Range queries and sorting are not supported: Since hash indexes are looked up by hash value, and hash values do not preserve the order of the keys, range queries (such as WHERE column > value) and sorting operations are not supported.

  4. Conflict handling: The hash function may map different keys to the same hash value; this situation is called a hash collision. Common resolution methods include the open address method and the linked list method.

It should be noted that the hash index may not be as effective as the B-tree index in some scenarios because it cannot support range queries and sort operations, and performance may decrease when there are a large number of conflicts. Therefore, when selecting an index type, comprehensive considerations need to be made based on specific business needs and data characteristics.

A hash index uses a hash algorithm to convert each key value into a hash value, maps it to the corresponding slot, and stores it in the hash table.
If two (or more) key values map to the same slot, a hash conflict (also called a hash collision) occurs, which can be resolved with a linked list.

Hash index features:

  • A hash index can only be used for equality comparisons (=, IN) and does not support range queries (BETWEEN, >, <, ...)
  • The index cannot be used to complete sort operations
  • Query efficiency is high: usually a single lookup is enough, so it is typically faster than a B+Tree index

Storage engine supports:

  • Memory
  • InnoDB: has an adaptive hash index feature. The hash index is built automatically by the storage engine on top of the B+Tree index under certain conditions.
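
As a rough illustration of the points above, the sketch below creates a Memory-engine table with an explicit hash index and then checks whether InnoDB's adaptive hash index is switched on. The table tb_session and its columns are made up for this example.

```sql
-- Hypothetical table for illustration; the Memory engine supports hash indexes.
CREATE TABLE tb_session (
    token   CHAR(32) NOT NULL,
    user_id INT      NOT NULL,
    INDEX idx_session_token (token) USING HASH
) ENGINE = MEMORY;

-- Equality lookup: the kind of query a hash index can serve.
SELECT user_id FROM tb_session WHERE token = 'abc';

-- Range condition: a hash index cannot help here.
SELECT user_id FROM tb_session WHERE token > 'abc';

-- InnoDB builds its adaptive hash index on its own;
-- this variable only shows whether the feature is enabled.
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';
```

Running EXPLAIN on the two SELECT statements is one way to compare how each of them is executed.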

Interview question

Why does the InnoDB storage engine choose to use the B+Tree index structure?

  • Compared with a binary tree: When keys are inserted in sequential order, a plain binary search tree degenerates into a linked list and query performance drops sharply. With large amounts of data the tree is deep and retrieval is slow. A B+ tree stays balanced under sequential insertion, has fewer levels, and searches efficiently.
  • Compared with the red-black tree: Although the red-black tree avoids degenerating into a linked list, it is still a binary tree. With large amounts of data it is deep and retrieval is slow. A B+ tree holds many keys per node, so it has far fewer levels.
  • Compared with B-Tree: In a B-Tree, both leaf and non-leaf nodes store data. This reduces the number of keys and pointers that fit in one page, so storing a large amount of data requires a taller tree, which reduces performance. The internal nodes of a B+ tree store only index information, while the actual data records are stored in the leaf nodes. The leaf nodes also form an ordered linked list, which facilitates range queries and sequential traversal.
  • Compared with a Hash index: A hash index only supports equality matching and does not support range queries or sorting. A B+Tree supports range matching and sorting.
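
The "fewer levels" argument can be made concrete with a back-of-the-envelope estimate. It assumes InnoDB's default 16 KB page, an 8-byte BIGINT primary key, roughly 6 bytes per child pointer, and rows of about 1 KB. The last three figures are assumptions chosen for illustration, and page headers are ignored.

```latex
% n keys and n+1 child pointers must fit in one 16 KB internal page
8n + 6(n + 1) = 16 \times 1024 \;\Rightarrow\; n \approx 1170

% each leaf page holds about 16 rows of 1 KB
\text{height } 2:\quad 1171 \times 16 = 18\,736 \ \text{rows}

\text{height } 3:\quad 1171 \times 1171 \times 16 = 21\,939\,856 \ \text{rows}
```

Under these assumptions a tree only three levels deep already addresses about twenty million rows, whereas a balanced binary tree needs roughly 25 levels for the same number of keys.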

5. Introduction to index classification

5.1 Index classification

Classification | Meaning | Features | Keyword
Primary key index | Index created on the primary key of the table | Created automatically; only one per table | PRIMARY
Unique index | Prevents duplicate values in a column of the table | There can be multiple | UNIQUE
Regular index | Quickly locates specific data | There can be multiple | (none)
Full-text index | Searches for keywords in the text rather than comparing values in the index | There can be multiple | FULLTEXT
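
To show where each kind of index is declared, here is a minimal sketch of a table definition. The table tb_article, its columns, and the index names are hypothetical.

```sql
-- Hypothetical table used only to show where each kind of index is declared.
CREATE TABLE tb_article (
    id      INT          NOT NULL AUTO_INCREMENT,
    slug    VARCHAR(100) NOT NULL,
    author  VARCHAR(50)  NOT NULL,
    content TEXT,
    PRIMARY KEY (id),                          -- primary key index, one per table
    UNIQUE KEY uk_article_slug (slug),         -- unique index
    KEY idx_article_author (author),           -- regular index
    FULLTEXT KEY ft_article_content (content)  -- full-text index
) ENGINE = InnoDB;

-- List the indexes that were created.
SHOW INDEX FROM tb_article;
```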

5.2 InnoDB storage engine index classification 

In the InnoDB storage engine, according to the storage form of the index, it can be divided into the following two types:

Classification | Meaning | Features
Clustered index | Stores the data together with the index: the leaf nodes of the index structure hold the row data | Exactly one per table
Secondary index | Stores the data separately from the index: the leaf nodes of the index structure hold the corresponding primary key | There can be multiple

Clustered index selection rules:

  • If a primary key exists, the primary key index is the clustered index
  • If there is no primary key, the first unique (UNIQUE) index whose columns are all NOT NULL is used as the clustered index
  • If the table has neither a primary key nor a suitable unique index, InnoDB automatically generates a hidden row ID and uses it as the clustered index

Assume that the id field of the user table is the clustered index and the name field has a secondary index. The statement select * from user where name = 'Arm' is then executed as follows:

MySQL first searches the secondary index for name = 'Arm' and finds that the matching id is 10. It then searches the clustered index for id = 10, where the full row data is stored. This second step is known as a table lookup (going back to the table).

Thinking questions

Which of the following SQL statements has the highest execution efficiency? Why?

select * from user where id = 10;

select * from user where name = 'Arm';

-- Note: id is the primary key, and an index has been created on the name field

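Answer: The first statement. It reads the row directly from the clustered index, whereas the second statement first searches the secondary index on name and then needs a table lookup in the clustered index, which amounts to two steps.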
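
The difference can be inspected with EXPLAIN. The table definition below is a hypothetical one that matches the example (id as primary key, a secondary index on name); the index name idx_user_name is an assumption.

```sql
-- Hypothetical definition matching the example:
-- id is the clustered index, name has a secondary index.
CREATE TABLE user (
    id   INT PRIMARY KEY,
    name VARCHAR(50),
    INDEX idx_user_name (name)
);

-- One step: the row is read straight from the clustered index.
EXPLAIN SELECT * FROM user WHERE id = 10;

-- Two steps: idx_user_name yields the id, then the clustered index yields the row.
EXPLAIN SELECT * FROM user WHERE name = 'Arm';

-- Only id and name are needed, and both are stored in idx_user_name,
-- so no table lookup is required.
EXPLAIN SELECT id, name FROM user WHERE name = 'Arm';
```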

6. Use of indexes (create, view, delete)

Create index:
  CREATE [ UNIQUE | FULLTEXT ] INDEX index_name ON table_name (index_col_name, ...);

The explanation is as follows:

  • CREATE INDEX: Keyword to create index.
  • [ UNIQUE | FULLTEXT ]: Optional parameter specifying the index type. UNIQUE creates a unique index, meaning the values of the indexed column(s) must be unique; FULLTEXT creates a full-text index for full-text search. If neither is specified, a regular index is created.
  • index_name: Specifies the name of the index.
  • ON table_name: Specifies the table on which to create the index; table_name is the table name.
  • (index_col_name, ...): Specify the column name to create an index. You can specify one or more columns as the keys of the index. Separate multiple columns with commas.

For example, to create a regular index named idx_username on the username column of the users table, you can use the following statement:

        CREATE INDEX idx_username ON users (username);

If you want to create a unique index, add the UNIQUE keyword to the statement:

        CREATE UNIQUE INDEX idx_email ON users (email);

If you want to create a full-text index, add the FULLTEXT keyword to the statement:

        CREATE FULLTEXT INDEX idx_content ON articles (content);

View index:
  SHOW INDEX FROM table_name;

Delete index:
  DROP INDEX index_name ON table_name;

Case: 

  -- The name field holds people's names and its values may repeat; create an index on it
  create index idx_user_name on tb_user(name);
  -- The phone field is non-null and unique; create a unique index on it
  create unique index idx_user_phone on tb_user(phone);
  -- Create a joint index on profession, age, and status
  create index idx_user_pro_age_stat on tb_user(profession, age, status);
  -- Create a suitable index on email to improve query efficiency
  create index idx_user_email on tb_user(email);
  -- Drop an index
  drop index idx_user_email on tb_user;

7. SQL performance analysis

7.1 SQL execution frequency (understand)

After a MySQL client has connected to the server successfully, server status information is available through the SHOW [SESSION | GLOBAL] STATUS command. The following statements show how often INSERT, UPDATE, DELETE, and SELECT statements have been executed in the current session:

SHOW SESSION STATUS LIKE 'Com_insert';
SHOW SESSION STATUS LIKE 'Com_update';
SHOW SESSION STATUS LIKE 'Com_delete';
SHOW SESSION STATUS LIKE 'Com_select';

In MySQL, "session" and "global" are both used to refer to variables or parameters at different levels.

  1. Session level: Session-level variables or parameters apply only to the current session (connection). A value set this way is valid only for the current connection and has no effect on other connections. For example, session-level variables set through a SET statement only take effect in the current session and are discarded when the session ends.

  2. Global level: Global-level variables or parameters apply to the entire MySQL server instance. For example, a global-level variable is set by modifying the configuration file or by using a SET GLOBAL statement. For variables that also have a session value, existing sessions keep their current value and new connections start from the new global value.

Within a command, you can access variables or parameters at different levels using:

  • SHOW SESSION STATUS: Displays the current session level status variables.
  • SHOW GLOBAL STATUS: Display global level status variables.

Note that some variables can only be viewed or set at a specific level. Therefore, when choosing between "session" and "global", check whether the variable or parameter is available at that level and whether setting it requires particular privileges.
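
For comparison with the session-level statements above, this is a minimal sketch of reading the same counters at the global level. The LIKE pattern relies on "_" matching any single character, so it may also return other Com_ counters whose names have the same length.

```sql
-- Statement counters for the whole server since it started.
SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_select';

-- All four counters in one statement: "Com" followed by seven
-- single-character wildcards matches Com_insert, Com_update,
-- Com_delete and Com_select (and any other counter of that length).
SHOW GLOBAL STATUS LIKE 'Com_______';
```

A table whose Com_select count is far higher than its insert, update, and delete counts is read-heavy, which makes it a natural candidate for index tuning.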

7.2 Slow query log (understand)

The MySQL slow query log is a log that records SQL query statements whose execution time exceeds a specific threshold. It helps you identify and optimize performance bottlenecks in your database.

To enable MySQL slow query logging, you need to perform the following steps:

1. Open the MySQL configuration file (usually `my.cnf` or `my.ini`). You can find the file at:

  • Linux: /etc/mysql/my.cnf or /etc/my.cnf
  • Windows: my.ini in the MySQL installation directory

2. Find the `[mysqld]` section in the configuration file. If it does not exist, add it at the end of the file.

3. Add the following lines to the `[mysqld]` section to enable slow query logging and set the threshold in seconds:
   slow_query_log = 1
   slow_query_log_file = /path/to/slow-query.log
   long_query_time = 2

  • slow_query_log: Set to 1 to enable the slow query log.
  • slow_query_log_file: Specifies the path and file name of the slow query log. Choose a path and name that suit your environment.
  • long_query_time: The threshold in seconds. A query that takes longer than this to execute is considered a slow query. Adjust the value to suit your application.

4. Save and close the configuration file.

5. Restart the MySQL service for the configuration changes to take effect.

6. Now, MySQL slow query log is enabled. Query statements whose execution time exceeds the threshold will be recorded in the specified log file.

To view the slow query log, you can use a text editor to open the specified log file (`/path/to/slow-query.log`) to view the slow query statements and related information recorded in it.

Please note that enabling slow query logs may have some impact on database performance and should be used with caution in production environments and configured and managed appropriately as needed.
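
The same settings can usually also be inspected and changed at runtime, without editing the configuration file. The sketch below assumes an account with sufficient privileges; values set this way are lost when the server restarts unless they are also written to the configuration file.

```sql
-- Show the current slow query log settings.
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';

-- Enable the log at runtime.
SET GLOBAL slow_query_log = 'ON';

-- Use a 2-second threshold for the current session.
SET SESSION long_query_time = 2;

-- A deliberately slow statement that should be written to the log.
SELECT SLEEP(3);
```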

7.3 profile details (view SQL execution time) (understand)

In MySQL, the query "profiling" feature can be used to track and analyze query performance. When query performance analysis is enabled, MySQL will record detailed execution statistics for each query.

To enable profiling for a query and view detailed execution statistics for a query, follow these steps:

1. Open a MySQL client or use the appropriate MySQL graphical user interface tool to connect to the database.

2. In the session, execute the following command to enable the profiling function of the query:
   SET profiling = 1;

3. Then, execute the query you want to analyze.

4. After the query is executed, use the following command to view the profiling details of the query:
   SHOW PROFILES;

   This will display a list of all queries, each with a unique query ID.

5. Select the query whose profiling details you want to view and use the following command:
   SHOW PROFILE FOR QUERY <query_id>;
   Replace `<query_id>` with the actual query ID of the query you want to view.

   This displays the execution details of the selected query, chiefly the time spent in each stage of execution (for example opening tables and executing).

6. After viewing the profiling details, you can stop profiling for the session with the following command:
   SET profiling = 0;

Please note that enabling profiling of queries may have some impact on performance. Therefore, profiling should be enabled only when detailed analysis of query performance is required, and profiling should be stopped promptly to avoid unnecessary overhead.

 7.4 explain execution plan (important)

EXPLAIN is a MySQL statement that returns the execution plan of a query. The execution plan shows details such as the order of operations chosen by the MySQL optimizer, the indexes used, and the data access method.

The result set of EXPLAIN contains multiple fields, each describing a different aspect of query execution. The common fields and their meanings are as follows:

  1. id

    • The sequence number of the SELECT that the row belongs to.
    • Rows with the same id are generally read from top to bottom; when ids differ, the larger id is generally executed first.
  2. select_type

    • Indicates the type of operation to perform.
    • Common types include: SIMPLE (simple query), PRIMARY (main query), SUBQUERY (subquery), DERIVED (derived table query), UNION (joint query), etc.
    • This field helps you understand the types and relationships of different operations in the query.
  3. table

    • Indicates the name of the table involved in the operation.
    • If the query involves more than one table, there is one row per table, listed in the order in which MySQL reads them.
  4. type

    • Indicates how MySQL will access the table.
    • Common types include: ALL (full table scan), index (full index scan), range (index range scan), ref (non-unique index lookup), eq_ref (unique index lookup), const (constant lookup), etc.
    • Usually, accessing the table through an index is better than a full table scan.
    • From best to worst, the access types are: system > const > eq_ref > ref > range > index > ALL.

      system: The table has only one row (a system table)
      const: Index lookup using constants
      eq_ref: Unique index scan, usually on a primary key or unique index
      ref: Non-unique index scan
      range: Index range scan
      index: Full index scan
      ALL: Full table scan
  5. possible_keys

    • Indicates potential indexes that MySQL can use.
    • These are candidates only; the index actually chosen is shown in the key field.
  6. key

    • Indicates the actual index chosen to use.
    • If this field is NULL, no index is used.
    • Generally, a better execution plan is to use efficient indexes to speed up queries.
  7. rows

    • The number of rows MySQL estimates it must examine to execute the query.
    • This is an estimate based on table and index statistics, not an exact count.
  8. Extra

    • Provides additional execution information to help further understand the details of query execution.
    • Possible values include: Using index (the query is satisfied from the index alone), Using where (rows are filtered with the WHERE clause), Using temporary (a temporary table is used), Using filesort (an extra sorting pass is needed), etc.

These fields provide details about the query execution and can help developers understand how the query was executed, access patterns, and whether there are potential performance issues. By analyzing the fields in the Explain execution plan result set, you can make appropriate optimization decisions, such as creating appropriate indexes, rewriting queries, or adjusting query statements to improve query performance.
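
As a short sketch of reading these fields, the statements below run EXPLAIN against tb_user with the indexes created in the Case of section 6. The literal values are made up, and the plans described in the comments are what one would typically expect, not guaranteed output; the optimizer decides based on the actual data.

```sql
-- Equality on the unique index idx_user_phone:
-- typically type = const and key = idx_user_phone.
EXPLAIN SELECT * FROM tb_user WHERE phone = '17799990010';

-- Equality on the leftmost column of idx_user_pro_age_stat:
-- typically type = ref and key = idx_user_pro_age_stat.
EXPLAIN SELECT * FROM tb_user WHERE profession = 'Software Engineering';

-- age is not the leftmost column of any index:
-- typically type = ALL and key = NULL (full table scan).
EXPLAIN SELECT * FROM tb_user WHERE age = 30;
```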

8. Index usage rules

8.1 Leftmost prefix rule

The leftmost prefix rule means that a query on a joint index must start from the leftmost column of the index and must not skip columns in the middle. If a column is skipped, the index is only partially effective: the columns before the gap are used, and the columns after it are not.

The reason for this rule is that the joint index is stored in order by multiple columns of the index. When querying, the database system will search based on the leftmost column of the index, and gradually search to the right in the order of the index until data that meets all conditions is found or no further matching is possible.

If we skip a column for query, the columns after this column will not be searched in the order of the index, causing the index to become invalid. This will cause the database to need to scan more data pages to satisfy the query conditions, thereby reducing query performance.

Therefore, when using a joint index to query, you should abide by the leftmost prefix rule and query in the order of the index columns, so that you can maximize the performance advantages provided by the index. If you need to perform flexible queries on multiple columns, you can consider creating a more appropriate index or use other query optimization methods to improve performance.
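
The rule can be tried out on the joint index idx_user_pro_age_stat (profession, age, status) from section 6. The literal values and the assumption that status is a string column are for illustration only.

```sql
-- All three index columns are usable.
EXPLAIN SELECT * FROM tb_user
WHERE profession = 'Software Engineering' AND age = 31 AND status = '0';

-- The order of conditions in WHERE does not matter;
-- what matters is that the leftmost column is present.
EXPLAIN SELECT * FROM tb_user
WHERE status = '0' AND age = 31 AND profession = 'Software Engineering';

-- age is skipped: only the profession part of the index is used.
EXPLAIN SELECT * FROM tb_user
WHERE profession = 'Software Engineering' AND status = '0';

-- The leftmost column is missing: the index cannot be used to locate rows.
EXPLAIN SELECT * FROM tb_user
WHERE age = 31 AND status = '0';
```

Comparing the key_len column across these plans shows how many bytes, and therefore how many columns, of the index were actually used.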

8.2 Range queries on a joint index

When a joint index is used with a range condition (<, >) on one column, the index columns to the right of that column are not used. Within the range matched on that column, the values of the later columns are no longer in one sorted sequence, so MySQL cannot use them to narrow the search.

To limit this effect, consider using >= or <= instead of > or < where the business logic allows it. With >= or <=, the boundary value itself is matched by equality, and MySQL can generally continue into the columns to the right for it, which EXPLAIN typically reports as a longer key_len.

For example, on an integer column, col1 > 5 returns the same rows as col1 >= 6, so col1 > 5 AND col2 = 10 can be rewritten as col1 >= 6 AND col2 = 10.

Note that the rewrite must not change the query result: replacing > with >= is only safe when the boundary can be shifted (as with integers) or when including the boundary value is acceptable. Check the effect with EXPLAIN before relying on it.
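
A minimal way to check this on your own data is to compare the two plans below. The example assumes the joint index idx_user_pro_age_stat (profession, age, status) from section 6, an integer age column, and made-up literal values.

```sql
-- Strict range on age: status, to the right of age, is typically not used,
-- so key_len covers only profession and age.
EXPLAIN SELECT * FROM tb_user
WHERE profession = 'Software Engineering' AND age > 30 AND status = '0';

-- Same rows for an integer column, written with >=:
-- key_len is typically longer because status can take part as well.
EXPLAIN SELECT * FROM tb_user
WHERE profession = 'Software Engineering' AND age >= 31 AND status = '0';
```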

8.3 SQL hints

  1. USE INDEX: Suggests that MySQL use one of the named indexes. MySQL may still decide otherwise, for example in favor of a full table scan.
  2. IGNORE INDEX: Tells MySQL not to use the named indexes and to choose among the remaining ones.
  3. FORCE INDEX: Forces MySQL to use the named index wherever it can be used, even if the optimizer would have preferred another plan.
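
The hint is written directly after the table name. The sketch below uses the index idx_user_pro_age_stat from section 6; the literal value is made up.

```sql
-- Suggest an index.
SELECT * FROM tb_user USE INDEX (idx_user_pro_age_stat)
WHERE profession = 'Software Engineering';

-- Exclude an index from consideration.
SELECT * FROM tb_user IGNORE INDEX (idx_user_pro_age_stat)
WHERE profession = 'Software Engineering';

-- Force an index.
SELECT * FROM tb_user FORCE INDEX (idx_user_pro_age_stat)
WHERE profession = 'Software Engineering';
```

Prefixing each statement with EXPLAIN shows whether the hint changed the key that MySQL chose.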

8.4 Covering index

Try to use covering indexes (the query uses an index, and every column it needs to return can be found in that index) and avoid select *.

The meaning of the Extra field in EXPLAIN:
Using index condition: the lookup uses an index, but the row data must be fetched with a table lookup
Using where; Using index: the lookup uses an index and all required columns are in the index, so no table lookup is needed

If the row can be found directly in the clustered index, the row data is returned directly and only one lookup is needed, even with select *. If the query goes through a secondary index and needs only the indexed column and the primary key (for example select id, name from xxx where name='xxx'; with an index on name), the secondary index already holds both values, so again only one lookup is needed. If other fields are requested through the secondary index, a table lookup is required, such as select id, name, gender from xxx where name='xxx';

So try not to use select *. It easily causes table lookups and reduces efficiency, unless a joint index contains all of the fields.

Interview question: A table has four fields (id, username, password, status). Because the table holds a large amount of data, the following SQL statement needs to be optimized. What is the best approach?
select id, username, password from tb_user where username='itcast';

Solution: Create a joint index on the username and password fields. Assuming id is the primary key, it is stored in every secondary index entry, so the index covers the query and no table lookup is needed.
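
The solution can be sketched as follows. The index name idx_user_name_pass is an assumption, and id is assumed to be the primary key of tb_user.

```sql
-- Joint index that covers username and password; id comes along
-- because every secondary index entry stores the primary key.
CREATE INDEX idx_user_name_pass ON tb_user (username, password);

-- All requested columns are in the index: Extra is expected to show "Using index".
EXPLAIN SELECT id, username, password FROM tb_user WHERE username = 'itcast';

-- status is not in the index, so this query needs a table lookup.
EXPLAIN SELECT id, username, password, status FROM tb_user WHERE username = 'itcast';
```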

8.5 Prefix index

When the field is a string type (varchar, text, etc.), it is sometimes necessary to index a very long string, which makes the index very large and wastes a lot of disk IO during queries. In this case, you can index only a prefix of the string, which greatly reduces index size and improves index efficiency.

Syntax: create index idx_xxxx on table_name(column(n));
Prefix length: can be determined from the selectivity of the index, which is the ratio of distinct index values (cardinality) to the total number of records in the table. The higher the selectivity, the higher the query efficiency. A unique index has a selectivity of 1, which is the best possible selectivity and gives the best performance.

Queries for computing selectivity:

  select count(distinct email) / count(*) from tb_user;
  select count(distinct substring(email, 1, 5)) / count(*) from tb_user;
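
Several candidate prefix lengths can be compared in a single query before the index is created. The lengths 5 and 10 and the index name idx_email_5 are arbitrary choices for illustration.

```sql
-- Selectivity of the full column and of two candidate prefixes.
SELECT
    COUNT(DISTINCT email) / COUNT(*)                   AS full_column,
    COUNT(DISTINCT SUBSTRING(email, 1, 5)) / COUNT(*)  AS prefix_5,
    COUNT(DISTINCT SUBSTRING(email, 1, 10)) / COUNT(*) AS prefix_10
FROM tb_user;

-- Create the prefix index with the chosen length.
CREATE INDEX idx_email_5 ON tb_user (email(5));
```

A reasonable choice is the shortest prefix whose selectivity is close to that of the full column.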

8.6 Single column index and joint index

Single column index: that is, an index only contains a single column.
Joint index: that is, an index contains multiple columns.
In a business scenario with multiple query conditions, it is recommended to build a joint index on the query fields rather than separate single-column indexes.

Single-column index case:
explain select id, phone, name from tb_user where phone = '17799990010' and name = '韩信';
With separate single-column indexes on phone and name, this statement uses only the index on phone, so a table lookup is needed to check name.

Precautions
  • When a query has several conditions, the MySQL optimizer evaluates which index is more efficient and chooses that index to complete the query.
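
For comparison, here is a sketch of the joint-index alternative for the same query. The index name idx_user_phone_name is hypothetical, and UNIQUE is used on the assumption that phone values are unique.

```sql
-- Joint index covering both query conditions.
CREATE UNIQUE INDEX idx_user_phone_name ON tb_user (phone, name);

-- phone, name and the primary key id are all stored in the joint index,
-- so the query can be answered without a table lookup if this index is chosen.
EXPLAIN SELECT id, phone, name FROM tb_user
WHERE phone = '17799990010' AND name = '韩信';

-- If the optimizer still prefers a single-column index, a hint can suggest the joint one.
EXPLAIN SELECT id, phone, name FROM tb_user USE INDEX (idx_user_phone_name)
WHERE phone = '17799990010' AND name = '韩信';
```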

9. Index design principles

Design Principles

  1. Create indexes for tables with large amounts of data and frequent queries
  2. Create indexes for fields that are often used as query conditions (where), sorting (order by), and grouping (group by) operations
  3. Try to choose columns with high differentiation (selectivity) as index columns, and build unique indexes where possible. The higher the differentiation, the more efficient the index.
  4. If it is a string type field and the field length is long, a prefix index can be established based on the characteristics of the field.
  5. Prefer joint indexes over single-column indexes. A joint index can often cover the query, which saves storage space, avoids table lookups, and improves query efficiency.
  6. Control the number of indexes. More indexes are not always better: the more indexes there are, the greater the cost of maintaining them, which slows down inserts, deletes, and updates.
  7. If an indexed column cannot store NULL values, declare it NOT NULL when creating the table. When the optimizer knows whether each column can contain NULL, it can better determine which index is most efficient for the query.

10. Index failure situation

10.1 Index column operations

If an operation is applied to an indexed column in a query condition, the index on that column will not be used. For example: explain select * from tb_user where substring(phone, 10, 2) = '15';

Index column operations are operations (such as calculations or function calls) applied to an index column in a query condition. Operating on an indexed column can cause the index to fail. Here are some common reasons:

  1. The index stores original values, not results: An operation on an index column changes the value being compared, so it can no longer be matched against the key values in the index. For example, with WHERE UPPER(column) = 'VALUE', the index holds only the original column values, not the results of UPPER, so the database cannot look up 'VALUE' in the index directly and has to evaluate the function for each row.

  2. Operation result type mismatch: The index is sorted and stored according to the column's data type. If an operation produces a result of a different type, the index order no longer applies to it. For example, concatenating a string onto an integer index column yields a string, and the index on the integer values cannot be used to speed up the search.

  3. The operation breaks the ordering of the index column: The main purpose of an index is to provide ordering so that data can be located and filtered quickly. If the result of an operation does not follow the order of the stored column values, the index cannot help with the query. For example, if an irreversible hash function is applied to the column in the query condition, the hashed values are not in the same order as the index.

To avoid index failure, try not to apply operations to index columns in query conditions. If an operation is really needed, consider the following solutions:

  • Move the operation off the index column: If the operation is reversible, apply its inverse to the query parameter instead of the index column, so the column is compared as stored.
  • Use function indexes: Some database management systems support function indexes, which index the result of a specific expression to serve queries that use that expression.
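
The two solutions above can be sketched as follows. The example assumes id is the primary key of tb_user, and the index name idx_user_phone_tail is made up for illustration. MySQL supports function indexes (functional key parts) from version 8.0.13, so the last two statements do not apply to older versions.

```sql
-- Operation on the indexed column: the index on id cannot be used to look up the row.
explain select * from tb_user where id + 1 = 10;

-- Same condition with the operation moved to the constant side.
explain select * from tb_user where id = 9;

-- Function index (MySQL 8.0.13 or later). The expression needs its own parentheses.
create index idx_user_phone_tail on tb_user((substring(phone, 10, 2)));

-- The query must use the same expression as the index definition.
explain select * from tb_user where substring(phone, 10, 2) = '15';
```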

10.2 Strings without quotes

If a string value is not quoted in a query condition, the index on the string column may become invalid. Here are some common reasons:

  1. Data type mismatch: String literals must be written in quotes. An unquoted value is treated as something else: a run of digits is a number, and other text is read as a column or function name. When a string column is compared with a number, MySQL has to convert the column values implicitly before comparing them, which amounts to an operation on the indexed column, so the index cannot be used for the lookup.

  2. String comparison problem: String comparison relies on the collation of the column, and the index is ordered by that collation. When the value is unquoted and parsed as a number, the comparison is done numerically instead, and the index order does not help with it.

  3. Syntax error: Unquoted text that is neither a valid number nor a valid identifier is not legal SQL. The statement then fails with an error instead of running, so no index is used at all.

To avoid this problem, quote every string value in query conditions. The database then treats the value as a string and can compare it using the column's collation and the index. Refer to the MySQL documentation for the exact syntax and type conversion rules.
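
A minimal comparison, assuming phone is a string column with an index on it:

```sql
-- Quoted literal: string compared with string, so the index on phone can be used.
explain select * from tb_user where phone = '17799990010';

-- Unquoted literal: the value is a number, the column values are converted implicitly,
-- and the index on phone is not used for the lookup.
explain select * from tb_user where phone = 17799990010;
```

Both statements are valid SQL and can return the same row, so the difference shows up only in the EXPLAIN output.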

10.3 Fuzzy query

In a fuzzy query, a trailing wildcard (tail fuzzy match) does not invalidate the index, but a leading wildcard (head fuzzy match) does. For example, explain select * from tb_user where profession like '%工程'; does not use the index. A pattern with % both before and after is also invalid.

The following explains why the index works for tail fuzzy matching and fails for head fuzzy matching:

  1. Tail fuzzy matching: If the wildcard (such as %) appears only at the end of the search string, the index can still be used effectively. For example, with LIKE 'abc%' the database can search the index for entries that start with "abc" and return the matching results.

  2. Head fuzzy matching: If the wildcard (such as %) appears at the beginning of the search string, the index fails. For example, with LIKE '%abc' the database cannot use the index to narrow the search because it does not know how the matching values start.

The main reason is that the index stores values in sorted order, compared from the first character onward. A leading wildcard leaves no fixed prefix to search for, so every entry has to be examined and the ordering of the index cannot be used to locate and filter rows efficiently.
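
The three cases can be compared with EXPLAIN. This sketch assumes an index exists on profession, and the prefix '软件' is only an illustrative value.

```sql
-- Trailing wildcard: the fixed prefix allows a range scan on the index.
explain select * from tb_user where profession like '软件%';

-- Leading wildcard: no fixed prefix, so the index cannot narrow the search.
explain select * from tb_user where profession like '%工程';

-- Wildcards on both sides: same problem as the leading wildcard.
explain select * from tb_user where profession like '%工%';
```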

10.4 Conditions for or connection

When conditions are combined with OR, if the column in one of the conditions has no index, the indexes on the other columns will not be used either. This is due to the following reasons:

  1. Filtering power: An index is useful when it narrows down the rows to read. With OR, a row qualifies if it matches either condition, so an index on one column narrows only its own condition. The rows matching the unindexed condition can be anywhere in the table, and the index does nothing to filter them.

  2. Cost estimation of the query plan: When the optimizer chooses a query plan, it estimates the cost of each possible execution path. Since the unindexed condition requires reading every row anyway, a path that also reads an index only adds cost, so the optimizer may choose a path that does not use indexes.

  3. Logical structure: For the OR operator, the database needs to evaluate each condition independently and combine the results. If the column in one condition has no index, the database has to scan the entire table to evaluate that condition, which makes index access for the other conditions redundant. To avoid unnecessary data access and merge operations, the optimizer may choose not to use any index.

To solve this problem, consider the following options:

  • Make sure every column used in the OR conditions has a suitable index.
  • For large tables, consider rewriting the query: split the OR into independent queries and merge the results with UNION or UNION ALL. Each subquery can then use the index on its own column.
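
Both options can be sketched as follows. The example assumes id is the primary key of tb_user and that name starts out without an index. The index name idx_user_name is made up for illustration.

```sql
-- One side of the OR has no index, so the primary key index is not used either.
explain select * from tb_user where id = 10 or name = '韩信';

-- Option 1: index the remaining column, then run the EXPLAIN above again.
create index idx_user_name on tb_user(name);

-- Option 2 (also relies on the index on name): split the OR into two queries.
-- UNION removes a row matched by both branches; UNION ALL would return it twice.
explain
select * from tb_user where id = 10
union
select * from tb_user where name = '韩信';
```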

10.5 Impact of data distribution

When MySQL evaluates that using an index is slower than a full table scan, it will choose not to use an index. Here is an example:

Take a student table with a column info. Suppose the info field of most records is empty (NULL) and the column is indexed. Then the following query is executed:

SELECT * FROM student WHERE info IS NULL;

In this case, MySQL's optimizer may choose not to use the index on that column.

Here’s why:

  1. Poor index selectivity: Since the info field of most records is empty, the selectivity of the index column is very low. The selectivity of an index is the degree to which its values are distinct. When selectivity is very low, the index cannot filter rows well. The optimizer may consider a full table scan more efficient, because looking up entries in the index and then fetching the data blocks can cost more than reading the table directly.

  2. Data access cost estimation: Since the info field of most records is empty, a range scan on this index returns most of the table, and a large number of data blocks still have to be accessed, which increases the cost of the query. Given the overall data distribution, the optimizer may decide that a direct full table scan is cheaper.

For these reasons, the MySQL optimizer may choose not to use the index on this column and instead use a full table scan to find the records whose info is empty.
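
The effect of data distribution can be checked with a pair of EXPLAIN statements. This is a sketch under the scenario above (most info values are NULL), and the index name idx_student_info is hypothetical. The actual plan depends on the table's statistics.

```sql
-- Hypothetical index on the info column.
create index idx_student_info on student(info);

-- Matches most rows: the optimizer may prefer a full table scan (type = ALL).
explain select * from student where info is null;

-- Matches few rows: the index is a more likely choice here.
explain select * from student where info is not null;
```

If the distribution were reversed, with only a few NULL values, the two plans would be expected to swap.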

Origin blog.csdn.net/weixin_55772633/article/details/132215904