Campus Recruitment Interview: Database Principles Review Summary II — Indexes

MySQL index and performance optimization knowledge summary

1. Definition and characteristics

An index is a separate, on-disk database structure that contains reference pointers to all the records in a table. Using an index, MySQL can quickly find rows with specific values in one or more columns. All MySQL column types can be indexed, and indexing the relevant columns is the best way to improve the speed of query operations.

MySQL stores indexes in two structures, BTREE and HASH; which one is available depends on the table's storage engine.

The MyISAM and InnoDB storage engines support only BTREE indexes;

The MEMORY/HEAP storage engine supports both HASH and BTREE indexes.
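Below is a minimal sketch (with hypothetical table and column names) of choosing the index structure per storage engine with the USING clause; if USING HASH were requested on an InnoDB table, it would silently fall back to BTREE.

    -- MEMORY supports HASH indexes (its default index type)
    CREATE TABLE session_cache (
      session_id CHAR(32) NOT NULL,
      user_id INT NOT NULL,
      INDEX idx_session USING HASH (session_id)
    ) ENGINE=MEMORY;

    -- InnoDB and MyISAM use BTREE indexes
    CREATE TABLE orders (
      id INT NOT NULL,
      created_at DATETIME NOT NULL,
      INDEX idx_created USING BTREE (created_at)
    ) ENGINE=InnoDB;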

The main advantages of indexes are the following:

  1. By creating a unique index, the uniqueness of each row of data in the database table can be guaranteed.

  2. They can greatly speed up data queries, which is the main reason for creating indexes.

  3. They can speed up joins between tables, which helps when enforcing referential integrity of the data.

  4. When a query uses grouping or sorting clauses, they can also significantly reduce the time spent grouping and sorting.

Indexes also have disadvantages, mainly in the following aspects:

  1. Creating and maintaining indexes takes time, and this cost grows with the amount of data. (costs time)
  2. Indexes occupy disk space. In addition to the space used by the data itself, each index takes up a certain amount of physical space. With a large number of indexes, the index files may reach the maximum file size sooner than the data files. (costs space)
  3. When rows are inserted, deleted, or updated, the indexes must also be maintained dynamically, which slows down data modification. (Indexes are therefore best built on infrequently updated tables.)

2. Index classification

MySQL indexes can be divided into the following categories:

  1. Ordinary index and unique index
    An ordinary index is the basic index type in MySQL; it allows duplicate values and NULL values in the indexed columns.
    A unique index requires that the values of the indexed columns be unique, although NULL values are allowed. For a composite unique index, the combination of column values must be unique.
    A primary key index is a special kind of unique index that does not allow NULL values.

  2. Single-column index and composite index
    Single-column index means that an index only contains a single column, and a table can have multiple single-column indexes.
    A composite index is created on a combination of multiple columns of a table. The index is used only when the leftmost of those columns appears in the query condition; composite indexes follow the leftmost prefix principle.

  3. Full-text index
    The full-text index type is FULLTEXT. It supports full-text searches on the values of the indexed columns and allows duplicate and NULL values in those columns. Full-text indexes can be created on columns of type CHAR, VARCHAR, or TEXT.

  4. Spatial index
    A spatial index is built on columns of spatial data types. MySQL has four spatial data types:
    GEOMETRY, POINT, LINESTRING, and POLYGON. MySQL extends the syntax with the SPATIAL keyword, so a spatial index is created much like a regular index. A column used in a spatial index must be declared NOT NULL. Spatial indexes can only be created on tables whose storage engine is MyISAM. The MyISAM storage engine supports spatial indexes (R-Tree), which are useful for storing geographic data: a spatial index indexes the data in all dimensions and can efficiently serve combined queries on any dimension. The data must be maintained using GIS-related functions.
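    A minimal sketch of creating a spatial index as described above (hypothetical table and column names):

    CREATE TABLE shop (
      id INT NOT NULL,
      location POINT NOT NULL,          -- the spatial column must be NOT NULL
      SPATIAL INDEX idx_location (location)
    ) ENGINE=MyISAM;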

3. Index creation

  1. Create an index when creating a table:

    When you create a table with CREATE TABLE, in addition to defining the data type of each column, you can also define a primary key constraint, a foreign key constraint, or a unique constraint. Whichever constraint you define, MySQL creates an index on the specified column(s).

    CREATE TABLE table_name [col_name data_type]
    -- UNIQUE, FULLTEXT and SPATIAL are optional and indicate a unique, full-text
    -- or spatial index respectively; INDEX and KEY are synonyms, both specify
    -- that an index is created.
    [UNIQUE|FULLTEXT|SPATIAL] [INDEX|KEY] [index_name] (col_name [length])
    [ASC|DESC]
    -- e.g.:
    CREATE TABLE t1 (
    id INT NOT NULL,
    name CHAR(30) NOT NULL,
    -- create a unique index named UniqIdx on the id column
    UNIQUE INDEX UniqIdx(id)
    );
    
  2. Create indexes on existing tables

    To create an index on an existing table, use the ALTER TABLE statement or the CREATE INDEX statement.

    ALTER TABLE table_name ADD
    [UNIQUE|FULLTEXT|SPATIAL] [INDEX|KEY] [index_name] (col_name[length],...)
    [ASC|DESC]
    -- e.g.: create a unique index named UniqidIdx on the bookId column
    ALTER TABLE book ADD UNIQUE INDEX UniqidIdx (bookId);
    
    CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name
    ON table_name (col_name [length],...) [ASC|DESC]
    -- e.g.:
    CREATE UNIQUE INDEX UniqidIdx ON book (bookId);
    

4. When to use indexes and indexing principles

It is recommended to create indexes according to the following principles:

  1. Specify a unique index when uniqueness is a characteristic of the data itself. A unique index both guarantees the integrity of the indexed column and improves query speed.
  2. Build indexes on columns that are frequently sorted or grouped (that is, used in GROUP BY or ORDER BY operations). If several columns are sorted together, a composite index can be built on those columns.

It is recommended to design indexes according to the following principles:

  1. Avoid creating too many indexes on frequently updated tables, and keep each index to as few columns as possible. Create indexes on columns that are frequently used in queries, and avoid adding unnecessary columns.
  2. It is usually not worth indexing tables with little data: with so few rows, scanning the table may take less time than traversing the index, so the index may bring no benefit.
  3. Build indexes on columns with many distinct values that are often used in conditional expressions, and do not build indexes on columns with few distinct values. For example, the "gender" column of a student table has only the two values "male" and "female", so there is no need to index it; such an index would not improve query efficiency and would seriously slow down updates.

5. Determine whether the index of the database is valid

You can use the EXPLAIN statement to see if an index is in use.

EXPLAIN SELECT * FROM book WHERE year_publication=1990;

The EXPLAIN statement outputs detailed information about how the SQL statement will be executed, including:
The possible_keys column lists the indexes MySQL could choose from when searching for the data records.
The key column shows the index MySQL actually selected.

If both the possible_keys and key columns contain the index on year_publication, the query is using that index.
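A minimal sketch of checking index usage with EXPLAIN before and after creating the index (it assumes the book table used above has a year_publication column):

    -- Without an index on year_publication, key is typically NULL (full table scan)
    EXPLAIN SELECT * FROM book WHERE year_publication=1990;

    -- Create the index and re-run EXPLAIN; possible_keys and key should now
    -- show the new index
    CREATE INDEX idx_year_publication ON book (year_publication);
    EXPLAIN SELECT * FROM book WHERE year_publication=1990;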

6. Index invalidation

Index failure (MySQL falling back to a full table scan) can be avoided by keeping the following rules in mind:

  1. When using a composite index, follow the **"leftmost prefix" principle**;
  2. Do not apply any operation to an indexed column (calculation, function call, type conversion); doing so invalidates the index and causes a full table scan;
  3. Try to use covering indexes (queries that access only indexed columns) and avoid SELECT *, to reduce the number of lookups back to the table;
  4. MySQL cannot use an index with inequality operators (!= or <>), which results in a full table scan;
  5. A LIKE pattern that starts with a wildcard (%abc) makes the index unusable and turns the query into a full table scan;
  6. If a string value is not enclosed in single quotes, the index fails (an implicit conversion may be applied to the indexed column);
  7. Use OR sparingly: the index is used only if every column in the OR conditions is indexed.
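A few hedged illustrations of these failure modes, assuming a table users with an indexed VARCHAR column name and an indexed DATETIME column created_at (hypothetical names):

    -- A function on the indexed column defeats the index
    SELECT * FROM users WHERE DATE(created_at) = '2022-01-01';
    -- Rewritten as a range on the raw column, the index can be used
    SELECT * FROM users
    WHERE created_at >= '2022-01-01' AND created_at < '2022-01-02';

    -- Implicit conversion: comparing a VARCHAR column with a number
    -- converts the column and defeats the index
    SELECT * FROM users WHERE name = 123;     -- index likely not used
    SELECT * FROM users WHERE name = '123';   -- index can be used

    -- A leading wildcard defeats the index; a trailing wildcard does not
    SELECT * FROM users WHERE name LIKE '%abc';   -- full table scan
    SELECT * FROM users WHERE name LIKE 'abc%';   -- index can be used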

7. Index implementation principle

The primary key index in InnoDB is a clustered index, while the primary key index in MyISAM is a non-clustered index. Both use a B+ tree as the underlying structure, because a key property of B+ tree indexes in databases is their high fan-out.

7.1 MyISAM index implementation: (non-clustered index)

  1. The MyISAM index file and data file are separate; the index file stores only the addresses of the data records.

  2. The MyISAM engine uses a B+Tree as the index structure; the data field of each leaf node stores the address of the data record.

  3. In MyISAM there is no structural difference between the primary index and a secondary index (secondary key), except that the keys of the primary index must be unique while the keys of a secondary index may repeat.

  • The primary key index is as follows:

(figure: MyISAM primary key index structure)

  • Auxiliary index: Create an auxiliary index on Col2, then the structure of this index is shown in the figure below

(figure: MyISAM secondary index on Col2)

7.2 InnoDB index implementation

  1. Primary key index (clustered index)

    In InnoDB, the table data file itself is an index structure organized as a B+Tree, and the data field of the leaf nodes of this tree stores the complete data records. The key of this index is the table's primary key, so the InnoDB table data file itself is the primary index.

(figure: InnoDB clustered primary key index)

  2. Secondary index (non-clustered index)

    The data field of an InnoDB secondary index stores the primary key value of the corresponding record rather than its address.

(figure: InnoDB secondary index structure)

The difference between clustered index and non-clustered index

The leaf nodes of a clustered index store the primary key values together with the data rows and can satisfy covering queries; the leaf nodes of a secondary index store primary key values (InnoDB) or pointers to the data rows (MyISAM).

Because the data pages can only be sorted in one order (one B+ tree), a table can have only one clustered index. Secondary indexes do not affect how the data is organized in the clustered index, so a table can have multiple secondary indexes.

Index diagrams of InnoDB and MyISAM

(figure: comparison of InnoDB and MyISAM index layouts)

8. Index refactoring

  An index is worth rebuilding when:

  1. Frequent update and delete operations occur on the table;
  2. An ALTER TABLE ... MOVE operation occurred on the table (the move operation causes rowids to change).


9. Interview questions:

9.1 What is the difference between MySQL's Hash index and B-tree index?

Reference answer

  1. The underlying structure of a hash index is a hash table. A lookup calls the hash function once to locate the corresponding key, and then fetches the actual row with a lookup back to the table.
  2. The underlying structure of a B+ tree index is a multi-way balanced search tree. Every lookup starts at the root node and descends to a leaf node to find the key; whether a lookup back to the table is then needed depends on the query.
  3. They differ in the following ways:
    • A hash index is (in general) faster for equality lookups, but it cannot serve range queries: once the hash function has been applied, the order of the index no longer matches the original key order, so ranges cannot be supported. The nodes of a B+ tree, by contrast, are kept ordered (left children smaller than the parent, right children larger, generalized to a multi-way tree), which naturally supports range queries.
    • A hash index cannot be used for sorting, for the same reason as above.
    • A hash index does not support fuzzy (prefix) matching or leftmost-prefix matching on multi-column indexes, again because the hash destroys ordering. A hash index always has to go back to the table for the row data, whereas a B+ tree can answer a query from the index alone when certain conditions are met (clustered index, covering index, etc.).
    • Although a hash index is faster for equality lookups, its performance is unstable and hard to predict: when a key value has many duplicates, hash collisions occur and performance can become very poor. The query performance of a B+ tree is comparatively stable: every query walks from the root to a leaf, and the tree height is low.
  4. Therefore, in most cases a B+ tree index gives stable and good query performance and is preferred over a hash index.

9.2 How does an IN clause in a SELECT statement use an index?

Reference answer

Whether the index works depends mainly on the field type:

  • If the column type is a string, the values in the IN list must be enclosed in quotes (even values that look numeric), so that the index can be used.
  • If the column type is int, the values in the IN list do not need quotes and the index will be used.
  • Columns used with IN behave the same way within a composite index.
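A minimal sketch, assuming a table users with an indexed VARCHAR column code and an indexed INT column status (hypothetical names):

    -- code is VARCHAR: quote the values so no implicit conversion occurs
    SELECT * FROM users WHERE code IN ('1001', '1002', 'A003');

    -- status is INT: unquoted numeric literals are fine
    SELECT * FROM users WHERE status IN (1, 2, 3);

    -- Without quotes on a string column, implicit conversion may defeat the index
    SELECT * FROM users WHERE code IN (1001, 1002);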

9.3 How to use indexes in fuzzy query statements

Reference answer
In MySQL, a fuzzy query such as mobile LIKE '%8765' cannot use the index on mobile. If you need to search by the last four digits of a phone number, the query can be rewritten as follows.

We can add a redundant column (MySQL 5.7 added virtual/generated columns, which are a better fit; the idea is the same), for example mobile_reverse, which stores the reversed text of mobile. If mobile is 17312345678, then mobile_reverse stores 87654321371. Build an index on the mobile_reverse column and query with mobile_reverse LIKE REVERSE('%5678').

REVERSE is MySQL's string-reversal function, so this condition is equivalent to mobile_reverse LIKE '8765%', which can use the index.
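A hedged sketch of this approach using a MySQL 5.7+ stored generated column (table and column names are hypothetical):

    -- Store the reversed phone number as a generated column and index it
    ALTER TABLE users
      ADD COLUMN mobile_reverse VARCHAR(20)
        GENERATED ALWAYS AS (REVERSE(mobile)) STORED,
      ADD INDEX idx_mobile_reverse (mobile_reverse);

    -- Search by the last four digits; the pattern becomes '8765%',
    -- so idx_mobile_reverse can be used
    SELECT * FROM users
    WHERE mobile_reverse LIKE CONCAT(REVERSE('5678'), '%');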

9.4 Comparison of B+ tree, B tree, and red-black tree

Balanced trees such as red-black trees can also be used to implement indexes, but file systems and database systems generally use the B+ Tree as the index structure, because accessing disk-resident data through a B+ tree performs better.

(1) The B+ tree has a lower tree height

The height of a balanced tree is O(h) = O(log_d N), where d is the out-degree (fan-out) of each node. A red-black tree has an out-degree of 2, while the out-degree of a B+ Tree is usually very large, so the height of a red-black tree is clearly much greater than that of a B+ Tree. Each node of a B-tree stores both key and data: the key is the unique key value of a data record, and data holds the rest of that record.

A B+ tree stores data only in its leaf nodes, so non-leaf nodes can hold more keys. As a result the B+ tree is shorter and wider than a B-tree; since the index tree is too large to be read into memory at once, a shallower tree means fewer I/O operations per lookup and faster searches. Each leaf node of a B+ tree also holds a pointer to the adjacent leaf, forming an ordered linked list that can be traversed in key order; because the data is ordered and linked, range searches are convenient. In a B-tree the leaf node pointers are null, so each level has to be traversed recursively, and adjacent elements may not be adjacent in memory, so cache hits are worse than with a B+ tree.

(2) Disk access principle

The operating system divides memory and disk into fixed-size blocks; each block is called a page, and memory and disk exchange data in units of pages. Database systems set the size of an index node to the size of a page, so that a whole node can be loaded in a single I/O.
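For reference, the InnoDB page size (16 KB by default) can be checked with the statement below.

    SHOW VARIABLES LIKE 'innodb_page_size';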

If the required data is not in the same disk block, the disk usually has to move its actuator arm to seek, and the mechanics of the arm make this movement slow, which increases the time needed to read the data. A B+ tree has a lower height than a red-black tree, and the number of seeks is proportional to the tree height; accessing data within the same disk block requires only a short rotational delay. The B+ tree is therefore better suited to reading data stored on disk.

(3) Disk read-ahead feature

To reduce disk I/O, the disk is usually not read strictly on demand but with read-ahead. During read-ahead the disk reads sequentially, which requires no seeking and only a short rotational delay, so it is very fast. A B+ tree can take advantage of this read-ahead behavior: adjacent nodes can be preloaded as well.

10. Index optimization

  1. Independent columns
    Indexed columns must appear on their own in the query condition: they cannot be part of an expression or an argument to a function, otherwise the index cannot be used.

  2. Multi-column indexes
    When you need to use multiple columns as conditions for queries, using multi-column indexes is better than using multiple single-column indexes.

  3. The order of the index columns
    Put the most selective columns first. Index selectivity is the ratio of the number of distinct index values to the total number of records; its maximum value is 1, in which case every record has a unique index value. The higher the selectivity, the better individual records are distinguished and the more efficient the query.

  4. Prefix index
    For columns of type BLOB, TEXT, or long VARCHAR, a prefix index must be used, indexing only the first part of the characters. The prefix length should be chosen based on index selectivity.

  5. Covering index
    An index that contains all the columns the query needs to read.

    It has the following advantages:

    Index entries are usually much smaller than data rows, so reading only the index greatly reduces the amount of data accessed.
    Some storage engines (such as MyISAM) cache only the index in memory and rely on the operating system to cache the data, so accessing only the index avoids time-consuming system calls.
    For the InnoDB engine, if a secondary index can cover the query, there is no need to access the primary (clustered) index.
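A hedged sketch of a prefix index and a covering index, assuming a table customer with columns last_name and city (hypothetical names):

    -- Check the selectivity of a prefix before choosing its length
    SELECT COUNT(DISTINCT LEFT(last_name, 5)) / COUNT(*) AS sel_prefix5,
           COUNT(DISTINCT last_name) / COUNT(*)          AS sel_full
    FROM customer;

    -- Prefix index on the first 5 characters of last_name
    CREATE INDEX idx_last_name_prefix ON customer (last_name(5));

    -- Covering index: the query below reads only (city, last_name),
    -- so EXPLAIN should show "Using index" (no lookup back to the table)
    CREATE INDEX idx_city_last_name ON customer (city, last_name);
    SELECT last_name FROM customer WHERE city = 'Beijing';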

11. Query performance optimization

  1. Use Explain to analyze
    Explain is used to analyze the SELECT query statement, and developers can optimize the query statement by analyzing the Explain result.

    The more important fields are:

    • select_type : the query type, such as simple query, union query, subquery, etc.
    • key : the index actually used
    • rows : the number of rows scanned
    • Extra : additional information; the main things to check are whether Using filesort or Using temporary appear.
  2. Optimize data access

    1. Reduce the amount of data requested

      • Only return necessary columns: it is best not to use SELECT * statements.
      • Return only necessary rows: Use the LIMIT statement to limit the data returned.
      • Cache repeatedly queried data: using a cache avoids querying the database; when the same data is queried again and again, the performance gain from caching is very noticeable.

    2. The most efficient way to reduce the number of rows scanned on the server side is to use a covering index.

  3. Refactor query method

    1. Segmentation of large queries
      If a large query is executed in one go, it may lock a lot of data at once, fill up the transaction log, exhaust system resources, and block many small but important queries. For example, the one-shot delete below can be split into batches (a batched version is sketched after it).
    DELETE FROM messages WHERE `create` < DATE_SUB(NOW(), INTERVAL 3 MONTH);
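      A hedged sketch of splitting that delete into smaller batches (same hypothetical messages table): delete a limited number of rows per statement so that locks and the transaction log stay small, and let the application repeat the statement until no rows are affected.

    -- Repeat until ROW_COUNT() returns 0, pausing briefly between batches
    DELETE FROM messages WHERE `create` < DATE_SUB(NOW(), INTERVAL 3 MONTH) LIMIT 10000;
    SELECT ROW_COUNT();  -- the application loops while this is greater than 0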
    
    2. Decompose a large join query
      Decompose a large join query into single-table queries on each of the tables, then combine the results in the application. The benefits of doing so are:

      • Make caching more efficient. With a join query, if any of the tables changes, the entire cached query result becomes unusable. After decomposing it into multiple queries, even if one of the tables changes, the cached results for the other tables can still be used.
      • Decomposed into multiple single-table queries, the cached results of these single-table queries are more likely to be used by other queries, thereby reducing the query of redundant records.
      • Reduce lock contention;
      • Connecting at the application layer makes it easier to split the database, making it easier to achieve high performance and scalability .
      • The efficiency of the query itself may also improve. In the example below, using IN() instead of a join lets MySQL fetch rows in ID order, which may be more efficient than a random-order join.
      SELECT * FROM tag
      JOIN tag_post ON tag_post.tag_id=tag.id
      JOIN post ON tag_post.post_id=post.id
      WHERE tag.tag='mysql';
      # after optimization
      SELECT * FROM tag WHERE tag='mysql';
      SELECT * FROM tag_post WHERE tag_id=1234;
      SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904);
      
  4. Optimize subqueries:

    • A subquery nests one SELECT statement inside another: the result of one SELECT is used as a condition of another SELECT. Subqueries can perform, in a single statement, operations that would logically require several steps.
    • Although subqueries make query statements very flexible, they are not efficient to execute. When executing a subquery, MySQL creates a temporary table for the result of the inner query; the outer query then reads records from that temporary table, which is dropped when the query finishes. The subquery is therefore slower, and the impact grows as the amount of data involved increases.
    • In MySQL, a join (JOIN) query can often replace a subquery. A join does not need a temporary table and is faster than the equivalent subquery; with suitable indexes the performance is even better.
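A hedged sketch of such a rewrite, assuming customer and orders tables (hypothetical names):

    -- Subquery version: MySQL may materialize the inner result in a temporary table
    SELECT * FROM customer
    WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100);

    -- Equivalent join version, usually faster when orders.customer_id is indexed
    SELECT DISTINCT c.*
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 100;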

12. Insert performance optimization

For InnoDB engine tables, common optimization methods are as follows:

  1. Disable the uniqueness check
    Execute SET unique_checks=0 before inserting the data to disable checks on unique indexes, and run SET unique_checks=1 after the import completes. A similar approach is used with the MyISAM engine.
  2. Disable foreign key check
    Disable the foreign key check before inserting data, and resume the foreign key check after the data is inserted.
  3. Disable automatic commit
    Disable the automatic commit of the transaction before inserting data. After the data import is completed, execute the resume automatic commit operation.
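A hedged sketch of a bulk-load session that combines these three settings and restores them afterwards:

    SET unique_checks = 0;       -- 1. skip unique index checks
    SET foreign_key_checks = 0;  -- 2. skip foreign key checks
    SET autocommit = 0;          -- 3. do not commit after every statement

    -- ... bulk INSERT statements or LOAD DATA here ...

    COMMIT;
    SET autocommit = 1;
    SET foreign_key_checks = 1;
    SET unique_checks = 1;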

13. MySql slow query optimization

  • Enable the slow query log: the slow query log is disabled by default in MySQL. It can be enabled through the log-slow-queries option (slow_query_log in newer versions) in the configuration file my.ini or my.cnf, or with --log-slow-queries[=file_name] when the MySQL server starts. When enabling the slow query log, also configure the long_query_time option in my.ini or my.cnf to set the recording threshold: any query whose execution time exceeds this value is recorded in the slow query log file. (A runtime example is sketched below.)
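    A hedged sketch of enabling it at runtime via the dynamic system variables (values are examples):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;      -- log queries that take longer than 1 second
    SHOW VARIABLES LIKE 'slow_query%';   -- check whether it is on and where the log file is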

  • Analyze slow query logs:
    examine the MySQL slow query log directly, and use the EXPLAIN keyword to simulate the optimizer executing the SQL statements in order to analyze the slow queries.

  • Common slow query optimization:

  1. When the index does not work
    In a query that uses the LIKE keyword, if the match string begins with "%", the index does not work; the index is used only when "%" is not in the first position.
    MySQL can create an index on multiple columns; one index can include up to 16 columns. For such a multi-column index, the index is used only when the first of those columns appears in the query condition.
    When the query condition uses the OR keyword, the index is used only if the columns in both conditions around the OR are indexed; otherwise the query does not use an index.
  2. Optimizing the database structure
    For a table with many columns, if some columns are rarely used, they can be split off into a new table,
    because when a table holds a large amount of data, the presence of these low-frequency columns slows it down.
    For queries that are run frequently, an intermediate table can be created to improve efficiency:
    insert the data that frequently needs to be joined into the intermediate table, then change the original join query into a query against the intermediate table.
  3. Decompose join queries
    Many high-performance applications decompose join queries: run a single-table query on each table, then combine the results in the application. This is more efficient in many scenarios.
  4. Optimize LIMIT pagination
    When the offset is very large, for example LIMIT 10000, 20, MySQL has to fetch 10020 rows and discard the first 10000, returning only the last 20, which is very expensive. The simplest optimization for such queries is to use an index-covering scan wherever possible instead of querying all columns, and then join back as needed to return the required columns. When the offset is large, this greatly improves efficiency (see the sketch below).
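A hedged sketch of this rewrite (sometimes called a deferred join), assuming a film table with primary key id and an index on title (hypothetical names):

    -- Naive pagination: fetches and discards 10000 full rows
    SELECT * FROM film ORDER BY title LIMIT 10000, 20;

    -- Deferred join: page through the index on title first,
    -- then fetch only the 20 needed rows by primary key
    SELECT f.*
    FROM film f
    JOIN (SELECT id FROM film ORDER BY title LIMIT 10000, 20) AS page
      ON f.id = page.id;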

14. What to do when a table contains tens of millions of rows

It is recommended to optimize in the following order:

  1. Optimize SQL and indexes;
  2. Increase cache, such as memcached, redis;
  3. Separate reads and writes; master-slave replication or master-master replication can be used;
  4. Use MySQL's built-in partitioned tables, which are transparent to the application and require no code changes, although the SQL statements should be optimized for the partitioning scheme;
  5. Do vertical splitting, that is, divide a large system into several smaller systems according to how tightly the modules are coupled;
  6. Do horizontal splitting (sharding): a reasonable sharding key must be chosen, and to keep queries efficient the table structure usually has to change, some redundancy is introduced, and the application must also be modified. SQL statements should include the sharding key whenever possible, so that queries are routed to a limited set of tables instead of scanning all of them.


Origin blog.csdn.net/zhangkai__/article/details/126512205