[Analysis of MySQL InnoDB Architecture] 3. InnoDB Indexes

This article is mainly a manual translation of the InnoDB index chapter of the official MySQL 5.7 reference manual, and it can be read alongside the original. Please point out any mistakes. Thank you for reading, and discussion is welcome!

1. InnoDB clustered index and secondary index

Each InnoDB table has a special index called the clustered index, which stores the row data. In most cases the clustered index is the primary key, and a table can have only one clustered index. To get the best performance from queries, inserts, updates, and deletes, it is important to understand how InnoDB uses the clustered index.

  • When you define a primary key for a table, InnoDB uses it as the clustered index.
  • If you do not define a primary key, InnoDB uses the first UNIQUE index whose key columns are all NOT NULL as the clustered index.
  • If the table has neither a primary key nor a suitable UNIQUE index, InnoDB generates a hidden clustered index named GEN_CLUST_INDEX on a hidden 6-byte row ID column whose values increase monotonically as rows are inserted. Physically, the rows are therefore ordered by row ID, that is, in insertion order.
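
The three rules above can be observed directly. The following is a minimal sketch, assuming MySQL 5.7 (where the InnoDB dictionary views are named INNODB_SYS_*) and a schema called test. The table and index names t_pk, t_uk, t_none, and uk_code are made up for illustration.

```sql
-- Hypothetical tables, one per rule; the names are illustrative only.
CREATE TABLE t_pk   (id INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;
CREATE TABLE t_uk   (code INT NOT NULL, v INT, UNIQUE KEY uk_code (code)) ENGINE=InnoDB;
CREATE TABLE t_none (v INT) ENGINE=InnoDB;

-- Show which indexes InnoDB actually created for each table.
SELECT t.name AS table_name, i.name AS index_name, i.type AS index_type
  FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES AS i
  JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLES  AS t ON t.table_id = i.table_id
 WHERE t.name IN ('test/t_pk', 'test/t_uk', 'test/t_none');
```

If the rules hold, t_pk should list PRIMARY, t_uk should list uk_code as its only index, and t_none should list GEN_CLUST_INDEX.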

1.1 How clustered indexes speed up queries

A clustered index is first of all an ordinary index. Once InnoDB has located the matching page through the primary key condition, the row can be returned directly, because the leaf page of the clustered index contains the complete row data and no second lookup by primary key is needed. For a large table, this often saves a disk I/O compared with storage layouts that keep row data on a different page from the index record.

1.2 How to associate the secondary index with the clustered index

Every index other than the clustered index is a secondary index. Each record in a secondary index contains the indexed columns plus the primary key columns of the row. InnoDB therefore first finds the primary key value through the secondary index and then uses it to look up the row in the clustered index, so a query through a secondary index may need two index lookups instead of one.
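
The difference between a query that the secondary index can answer alone and one that needs the second lookup can be seen in the execution plan. This is a rough sketch with a hypothetical users table and idx_email index; the exact EXPLAIN output depends on the server version and the data.

```sql
-- Hypothetical table: id is the clustered index, idx_email a secondary index.
CREATE TABLE users (
  id    INT NOT NULL PRIMARY KEY,
  email VARCHAR(100) NOT NULL,
  name  VARCHAR(100),
  KEY idx_email (email)
) ENGINE=InnoDB;

-- Only the indexed column and the primary key are requested,
-- so the secondary index record already holds everything needed.
EXPLAIN SELECT id, email FROM users WHERE email = 'a@example.com';

-- name is stored only in the clustered index, so each match found in
-- idx_email requires a further lookup by id in the clustered index.
EXPLAIN SELECT id, email, name FROM users WHERE email = 'a@example.com';
```

For the first statement the Extra column typically reports "Using index" (a covering index); the second does not, because it has to visit the clustered index.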

If the primary key is long, every secondary index also uses more space, so a short primary key is recommended (weigh this against your needs; in 2021, storage space is rarely the primary consideration).
For more on the advantages of clustered and secondary indexes, see the "Clustered and Secondary Indexes" section of the official manual.

2. The physical structure of the InnoDB index

Except for spatial indexes, which use R-trees, all InnoDB indexes are B-tree structures. R-trees are specialized for indexing multidimensional data. In both cases, index records are stored in the leaf pages. The default index page size is 16KB; it is determined by the innodb_page_size setting when the MySQL instance is initialized.

When new records are inserted into an InnoDB clustered index, InnoDB tries to leave 1/16 of each page free for future inserts and updates. If records are inserted sequentially (in ascending or descending order), the resulting index pages are about 15/16 full; if they are inserted in random order, pages are between 1/2 and 15/16 full. Random inserts therefore use page space less efficiently and are more likely to cause page splits.

InnoDB performs a bulk load when creating or rebuilding B-tree indexes. This method is called ordered index construction (a "sorted index build" in the manual) and is described in section 3. The innodb_fill_factor parameter defines the percentage of space on each B-tree page that is filled during an ordered index build, with the remaining space reserved for future index growth. Ordered index construction is not supported for spatial indexes. Even with innodb_fill_factor set to 100, 1/16 of the space in clustered index pages is left free for future index growth (this 1/16 is always reserved).
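
As a small illustration of these settings, the sketch below inspects the page size and fill factor and then builds an index with a lower fill factor. The table t_fill and index idx_v are hypothetical, and the value 80 is only an example, not a recommendation.

```sql
-- Page size is fixed when the instance is initialized; fill factor can be changed at runtime.
SHOW VARIABLES LIKE 'innodb_page_size';
SHOW VARIABLES LIKE 'innodb_fill_factor';

-- Leave roughly 20% of each page free in indexes built from now on.
SET GLOBAL innodb_fill_factor = 80;

-- Hypothetical table; adding the index goes through an ordered (sorted) index build.
CREATE TABLE t_fill (id INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;
ALTER TABLE t_fill ADD INDEX idx_v (v);
```

Changing the global value requires the privilege to set global system variables.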

If the fill level of an index page drops below MERGE_THRESHOLD (50% by default, configurable), InnoDB tries to contract the index tree to free the page. MERGE_THRESHOLD applies to both B-tree and R-tree indexes. For details, see the manual section on configuring the merge threshold for index pages.
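
MERGE_THRESHOLD is configured through COMMENT clauses rather than a system variable. A minimal sketch, assuming MySQL 5.7 and a hypothetical table t_merge; the values 40 and 45 are arbitrary examples.

```sql
-- Set MERGE_THRESHOLD for a single index through the index COMMENT clause.
CREATE TABLE t_merge (
  id INT,
  KEY id_index (id) COMMENT 'MERGE_THRESHOLD=40'
) ENGINE=InnoDB;

-- Set it for the table as a whole through the table COMMENT clause.
ALTER TABLE t_merge COMMENT = 'MERGE_THRESHOLD=45';

-- Inspect the value recorded for the index (MySQL 5.7 view name).
SELECT name, merge_threshold
  FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES
 WHERE name = 'id_index';
```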

3. Ordered index construction

As mentioned above, InnoDB performs a bulk load when creating or rebuilding B-tree indexes. This method is called ordered index construction (sorted index build).
An index build has three phases:

  • Step 1: Scan the clustered index, generate index entries, and add them to the sort buffer. When the sort buffer is full, the entries are sorted and written to a temporary intermediate file. This process is called a "run".
  • Step 2: After one or more runs have been written to the temporary file, perform a merge sort on all entries in the file.
  • Step 3: Insert the sorted index entries into the B-tree.

Before ordered index construction was introduced, index entries were inserted into the B-tree one record at a time through the insert API. This involved opening a B-tree cursor to find the insert position and then inserting the entry into a B-tree page with an optimistic insert. If the insert failed because the page was full, a pessimistic insert was performed instead, which generally involves splitting and merging B-tree nodes to make room for the entry. The main costs of this top-down method are searching for the insert position and the constant splitting and merging of B-tree nodes.

Ordered index construction works bottom-up instead. The manual's description of this part mainly covers the dynamic process of inserting into the index B-tree and is hard to paraphrase, so it is quoted here in the original:

With this approach, a reference to the right-most leaf page is held at all levels of the B-tree. The right-most leaf page at the necessary B-tree depth is allocated and entries are inserted according to their sorted order. Once a leaf page is full, a node pointer is appended to the parent page and a sibling leaf page is allocated for the next insert. This process continues until all entries are inserted, which may result in inserts up to the root level. When a sibling page is allocated, the reference to the previously pinned leaf page is released, and the newly allocated leaf page becomes the right-most leaf page and new default insert location.

3.1 Reserve space for future index expansion

This was already covered above (see the discussion of innodb_fill_factor in section 2), so it is omitted here.

3.2 Ordered index construction and full-text index support

Ordered index construction is supported for full-text indexes. Previously, SQL was used to insert entries into a full-text index (the blogger did not fully understand this second sentence, but it does not matter much).

3.3 Ordered Index Construction and Redo logging

Redo logging is disabled during an ordered index build. Instead, there is a checkpoint to ensure that the index build can survive an unexpected exit or failure. The checkpoint forces all dirty pages to be written to disk. During an ordered index build, the page cleaner thread is signaled periodically to flush dirty pages, so that the checkpoint operation can be processed quickly. Normally, the page cleaner thread flushes dirty pages when the number of clean pages falls below a set threshold. For ordered index builds, dirty pages are flushed promptly to reduce checkpoint overhead and to parallelize I/O and CPU activity.

Blogger's Note: This passage is a relatively general description. Readers need to understand the checkpoint mechanism of redo log first in order to understand it better.

3.4 Ordered Index Construction and Optimizer Statistics

Sorted index builds can cause optimizer statistics to differ from those produced by previous methods of index creation. The difference in statistics, which is not expected to affect workload performance, is due to the difference in the algorithm used to populate the index.

4. InnoDB full-text index

Full-text indexes are created on text-based columns (CHAR, VARCHAR, or TEXT) to speed up queries and DML operations on the data in those columns.
They are created in much the same way as other indexes, using a CREATE TABLE, ALTER TABLE, or CREATE INDEX statement. Full-text searches are performed with the MATCH() ... AGAINST syntax; see the manual for detailed usage.
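
To make the creation statements and the MATCH() ... AGAINST syntax concrete, here is a short sketch. The articles table, its ft_title_body index, and the sample rows are illustrative assumptions, not part of the article.

```sql
-- Hypothetical table; the FULLTEXT index is declared in CREATE TABLE.
CREATE TABLE articles (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT KEY ft_title_body (title, body)
) ENGINE=InnoDB;

-- The same index could instead be added later with either of:
--   ALTER TABLE articles ADD FULLTEXT INDEX ft_title_body (title, body);
--   CREATE FULLTEXT INDEX ft_title_body ON articles (title, body);

INSERT INTO articles (title, body) VALUES
  ('MySQL Tutorial', 'DBMS stands for DataBase ...'),
  ('How To Use MySQL Well', 'After you went through a ...'),
  ('MySQL vs. YourSQL', 'In the following database comparison ...');

-- Natural language search; the column list must match the index definition.
SELECT id, title
  FROM articles
 WHERE MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE);

-- Boolean search: rows that contain "MySQL" but not "YourSQL".
SELECT id, title
  FROM articles
 WHERE MATCH (title, body) AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
```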

4.1 Design of full-text index

InnoDB full-text indexes use an inverted index design. An inverted index stores a list of words and, for each word, the list of documents in which it appears. To support proximity searches, position information for each word is also stored, as a byte offset.

4.2 Full-text index table

When creating an InnoDB full-text index, a set of index tables is created, as shown in the following example:

mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200),
       FULLTEXT idx (opening_line)
       ) ENGINE=InnoDB;

mysql> SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_SYS_TABLES
       WHERE name LIKE 'test/%';
+----------+----------------------------------------------------+-------+
| table_id | name                                               | space |
+----------+----------------------------------------------------+-------+
|      333 | test/FTS_0000000000000147_00000000000001c9_INDEX_1 |   289 |
|      334 | test/FTS_0000000000000147_00000000000001c9_INDEX_2 |   290 |
|      335 | test/FTS_0000000000000147_00000000000001c9_INDEX_3 |   291 |
|      336 | test/FTS_0000000000000147_00000000000001c9_INDEX_4 |   292 |
|      337 | test/FTS_0000000000000147_00000000000001c9_INDEX_5 |   293 |
|      338 | test/FTS_0000000000000147_00000000000001c9_INDEX_6 |   294 |
|      330 | test/FTS_0000000000000147_BEING_DELETED            |   286 |
|      331 | test/FTS_0000000000000147_BEING_DELETED_CACHE      |   287 |
|      332 | test/FTS_0000000000000147_CONFIG                   |   288 |
|      328 | test/FTS_0000000000000147_DELETED                  |   284 |
|      329 | test/FTS_0000000000000147_DELETED_CACHE            |   285 |
|      327 | test/opening_lines                                 |   283 |
+----------+----------------------------------------------------+-------+ 

opening_lines is the table we defined, called the main table, or the indexed table.

The first six index tables make up the inverted index and are called auxiliary index tables. When incoming documents are tokenized, the individual words (also called tokens) are inserted into the index tables together with their position information and the associated DOC_ID. The words are fully sorted and partitioned among the six index tables according to the character set sort weight of each word's first character.

Blogger's Note: In the full-text index context, MySQL refers to a row as a document (doc), so a doc_id is effectively a row ID; each doc is split into words, and the resulting words are called tokens.

The inverted index is partitioned into six auxiliary index tables to support parallel index creation. By default, two threads tokenize, sort, and insert words and their associated data into the index tables. When creating a full-text index on a large table (one with a large amount of indexed text), consider raising innodb_ft_sort_pll_degree to increase the number of worker threads. The default is 2.
As you can see, each auxiliary index table name has the form FTS_000***_000***_INDEX_#. Each auxiliary index table is associated with the indexed table through a hexadecimal number in its name that matches the table_id of the indexed table.
For example, the table_id of the test/opening_lines table is 327, which is 0x147 in hexadecimal. The auxiliary index tables associated with this table are therefore the FTS_000147_000***_INDEX_# tables in the query result.

The second hexadecimal number in the auxiliary index table name is the index_id of the full-text index. For example, in test/FTS_0000000000000147_00000000000001c9_INDEX_1, the value 1c9 is 457 in decimal. You can identify the index defined on the opening_lines table (idx) by querying INFORMATION_SCHEMA.INNODB_SYS_INDEXES for this value (457).

Blogger's Note: The manual's description is not very clear here either. There are two concepts: the "indexed table" and the several "auxiliary index tables" associated with it. The exact association rule does not need to be memorized. The auxiliary table name contains two hexadecimal numbers, in the form 000A_000B, and according to the description above both are used for the association: A is the table_id and B is the index_id.

mysql> SELECT index_id, name, table_id, space from INFORMATION_SCHEMA.INNODB_SYS_INDEXES
       WHERE index_id=457;
+----------+------+----------+-------+
| index_id | name | table_id | space |
+----------+------+----------+-------+
|      457 | idx  |      327 |   283 |
+----------+------+----------+-------+
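
The hexadecimal arithmetic can be checked in MySQL itself. This small sketch only uses the built-in CONV, HEX, LPAD, and LOWER functions with the IDs from the example above.

```sql
-- Hex fragments from the auxiliary table name -> decimal IDs.
SELECT CONV('147', 16, 10) AS table_id,   -- 327
       CONV('1c9', 16, 10) AS index_id;   -- 457

-- Decimal IDs -> the 16-digit, zero-padded hex strings used in the name.
SELECT LOWER(LPAD(HEX(327), 16, '0')) AS table_part,   -- 0000000000000147
       LOWER(LPAD(HEX(457), 16, '0')) AS index_part;   -- 00000000000001c9
```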

If the main table is stored in its own file-per-table tablespace, these index tables are also stored in their own tablespaces. Otherwise, they are stored in the tablespace where the main table resides.
The other index tables shown in the earlier example are called common index tables. They are used for deletion handling and for storing the internal state of the full-text index. Unlike the auxiliary index tables, which are created for each full-text index, this set of tables is shared by all full-text indexes created on a given table.

When a full-text index is dropped, the FTS_DOC_ID column created for it is retained, because removing the FTS_DOC_ID column would require rebuilding the previously indexed table.

  • FTS_*_DELETED and FTS_*_DELETED_CACHE
    These two tables store the document IDs (DOC_ID) of documents that have been deleted from the main table but whose data has not yet been removed from the full-text index. FTS_*_DELETED_CACHE is the in-memory version (cache) of FTS_*_DELETED.
  • FTS_*_BEING_DELETED and FTS_*_BEING_DELETED_CACHE
    These two tables store the document IDs (DOC_ID) of documents that have been deleted from the main table and whose data is currently being removed from the full-text index. Again, the latter is the in-memory version of the former.
  • FTS_*_CONFIG
    Stores the internal state of the full-text index. Most importantly, it stores FTS_SYNCED_DOC_ID, which identifies the documents that have been tokenized and flushed to disk. During crash recovery, this ID is used to identify the documents that had not been flushed, so that they can be parsed again and added back to the full-text index cache. The data can be viewed by querying INFORMATION_SCHEMA.INNODB_FT_CONFIG. (Blogger's Note: the logic here seems questionable?)
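
A short sketch of how the CONFIG data can be inspected. It relies on the innodb_ft_aux_table system variable, which the text above does not mention: the INFORMATION_SCHEMA full-text views report on whichever table this variable names. The schema name test is taken from the earlier example.

```sql
-- Point the INFORMATION_SCHEMA full-text views at the indexed table (db_name/table_name).
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

-- Internal state of the full-text index, including the synced DOC_ID.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_CONFIG;
```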

4.3 Full-text index cache

When a document is inserted, it is tokenized, and each word and its associated data are inserted into the full-text index. Even for a small document, this results in numerous small insertions into the auxiliary index tables, which makes concurrent access to those tables a point of contention and slows down writes. To avoid this problem, InnoDB uses a full-text index cache to hold the index insertions of recently inserted rows. The cache keeps accumulating insertions until it is full and then flushes them in a batch to disk (to the auxiliary index tables). The cached data can be viewed by querying INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE.
The cache also avoids multiple insertions of the same word, which minimizes duplicate entries.

The innodb_ft_cache_size variable sets the size limit of the full-text index cache for each table.
The innodb_ft_total_cache_size variable sets a global limit on the full-text index cache size across all tables.

Note that the cache holds only the tokenized data of recently inserted rows; a query does not load index data that has already been flushed to disk back into the cache. Each query therefore reads the auxiliary index tables on disk directly and merges those results with the results from the cache before returning.
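
The split between cached and flushed data can be observed with the two INFORMATION_SCHEMA views. This is a sketch that assumes the opening_lines table from section 4.2 in the test schema, plus the innodb_ft_aux_table variable, which selects the table these views report on. The inserted row is sample data.

```sql
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

INSERT INTO opening_lines (opening_line, author, title)
VALUES ('Call me Ishmael.', 'Herman Melville', 'Moby-Dick');

-- Tokens of recently inserted rows that are still held in the cache.
SELECT word, doc_id, position
  FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;

-- Tokens that have already been flushed to the auxiliary index tables.
SELECT word, doc_id, position
  FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE;

-- Current cache limits (per table and global).
SHOW VARIABLES LIKE 'innodb_ft_cache_size';
SHOW VARIABLES LIKE 'innodb_ft_total_cache_size';
```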

4.4 DOC_ID and FTS_DOC_ID

InnoDB uses a unique document identifier, called the DOC_ID, to map a word in the full-text index to the document records in which that word occurs. The mapping requires an FTS_DOC_ID column on the indexed table. If no such column is defined, InnoDB automatically adds a hidden FTS_DOC_ID column when the full-text index is created.
For example, the following table does not define this column:

mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;

When you create a full-text index on this table with CREATE FULLTEXT INDEX, a warning reports that InnoDB is rebuilding the table to add the FTS_DOC_ID column:

mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+

Adding a full-text index with ALTER TABLE produces the same warning, whereas no warning is issued when the full-text index is defined in the CREATE TABLE statement (InnoDB then adds the hidden FTS_DOC_ID column silently). Defining the FTS_DOC_ID column at CREATE TABLE time is cheaper than creating a full-text index on a table that is already loaded with data, because the table does not have to be rebuilt to add the column. Under normal circumstances, of course, this cost can be ignored. If you define the column yourself, it must be declared as BIGINT UNSIGNED NOT NULL and named FTS_DOC_ID, in all uppercase. AUTO_INCREMENT is not required, but it can make loading data easier. For example:

mysql> CREATE TABLE opening_lines (
       FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;

If you define this column yourself, you are responsible for keeping its values correct: they must not be empty or duplicated. You can optionally create a unique index named FTS_DOC_ID_INDEX (all uppercase) on the FTS_DOC_ID column.

mysql> CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on opening_lines(FTS_DOC_ID);

This is not required, however, because InnoDB creates the index automatically if it is missing. Before MySQL 5.7.13, the maximum permitted gap between the largest used FTS_DOC_ID value and a new FTS_DOC_ID value was 10000. In MySQL 5.7.13 and later, the permitted gap is 65535. (The blogger did not fully understand this sentence, but it does not seem to need much attention.)

To avoid rebuilding the table, the FTS_DOC_ID column is not removed even if the full-text index is later dropped.

4.5 Deletion processing of full-text index (optimization)

Just as inserting a record causes many small updates to the index tables and hurts performance, so does deleting one. To optimize this, InnoDB records the DOC_ID of each deleted record in the FTS_*_DELETED table described above and leaves the index entries in place. Before a query returns its results, the DOC_IDs found in the auxiliary index tables are filtered against the FTS_*_DELETED table. The advantage is that deletion is fast and cheap. The disadvantage is that the index data is not removed immediately when a record is deleted. To remove the index entries of deleted records, first set innodb_optimize_fulltext_only=ON and then run OPTIMIZE TABLE on the indexed table. For more on optimizing full-text indexes, see the manual.
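
The deletion path can be followed step by step. This is a sketch that assumes the opening_lines table from section 4.2 in the test schema, and a row with id = 1 that exists only for illustration. innodb_ft_aux_table selects the table that the INFORMATION_SCHEMA view reports on.

```sql
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

-- Deleting the row only logs its DOC_ID; the index entries stay in place.
DELETE FROM opening_lines WHERE id = 1;
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DELETED;

-- Make OPTIMIZE TABLE process only the full-text index, then purge the stale entries.
SET GLOBAL innodb_optimize_fulltext_only = ON;
OPTIMIZE TABLE opening_lines;

-- Restore the default afterwards so that OPTIMIZE TABLE behaves normally again.
SET GLOBAL innodb_optimize_fulltext_only = OFF;
```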

4.6 Transaction processing of full-text indexing

InnoDB full-text indexes have special transaction handling characteristics because of their caching and batch processing. Specifically, updates and inserts on a full-text index are processed when the transaction commits, which means that a full-text search can see only committed data.

(The manual's demo is omitted here.)
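
For readers who want to try this themselves, the following is a minimal sketch of such a demo. It assumes the opening_lines table and its FULLTEXT index idx from section 4.2; the inserted row is sample data.

```sql
BEGIN;

INSERT INTO opening_lines (opening_line, author, title)
VALUES ('Call me Ishmael.', 'Herman Melville', 'Moby-Dick');

-- Inside the open transaction the new row has not been added to the
-- full-text index yet, so this search is expected not to count it.
SELECT COUNT(*) FROM opening_lines
 WHERE MATCH (opening_line) AGAINST ('Ishmael');

COMMIT;

-- After the commit, the same search is expected to include the new row.
SELECT COUNT(*) FROM opening_lines
 WHERE MATCH (opening_line) AGAINST ('Ishmael');
```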

4.7 Monitoring full-text indexes

Full-text indexes can be monitored through the following tables in the INFORMATION_SCHEMA database:

  • INNODB_FT_CONFIG
  • INNODB_FT_INDEX_TABLE
  • INNODB_FT_INDEX_CACHE
  • INNODB_FT_DEFAULT_STOPWORD
  • INNODB_FT_DELETED
  • INNODB_FT_BEING_DELETED

You can also use INNODB_SYS_INDEXES and INNODB_SYS_TABLES to view basic information about full-text indexes.

5. Extensions (unofficial content)

The content here is quoted from link

A stopword list contains words that are not tokenized or indexed. The InnoDB storage engine has a default stopword list, exposed as information_schema.INNODB_FT_DEFAULT_STOPWORD, which contains 36 stopwords by default.
Users can also supply their own stopword list through the innodb_ft_server_stopword_table parameter:

SHOW GLOBAL VARIABLES LIKE 'innodb_ft_server_stopword_table';

SET GLOBAL innodb_ft_server_stopword_table = 'db_name/table_name';
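
A slightly fuller sketch of a custom stopword list follows. The table test.my_stopwords and the word in it are illustrative assumptions. The column must be a single VARCHAR column named value, and the table must exist before the variable is set.

```sql
-- The stopword table must be an InnoDB table with one VARCHAR column named value.
CREATE TABLE test.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
INSERT INTO test.my_stopwords (value) VALUES ('ishmael');

-- The format is 'db_name/table_name'.
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- Full-text indexes created or rebuilt after this point use the new list.
-- The built-in default list can be inspected with:
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
```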

The full-text index of the current InnoDB storage engine also has the following limitations:

  • The quoted source states that each table can have only one full-text index; in MySQL 5.7, however, an InnoDB table can have more than one FULLTEXT index;
  • A full-text index that spans multiple columns requires all of those columns to use the same character set and collation;
  • Languages without word delimiters, such as Chinese, Japanese, and Korean, are not supported by the default parser. The ngram full-text parser, available as of MySQL 5.7.6, supports word segmentation for Chinese, Japanese, and Korean.
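
As a rough illustration of the ngram parser mentioned in the last item, the sketch below assumes MySQL 5.7.6 or later. The table articles_cjk, the index ft_cjk, and the sample row are made up for the example.

```sql
-- Make sure the client connection can carry the CJK sample text.
SET NAMES utf8mb4;

-- Hypothetical table whose FULLTEXT index uses the ngram parser.
CREATE TABLE articles_cjk (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT KEY ft_cjk (title, body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;

-- Sample row: title "database management", body "in this tutorial I show how to manage a database".
INSERT INTO articles_cjk (title, body) VALUES
  ('数据库管理', '在本教程中我将向你展示如何管理数据库');

-- Search for the word "database".
SELECT id, title
  FROM articles_cjk
 WHERE MATCH (title, body) AGAINST ('数据库' IN NATURAL LANGUAGE MODE);

-- The n-gram length is controlled by a startup option.
SHOW VARIABLES LIKE 'ngram_token_size';
```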

Origin blog.csdn.net/sc_lilei/article/details/120263963