This article is mostly a manual translation of the InnoDB index chapter of the official MySQL 5.7 manual, so readers can follow along with the official text. Please point out any mistakes. Thank you for reading, and feel free to discuss!
1. InnoDB clustered index and secondary index
Each InnoDB table has a special index called the clustered index, which stores the row data. The clustered index is usually synonymous with the primary key, so a table can have only one. Understanding how InnoDB uses the clustered index helps you get the best performance from queries, inserts, updates, and deletes.
- When you define a primary key for a table, InnoDB will use it as a clustered index.
- If you do not define a primary key, InnoDB uses the first UNIQUE index whose key columns are all NOT NULL as the clustered index.
- If the table has neither a primary key nor a suitable UNIQUE index, InnoDB generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column that holds row ID values. The row ID is a 6-byte field that increases monotonically as rows are inserted, so rows ordered by row ID are physically in insertion order.
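As a rough illustration of the three cases above, the sketch below creates one table per case and then asks INFORMATION_SCHEMA which index InnoDB treats as the clustered index. The table names (t_pk, t_uniq, t_none) and the `test` schema are made up for this example, and querying these tables normally requires the PROCESS privilege.

```sql
-- Hypothetical tables in a schema named `test`.
CREATE TABLE t_pk   (id INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;
CREATE TABLE t_uniq (code INT NOT NULL, v INT, UNIQUE KEY uk_code (code)) ENGINE=InnoDB;
CREATE TABLE t_none (v INT) ENGINE=InnoDB;

-- In MySQL 5.7, TYPE 3 marks a clustered index and
-- TYPE 1 marks the automatically generated GEN_CLUST_INDEX.
SELECT t.NAME AS table_name, i.NAME AS index_name, i.TYPE
FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES AS i
JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLES  AS t ON t.TABLE_ID = i.TABLE_ID
WHERE t.NAME IN ('test/t_pk', 'test/t_uniq', 'test/t_none');
```

If the rules above hold, t_pk should report PRIMARY, t_uniq should report uk_code, and t_none should report GEN_CLUST_INDEX as the clustered index.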
1.1 How clustered indexes speed up queries
A clustered index is, first of all, used like any other index. When InnoDB locates a page through a primary key condition, the index search leads directly to the page that holds the complete row data, so no second lookup is needed and the row can be returned immediately. For large tables, this often saves a disk I/O compared with storage layouts that keep row data on a different page from the index record.
1.2 How to associate the secondary index with the clustered index
The counterpart of the clustered index is the secondary index. Each secondary index record contains the indexed columns plus the primary key columns of the row. A lookup through a secondary index therefore finds the primary key value first and then uses it to search the clustered index for the row, which means two index searches instead of one.
If the primary key is long, every secondary index uses more space, so a short primary key is recommended (weigh this against your needs; as of 2021, storage space is rarely the primary consideration).
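The difference between the two lookup paths can be observed with EXPLAIN. This is a minimal sketch; the `users` table and its columns are hypothetical and exist only for the example.

```sql
-- Hypothetical table: clustered index on id, secondary index on name.
CREATE TABLE users (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)  NOT NULL,
  bio  TEXT,
  KEY idx_name (name)
) ENGINE=InnoDB;

-- Needs only name and id, and both are stored in idx_name,
-- so the clustered index does not have to be visited.
EXPLAIN SELECT id, name FROM users WHERE name = 'alice';

-- Needs bio, which lives only in the clustered index,
-- so each match in idx_name is followed by a primary key lookup.
EXPLAIN SELECT id, name, bio FROM users WHERE name = 'alice';
```

For the first query the Extra column should normally show "Using index", which is how MySQL reports that the secondary index alone covered the query.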
For more on the advantages of clustered and secondary indexes, see the corresponding section of the official manual.
2. The physical structure of the InnoDB index
Except for spatial indexes, which use R-trees, all InnoDB indexes are B-tree structures. R-trees are specialized for indexing multi-dimensional data. In both R-trees and B-trees, index records are stored in the leaf pages. The default index page size is 16KB; it is determined by the innodb_page_size setting when the MySQL instance is initialized.
When new records are inserted into an InnoDB clustered index, InnoDB tries to leave 1/16 of the page free for future inserts and updates of index records. If index records are inserted in sequential order, the resulting index pages are about 15/16 full. If they are inserted in random order, the pages are from 1/2 to 15/16 full. In other words, random insertion makes poorer use of page space and is more likely to cause page splits.
InnoDB performs a bulk load when creating or rebuilding B-tree indexes. This method is called ordered index construction (a "sorted index build" in the English manual). The innodb_fill_factor parameter defines the percentage of space on each B-tree page that is filled during an ordered index build, with the remaining space reserved for future index growth. Ordered index construction is not supported for spatial indexes. It is described in section 3 below. For example, with innodb_fill_factor set to 100, 1/16 of the space in clustered index pages is still left free for future index growth (this 1/16 is always reserved).
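As a small sketch of how these settings can be inspected and changed, the statements below read the page size and fill factor and then lower the fill factor before building an index. The `books` table and the value 80 are arbitrary choices for illustration.

```sql
-- Page size is fixed when the instance is initialized; this only reads it.
SHOW VARIABLES LIKE 'innodb_page_size';

-- Fill factor is a dynamic global variable (default 100).
SHOW VARIABLES LIKE 'innodb_fill_factor';
SET GLOBAL innodb_fill_factor = 80;

-- Hypothetical table; the index below is created with an ordered build,
-- so its pages should be filled to roughly 80 percent.
CREATE TABLE books (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200) NOT NULL
) ENGINE=InnoDB;
CREATE INDEX idx_title ON books (title);
```

A lower fill factor trades disk space for fewer page splits when the index later receives out-of-order inserts.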
If the fill factor of an index page drops below MERGE_THRESHOLD (50% by default, configurable), InnoDB tries to contract the index tree to free the page. MERGE_THRESHOLD applies to both B-tree and R-tree indexes. For more on MERGE_THRESHOLD, see the official manual.
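MERGE_THRESHOLD is set through a COMMENT clause, either for a whole table or for a single index. The following is a sketch assuming a hypothetical table `mt_demo` in the `test` schema; the threshold values 45 and 40 are arbitrary.

```sql
-- Table-level default: applies to every index on the table
-- that does not set its own value.
CREATE TABLE mt_demo (
  id INT NOT NULL PRIMARY KEY,
  a  INT,
  b  INT,
  KEY idx_a (a)
) ENGINE=InnoDB COMMENT='MERGE_THRESHOLD=45';

-- Index-level setting: overrides the table-level value for idx_b only.
CREATE INDEX idx_b ON mt_demo (b) COMMENT 'MERGE_THRESHOLD=40';

-- Check the effective value for each index.
SELECT i.NAME, i.MERGE_THRESHOLD
FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES AS i
JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLES  AS t ON t.TABLE_ID = i.TABLE_ID
WHERE t.NAME = 'test/mt_demo';
```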
3. Ordered index construction
As mentioned above, InnoDB performs a bulk load when creating or rebuilding B-tree indexes. This method is called ordered index construction.
Index building has 3 phases:
- Step 1: Scan the clustered index, generate index entries, and add them to the sort buffer. When the sort buffer is full, the entries are sorted and written to a temporary intermediate file. This process is called a "run".
- Step 2: After one or more runs have been written to the temporary file, perform a merge sort on all entries in the file.
- Step 3: Insert the sorted index entries into the B-tree.
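The three phases run inside the server, so there is nothing to call directly, but an index build can be triggered and the sort buffer size inspected as sketched below. This assumes that the sort buffer in step 1 is the one sized by innodb_sort_buffer_size; the `build_demo` table is hypothetical.

```sql
-- Size of the buffer used to sort index entries during index creation.
-- This variable is read-only at runtime and is set at server startup.
SHOW VARIABLES LIKE 'innodb_sort_buffer_size';

-- Hypothetical table; imagine it already holds many rows.
CREATE TABLE build_demo (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

-- Adding a secondary index scans the clustered index, sorts the
-- entries in runs, merges them, and bulk-loads the new B-tree.
ALTER TABLE build_demo ADD INDEX idx_payload (payload);
```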
Before ordered index construction was introduced, index entries were inserted into the B-tree one record at a time through the insert API. This involved opening a B-tree cursor to find the insert position and then inserting the entry into a B-tree page with an optimistic insert. If the insert failed because the page was full, a pessimistic insert was performed instead, which generally involves splitting and merging B-tree nodes to make room for the entry. The main costs of this top-down method are searching for the insert position and the constant splitting and merging of B-tree nodes.
Ordered index construction works bottom-up instead. The manual's description of this part mainly covers the dynamic process of inserting into the B-tree and is hard to paraphrase, so the original text is quoted here:
With this approach, a reference to the right-most leaf page is held at all levels of the B-tree. The right-most leaf page at the necessary B-tree depth is allocated and entries are inserted according to their sorted order. Once a leaf page is full, a node pointer is appended to the parent page and a sibling leaf page is allocated for the next insert. This process continues until all entries are inserted, which may result in inserts up to the root level. When a sibling page is allocated, the reference to the previously pinned leaf page is released, and the newly allocated leaf page becomes the right-most leaf page and new default insert location.
3.1 Reserve space for future index expansion
This was covered above in the discussion of innodb_fill_factor, so it is omitted here.
3.2 Ordered index construction and full-text index support
Ordered index construction is supported for full-text indexes. Previously, SQL was used to insert entries into a full-text index (the blogger did not fully understand this second sentence, but it is not important here).
3.3 Ordered Index Construction and Redo logging
Redo logging is disabled during an ordered index build. Instead, there is a checkpoint to ensure that the index build can survive an unexpected exit or failure. The checkpoint forces all dirty pages to be written to disk. During an ordered index build, the page cleaner thread is signaled periodically to flush dirty pages so that the checkpoint operation can be processed quickly. Normally, the page cleaner thread flushes dirty pages when the number of clean pages falls below a set threshold. For ordered index builds, dirty pages are flushed promptly to reduce checkpoint overhead and to parallelize I/O and CPU activity.
Blogger's Note: This passage is a relatively general description. Readers need to understand the checkpoint mechanism of redo log first in order to understand it better.
3.4 Ordered Index Construction and Optimizer Statistics
Sorted index builds can cause optimizer statistics to differ from those produced by previous methods of index creation. The difference in statistics, which is not expected to affect workload performance, is due to the difference in the algorithm used to populate the index.
4. InnoDB full-text index
Full-text indexes are created on text-based columns (CHAR, VARCHAR, or TEXT) to speed up queries and DML operations on the data in those columns.
A full-text index is created like any other index: it can be defined in a CREATE TABLE statement or added to an existing table with ALTER TABLE or CREATE INDEX. Full-text searches use the MATCH() ... AGAINST syntax; see the official manual for detailed usage.
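A minimal sketch of creating and querying a full-text index, using a hypothetical `articles` table with made-up rows:

```sql
-- Hypothetical table with a full-text index over two columns.
CREATE TABLE articles (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT (title, body)
) ENGINE=InnoDB;

INSERT INTO articles (title, body) VALUES
  ('MySQL Tutorial', 'DBMS stands for DataBase ...'),
  ('Optimizing MySQL', 'In this tutorial we show ...');

-- The column list in MATCH() must match the full-text index definition.
SELECT id, title
FROM articles
WHERE MATCH (title, body) AGAINST ('tutorial' IN NATURAL LANGUAGE MODE);
```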
4.1 Design of full-text index
InnoDB full-text indexes use an inverted index design. An inverted index stores a list of words and, for each word, a list of the documents in which the word appears. To support proximity searches, position information for each word is also stored, as a byte offset.
4.2 Full-text index table
When creating an InnoDB full-text index, a set of index tables is created, as shown in the following example:
mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200),
       FULLTEXT idx (opening_line)
       ) ENGINE=InnoDB;
mysql> SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_SYS_TABLES
       WHERE name LIKE 'test/%';
+----------+----------------------------------------------------+-------+
| table_id | name                                               | space |
+----------+----------------------------------------------------+-------+
|      333 | test/FTS_0000000000000147_00000000000001c9_INDEX_1 |   289 |
|      334 | test/FTS_0000000000000147_00000000000001c9_INDEX_2 |   290 |
|      335 | test/FTS_0000000000000147_00000000000001c9_INDEX_3 |   291 |
|      336 | test/FTS_0000000000000147_00000000000001c9_INDEX_4 |   292 |
|      337 | test/FTS_0000000000000147_00000000000001c9_INDEX_5 |   293 |
|      338 | test/FTS_0000000000000147_00000000000001c9_INDEX_6 |   294 |
|      330 | test/FTS_0000000000000147_BEING_DELETED            |   286 |
|      331 | test/FTS_0000000000000147_BEING_DELETED_CACHE      |   287 |
|      332 | test/FTS_0000000000000147_CONFIG                   |   288 |
|      328 | test/FTS_0000000000000147_DELETED                  |   284 |
|      329 | test/FTS_0000000000000147_DELETED_CACHE            |   285 |
|      327 | test/opening_lines                                 |   283 |
+----------+----------------------------------------------------+-------+
opening_lines is the table we defined, called the main table, or the indexed table.
The first six index tables make up the inverted index and are called auxiliary index tables. When incoming documents are tokenized, the individual words (also known as tokens) are inserted into the index tables along with position information and the associated DOC_ID. The words are fully sorted and partitioned among the six index tables according to the character set sort weight of the first character of each word.
Blogger's Note: In the full-text index context, MySQL refers to a row as a document (doc), so a DOC_ID plays the role of a row ID. Each document is split into words, and the resulting words are called tokens.
The inverted index is partitioned into six auxiliary index tables to support parallel index creation. By default, two threads tokenize, sort, and insert words and associated data into the index tables. The number of worker threads is controlled by innodb_ft_sort_pll_degree (default 2); consider increasing it when creating full-text indexes on large tables.
As you can see, each auxiliary index table name has the form FTS_<table_id in hex>_<index_id in hex>_INDEX_#. Each auxiliary index table is associated with the indexed table through the hexadecimal table_id embedded in its name.
For example, the table_id of test/opening_lines is 327, which is 0x147 in hexadecimal. As the query results show, the auxiliary index tables associated with it are named FTS_0000000000000147_*_INDEX_#.
The second hexadecimal number in the auxiliary index table name is the index_id of the full-text index. For example, in test/FTS_0000000000000147_00000000000001c9_INDEX_1, the value 1c9 is 457 in decimal. You can identify the index defined on opening_lines (idx) by querying INFORMATION_SCHEMA.INNODB_SYS_INDEXES for this value (457).
Blogger's Note: The manual's description is not very clear here. There are two concepts: the indexed table, and the multiple auxiliary index tables associated with it. There is no need to memorize how they are associated. The two hexadecimal numbers in an auxiliary table name, FTS_<A>_<B>_INDEX_#, are the table_id (A) and the index_id (B), and together they tie the auxiliary table to its table and index.
mysql> SELECT index_id, name, table_id, space from INFORMATION_SCHEMA.INNODB_SYS_INDEXES
       WHERE index_id=457;
+----------+------+----------+-------+
| index_id | name | table_id | space |
+----------+------+----------+-------+
|      457 | idx  |      327 |   283 |
+----------+------+----------+-------+
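The hexadecimal conversions used in this naming scheme can be checked in SQL itself. The second query is a sketch that rebuilds the name prefix from the table_id to list the internal tables of one indexed table; it assumes the test/opening_lines table from the example above.

```sql
-- 327 decimal -> 147 hex, and 1c9 hex -> 457 decimal.
SELECT HEX(327) AS table_id_hex,              -- '147'
       CONV('1c9', 16, 10) AS index_id_dec;   -- '457'

-- List the FTS_* tables that belong to test/opening_lines by
-- rebuilding the 16-digit, zero-padded hex prefix from its TABLE_ID.
-- The underscores act as single-character wildcards in LIKE,
-- which is harmless for this purpose.
SELECT aux.TABLE_ID, aux.NAME
FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES AS t
JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLES AS aux
  ON aux.NAME LIKE CONCAT(SUBSTRING_INDEX(t.NAME, '/', 1), '/FTS_',
                          LPAD(LOWER(HEX(t.TABLE_ID)), 16, '0'), '_%')
WHERE t.NAME = 'test/opening_lines';
```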
If the main table is stored in a file-per-table tablespace, the index tables are created in their own tablespaces. Otherwise, they are stored in the tablespace where the main table resides.
The other index tables shown in the previous example are called common index tables. They are used for deletion handling and for storing the internal state of full-text indexes. Unlike the auxiliary index tables, which are created for each full-text index, this set of tables is shared by all full-text indexes created on a particular table.
When you drop a full-text index, the FTS_DOC_ID column that was created for it is preserved, because dropping the FTS_DOC_ID column would require rebuilding the previously indexed table.
The common index tables are:
- FTS_*_DELETED and FTS_*_DELETED_CACHE: these two tables store the DOC_ID of documents that have been deleted from the main table but whose tokens have not yet been removed from the full-text index. FTS_*_DELETED_CACHE is the in-memory version (cache) of FTS_*_DELETED.
- FTS_*_BEING_DELETED and FTS_*_BEING_DELETED_CACHE: these two tables store the DOC_ID of documents that have been deleted from the main table and whose tokens are currently being removed from the full-text index. Again, the latter is the in-memory version of the former.
- FTS_*_CONFIG: stores the internal state of the full-text index. Most importantly, it stores FTS_SYNCED_DOC_ID, which identifies the documents that have been tokenized and flushed to disk. During crash recovery, this value is used to identify documents that have not been flushed, so that they can be parsed again and added back to the full-text index cache. You can view this data by querying INFORMATION_SCHEMA.INNODB_FT_CONFIG. (Blogger's Note: the logic here seems questionable?)
4.3 Full-text index cache
When a document is inserted, it is tokenized, and each word and its associated data are inserted into the full-text index. Even for small documents, this can result in numerous small insertions into the auxiliary index tables, which makes writes slow and turns concurrent access to those tables into a point of contention. To avoid this problem, InnoDB uses a full-text index cache to temporarily hold index insertions for recently inserted rows. The in-memory cache keeps accepting insertions until it is full and then flushes them in batches to disk (to the auxiliary index tables). You can view the tokenized data for recently inserted rows by querying INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE.
This cache also avoids multiple insertions of the same word, which minimizes duplicate entries.
The innodb_ft_cache_size variable sets the size limit of the full-text index cache for each table.
The innodb_ft_total_cache_size variable sets the global size limit of the full-text index cache across all tables.
One thing to note is that the cache only holds tokenized data for recently inserted rows. Data that has already been flushed to the auxiliary index tables is not loaded back into the cache when it is queried. Each query therefore reads the auxiliary index tables on disk directly and merges those results with the results from the full-text index cache before returning them.
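To look at the cache in practice, a sketch along the following lines can be used. It assumes the opening_lines table from section 4.2 exists in the `test` schema; the inserted row is made up for the example.

```sql
-- Both cache size limits (per table and global).
SHOW VARIABLES LIKE 'innodb_ft%cache_size';

-- The INNODB_FT_* monitoring tables report on the table named here.
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

INSERT INTO opening_lines (opening_line, author, title)
VALUES ('Call me Ishmael.', 'Herman Melville', 'Moby-Dick');

-- Tokens of recently inserted rows that have not been flushed yet.
SELECT WORD, DOC_COUNT, DOC_ID, POSITION
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
LIMIT 10;
```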
4.4 DOC_ID and FTS_DOC_ID
InnoDB uses a unique document identifier, called the DOC_ID, to map each word in a full-text index to the document records in which that word occurs. The mapping requires an FTS_DOC_ID column on the indexed table. If this column is not defined, InnoDB automatically adds a hidden FTS_DOC_ID column when the full-text index is created.
For example, the following table does not define this column:
mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;
When you then create a full-text index with CREATE FULLTEXT INDEX, a warning is returned, reporting that InnoDB is rebuilding the table to add the FTS_DOC_ID column:
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.19 sec)
Records: 0 Duplicates: 0 Warnings: 1
mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+
Similarly, using ALTER TABLE to add a full-text index produces the same warning, but no warning is raised when the full-text index is defined in the CREATE TABLE statement. Defining the FTS_DOC_ID column at CREATE TABLE time is cheaper than creating a full-text index on a table that is already loaded with data, because the table does not have to be rebuilt to add the column. Under normal circumstances this cost can be ignored, and you can let InnoDB create the column for you. If you want to define the column yourself, it must be of type BIGINT UNSIGNED NOT NULL and must be named FTS_DOC_ID, in all uppercase. AUTO_INCREMENT is not required, but it can make loading data easier. For example:
mysql> CREATE TABLE opening_lines (
       FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;
If you define this column yourself, you are responsible for managing its values: they must never be empty or duplicated. Optionally, you can create a unique index named FTS_DOC_ID_INDEX on this column:
mysql> CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on opening_lines(FTS_DOC_ID);
This is not required, however, because InnoDB creates the index automatically if you do not. Prior to MySQL 5.7.13, the maximum permitted gap between the largest used FTS_DOC_ID value and a new FTS_DOC_ID value was 10000. In MySQL 5.7.13 and later, the permitted gap is 65535. (The blogger did not fully understand this sentence, but it does not seem to matter much.)
To avoid rebuilding the table, the FTS_DOC_ID column is retained even if the full-text index is dropped later.
4.5 Deletion processing of full-text index (optimization)
Just as inserting a record can trigger many small updates to the index tables and hurt performance, so can deleting one. To avoid this, whenever a record is deleted from the indexed table, InnoDB logs its DOC_ID in the FTS_*_DELETED table mentioned above and leaves the indexed entries in the full-text index. Before a query returns its results, the information in the FTS_*_DELETED table is used to filter out deleted DOC_IDs. The advantage of this design is that deletions are fast and cheap. The disadvantage is that index data is not removed immediately after a record is deleted, so the index does not shrink right away. To remove the index entries of deleted records, first set innodb_optimize_fulltext_only=ON and then run OPTIMIZE TABLE on the indexed table. For more on optimizing full-text indexes, see the official manual.
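The cleanup procedure can be sketched as follows, assuming the opening_lines table from section 4.2. Resetting the variable afterwards is a suggestion of this example, so that later OPTIMIZE TABLE runs behave normally again.

```sql
-- Restrict OPTIMIZE TABLE to full-text index maintenance only.
SET GLOBAL innodb_optimize_fulltext_only = ON;

-- Removes index entries for deleted documents from the full-text index.
-- On large indexes, several runs may be needed to process all words.
OPTIMIZE TABLE opening_lines;

-- Restore the default behavior.
SET GLOBAL innodb_optimize_fulltext_only = OFF;
```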
4.6 Transaction processing of full-text indexing
Because of their caching and batch processing behavior, full-text indexes have special transaction handling characteristics. Specifically, updates and inserts on a full-text index are processed at transaction commit time, which means that a full-text search can only see committed data.
The manual's demo is omitted here.
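As a rough stand-in for the omitted demo, the following sketch shows the behavior described above. It assumes the opening_lines table from section 4.2 with its full-text index on opening_line, and that no existing row contains the word "Ishmael".

```sql
BEGIN;

INSERT INTO opening_lines (opening_line, author, title)
VALUES ('Call me Ishmael.', 'Herman Melville', 'Moby-Dick');

-- Still inside the transaction: the full-text index has not been
-- updated yet, so this search should not find the new row.
SELECT COUNT(*) FROM opening_lines
WHERE MATCH (opening_line) AGAINST ('Ishmael');

COMMIT;

-- After the commit the index has been updated,
-- so the same search should now find the row.
SELECT COUNT(*) FROM opening_lines
WHERE MATCH (opening_line) AGAINST ('Ishmael');
```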
4.7 Monitoring full-text indexes
Full-text indexes can be monitored through the following INFORMATION_SCHEMA tables:
- INNODB_FT_CONFIG
- INNODB_FT_INDEX_TABLE
- INNODB_FT_INDEX_CACHE
- INNODB_FT_DEFAULT_STOPWORD
- INNODB_FT_DELETED
- INNODB_FT_BEING_DELETED
You can also query INNODB_SYS_INDEXES and INNODB_SYS_TABLES to view basic information about full-text indexes.
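A short sketch of how these tables are typically queried, assuming the test/opening_lines table from section 4.2. Most of them report on whichever table is named in innodb_ft_aux_table, so that variable is set first.

```sql
-- Select the indexed table that the INNODB_FT_* tables should describe.
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

-- Internal state of the full-text index, including the synced DOC_ID.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_CONFIG;

-- Tokens already flushed to the auxiliary index tables on disk.
SELECT WORD, DOC_COUNT, DOC_ID, POSITION
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE
LIMIT 10;

-- DOC_IDs of deleted rows that are still present in the index.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DELETED;
```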
5. Extensions (unofficial content)
The content here is quoted from link
A stopword list contains words that are not tokenized or indexed. The InnoDB storage engine has a default stopword list, stored in information_schema.INNODB_FT_DEFAULT_STOPWORD, which contains 36 stopwords by default.
In addition, users can specify a custom stopword table through the innodb_ft_server_stopword_table parameter:
SHOW GLOBAL VARIABLES LIKE 'innodb_ft_server_stopword_table';
SET GLOBAL innodb_ft_server_stopword_table = 'db_name/table_name';
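A fuller sketch of the custom stopword workflow is shown below. The my_stopwords table and the `test` schema are hypothetical, and the example assumes opening_lines already has a full-text index named idx. A stopword table is expected to be an InnoDB table with a single VARCHAR column named value.

```sql
-- Hypothetical stopword table in the `test` schema.
CREATE TABLE my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
INSERT INTO my_stopwords (value) VALUES ('Ishmael');

SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- The stopword list is applied when the index is built,
-- so the existing full-text index is recreated here.
ALTER TABLE opening_lines DROP INDEX idx;
CREATE FULLTEXT INDEX idx ON opening_lines (opening_line);
```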
The full-text index of the current InnoDB storage engine also has the following limitations:
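- Each table can only have one full-text index, according to the quoted source. This appears to be outdated, since the MySQL 5.7 manual describes the common index tables as shared by all full-text indexes created on a table;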
- All columns in a multi-column full-text index must use the same character set and collation;
- Languages without word delimiters, such as Chinese, Japanese, and Korean, are not supported by the default parser. The ngram full-text parser, available as of MySQL 5.7.6, supports word segmentation for Chinese, Japanese, and Korean.
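For the last point, a minimal sketch of the ngram parser in use. The articles_cjk table and its rows are made up for the example, and the client connection is assumed to use utf8mb4.

```sql
SET NAMES utf8mb4;

-- Token length used by the ngram parser (2 by default).
SHOW VARIABLES LIKE 'ngram_token_size';

-- Hypothetical table whose full-text index uses the ngram parser.
CREATE TABLE articles_cjk (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT (title, body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;

INSERT INTO articles_cjk (title, body) VALUES
  ('数据库管理', '在本教程中我将向你展示如何管理数据库');

SELECT id, title
FROM articles_cjk
WHERE MATCH (title, body) AGAINST ('数据库' IN NATURAL LANGUAGE MODE);
```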