"MySQL Series - InnoDB Engine 37" Index and Algorithm - Full Text Search

Full Text Search

1 Overview

Because of how a B+ tree is organized, it can serve lookups by a prefix of the indexed column. The following query, for example, can use a B+ tree index: as long as the name column has one, the index quickly finds the names that start with XXX.

select * from table where name like 'XXX%';

The following query, however, is not a good fit for a B+ tree index: even with a B+ tree index on the column, it still requires a full table scan.

select * from table where name like '%XXX%';
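The difference between the two patterns can be sketched outside the database. The following is a minimal illustration, not how InnoDB is implemented: a sorted Python list stands in for the ordered keys of a B+ tree index, and the sample names and function names are assumptions made up for this example.

```python
from bisect import bisect_left

# Hypothetical sample data standing in for an indexed name column;
# the sorted list plays the role of the ordered keys in a B+ tree.
names = sorted(["alice", "albert", "bob", "carol", "malcolm", "sal"])

def prefix_search(sorted_names, prefix):
    """Range scan: jump to the first key >= prefix, stop at the first non-match."""
    start = bisect_left(sorted_names, prefix)
    result = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # keys are sorted, so no later key can match
        result.append(name)
    return result

def substring_search(sorted_names, fragment):
    """The ordering does not help here: every key has to be examined."""
    return [name for name in sorted_names if fragment in name]

prefix_matches = prefix_search(names, "al")        # like 'al%'
substring_matches = substring_search(names, "al")  # like '%al%'
```

The prefix search touches only the keys in one contiguous range, while the substring search has to visit every key, which is the analogue of the full table scan.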

In practice, though, such queries are common. A search engine, for example, runs a full-text search over the keywords a user enters, which a B+ tree index cannot serve well. This calls for a different retrieval method: full-text search.

Full-text search is a technique for finding arbitrary content within an entire book or article stored in the database. It can retrieve the relevant chapters, sections, paragraphs, sentences, and words from the full text as needed, and can also support various statistics and analyses.

Initially, the InnoDB storage engine did not support full-text search. A table that needed it had to use the MyISAM storage engine, at the cost of InnoDB's transaction support. Starting with InnoDB 1.2.x, InnoDB supports full-text search as well.

2 Inverted index

Full-text search is usually implemented with an inverted index. Like a B+ tree, an inverted index is an index structure. It uses an auxiliary table to store the mapping between each word and its positions in one or more documents. It is usually implemented as an associative array and comes in two forms:

  • inverted file index, of the form {word, IDs of the documents that contain the word}
  • full inverted index, of the form {word, (ID of a document that contains the word, position within that document)}

For example, suppose the table stores the following content, where DocumentId is the ID of the document to be searched and text is the stored content on which the user runs full-text searches:

[Image not available: example table with DocumentId and text columns]

For the inverted file index, the associative array stores the following:

[Image not available: inverted file index for the example table]

The word code appears in documents 1 and 4, so the corresponding stored content can be found through DocumentId 1 and 4. The inverted file index stores only document IDs, whereas the full inverted index stores pairs of the form (DocumentId, Position), so its table looks as follows:

[Image not available: full inverted index for the example table]

The full inverted index also stores where each word occurs. For example, the word code appears at (1:6), meaning that the sixth word of document 1 is code. A full inverted index therefore takes up more space, but it locates data more precisely.
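To make the two representations concrete, here is a minimal Python sketch that builds both from a handful of documents. The sample documents, the simple tokenizer, and all function names are assumptions chosen for illustration; they are not the data from the original figures, and this is not how InnoDB tokenizes text.

```python
import re
from collections import defaultdict

# Hypothetical sample documents (DocumentId -> text), made up for this sketch.
documents = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_file_index(docs):
    """Inverted file index: {word: sorted list of DocumentIds containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def build_full_inverted_index(docs):
    """Full inverted index: {word: list of (DocumentId, position)}, positions from 1."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text), start=1):
            index[word].append((doc_id, position))
    return dict(index)

file_index = build_inverted_file_index(documents)
full_index = build_full_inverted_index(documents)

print(file_index["cold"])  # DocumentIds only
print(full_index["cold"])  # (DocumentId, position) pairs
```

With this sample data the word cold maps to documents 1 and 4 in the inverted file index, and to the pairs (1, 6) and (4, 8) in the full inverted index, which shows the extra space the positions cost and the extra precision they buy.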

3 InnoDB full-text search

The InnoDB storage engine has supported full-text search since version 1.2.x, using the full inverted index form. InnoDB treats each (DocumentId, Position) pair as an "ilist". A full-text search table therefore has two columns, word and ilist, with an index on the word column. Because InnoDB stores Position information in the ilist column, it can perform Proximity Search, which the MyISAM storage engine does not support.
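Why stored positions make proximity search possible can be shown with a small Python sketch. It assumes a simplified definition of distance (the absolute difference between two word positions in the same document), and the index contents and the function name are hypothetical; InnoDB's own distance semantics may differ.

```python
# Hypothetical full inverted index: word -> list of (DocumentId, position).
full_index = {
    "hot": [(1, 3), (4, 4)],
    "cold": [(1, 6), (4, 8)],
    "porridge": [(1, 2), (1, 5), (2, 2)],
}

def proximity_search(index, word_a, word_b, max_distance):
    """Return DocumentIds where word_a and word_b occur within max_distance positions."""
    # Group the positions of word_b by document for quick lookup.
    positions_b = {}
    for doc_id, pos in index.get(word_b, []):
        positions_b.setdefault(doc_id, []).append(pos)

    matches = set()
    for doc_id, pos_a in index.get(word_a, []):
        for pos_b in positions_b.get(doc_id, []):
            if abs(pos_a - pos_b) <= max_distance:
                matches.add(doc_id)
    return sorted(matches)

near_3 = proximity_search(full_index, "hot", "cold", 3)
near_4 = proximity_search(full_index, "hot", "cold", 4)
```

An inverted file index that stores only document IDs could say that both words occur in documents 1 and 4, but it could not tell the two documents apart by distance.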

As mentioned above, an inverted index stores its words in a table, called the Auxiliary Table. To improve the parallel performance of full-text search, InnoDB uses 6 Auxiliary Tables, and currently partitions the words among them according to the Latin encoding of each word.

An Auxiliary Table is a persistent table stored on disk. InnoDB's full-text index also has another important concept, the FTS Index Cache (full-text search index cache), which improves the performance of full-text search.

The FTS Index Cache is a red-black tree sorted by (word, ilist). As a result, after an insert the data in the table itself is already updated, but the corresponding full-text index update, produced by word segmentation, may still sit in the FTS Index Cache without having reached the Auxiliary Table. InnoDB updates the Auxiliary Table in batches rather than on every insert. When a full-text query runs, the entries for the queried word in the FTS Index Cache are first merged into the Auxiliary Table, and only then is the query performed.

This merge is very similar to what the Insert Buffer introduced earlier does. The difference is that the Insert Buffer is a persistent object with a B+ tree structure, whereas the FTS Index Cache is a red-black tree held in memory. Like the Insert Buffer, the FTS Index Cache improves the performance of the InnoDB storage engine, and because it inserts in batches after sorting in the red-black tree, the resulting Auxiliary Table is relatively small.
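The interplay between the cache and the Auxiliary Table can be modeled with a toy Python class. This is a sketch under stated assumptions, not InnoDB's implementation: a plain dictionary that is sorted at flush time replaces the red-black tree, another dictionary stands in for the on-disk Auxiliary Table, and the class name, the whitespace tokenizer, and the flush_threshold parameter are all hypothetical.

```python
from collections import defaultdict

class FullTextIndexSketch:
    """Toy model: an in-memory cache in front of a persistent word -> ilist table."""

    def __init__(self, flush_threshold=4):
        self.auxiliary_table = {}               # stands in for the on-disk Auxiliary Table
        self.cache = defaultdict(list)          # stands in for the FTS Index Cache
        self.cached_entries = 0
        self.flush_threshold = flush_threshold  # hypothetical batch size
        self.flush_count = 0

    def insert_document(self, doc_id, text):
        """Tokenize the text and record (doc_id, position) pairs in the cache only."""
        for position, word in enumerate(text.lower().split(), start=1):
            self.cache[word].append((doc_id, position))
            self.cached_entries += 1
        if self.cached_entries >= self.flush_threshold:
            self.flush()

    def _merge_word(self, word):
        """Move the pending entries of one word from the cache to the auxiliary table."""
        pending = self.cache.pop(word, [])
        if pending:
            self.cached_entries -= len(pending)
            self.auxiliary_table.setdefault(word, []).extend(pending)

    def flush(self):
        """Write all cached words to the auxiliary table as one batch, in sorted order."""
        for word in sorted(self.cache):
            self._merge_word(word)
        self.flush_count += 1

    def query(self, word):
        """Merge pending entries for this word first, then read the auxiliary table."""
        self._merge_word(word)
        return self.auxiliary_table.get(word, [])

index = FullTextIndexSketch(flush_threshold=100)
index.insert_document(1, "nine days old")
table_before_query = dict(index.auxiliary_table)  # still empty: nothing flushed yet
result = index.query("days")                      # merge for "days" happens here
```

After the insert the auxiliary table is still empty, because the new entries live only in the cache. The query merges the entries for the requested word before reading, so it still sees the freshly inserted document.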

Reposted from: blog.csdn.net/m0_51197424/article/details/130044769