MySQL Chinese full-text search: from getting started to giving up

A LIKE query with wildcards on both sides ('%keyword%') cannot use an index, which has always been a thorny problem in SQL tuning. Can MySQL full-text search really solve it?

Background

Recently, I encountered a query optimization problem in my work. The simplified SQL is as follows:

SELECT
	* 
FROM
	wxswj_nsrxx 
WHERE
	nsrmc LIKE '%东鹏%' 
	OR nsrsbh LIKE '%东鹏%' 
	OR shxydm LIKE '%东鹏%';

Problems:
1. LIKE patterns with both a leading and a trailing wildcard are used
2. The conditions are combined with OR

Obviously, such a query cannot use an index, and because the table is very large (more than 5 million rows), the query responds very slowly.

Chinese full-text search in practice

Documentation for the ngram full-text parser:
https://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html

1. Optimization idea:
Fuzzy matching on Chinese text comes down to word segmentation and full-text retrieval, and MySQL has an index type for exactly that: the FULLTEXT index. So I wanted to solve the '%keyword%' query problem in MySQL with a full-text index.

2. Description:
Before MySQL 5.7.6, the built-in full-text parser relied on whitespace to find word boundaries, so it worked for English but not for Chinese. Chinese text had to be pre-segmented into words by an external tokenizer before being stored in the database.
Starting from MySQL 5.7.6, MySQL has a built-in ngram full-text parser that supports Chinese word segmentation.
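
To make the idea of ngram segmentation concrete, here is a minimal Python sketch of cutting a string into overlapping tokens of n characters. It assumes n = 2, the documented default of ngram_token_size. The function name ngram_tokens is made up for illustration, and the sketch ignores the space and stopword handling that the real parser performs.

```python
def ngram_tokens(text, n=2):
    """Split text into overlapping n-character tokens (illustrative only)."""
    # Slide a window of n characters over the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# A 2-character keyword is exactly one token.
print(ngram_tokens("东鹏"))
# Longer text produces one token per adjacent character pair.
print(ngram_tokens("畅捷通讯"))
```

A string of k characters yields k - 1 two-character tokens, so both the index and a parsed keyword grow with text length.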

3. View the current database version:

select version() from dual;

The result is 5.7.28, so this server supports Chinese full-text search.

4. Restrictions on full-text search:
FULLTEXT indexes can only be created on CHAR, VARCHAR, or TEXT columns.
A table can have more than one FULLTEXT index, but on InnoDB the column list in MATCH() must exactly match the column list of one of them.
All columns in a multi-column FULLTEXT index must use the same character set and collation.

5. Turn off the query cache
Before testing a SQL optimization, the query cache is generally turned off so that cached results do not mask the real query time:
SHOW VARIABLES LIKE 'query_cache%';
set global query_cache_size=0;
set global query_cache_type=0;

SHOW VARIABLES LIKE 'query_cache%';

6. Create a full-text index

ALTER TABLE `wxswj`.`wxswj_nsrxx`  ADD FULLTEXT INDEX `ft_index`(`nsrmc`,`nsrsbh`,`shxydm`) WITH PARSER ngram;

7. Use the full-text index
The full-text index is used through the **MATCH (col1,col2,...) AGAINST (expr [search_modifier])** syntax.

SELECT
	* 
FROM
	wxswj_nsrxx 
WHERE
	MATCH ( `nsrmc`, `nsrsbh`, `shxydm` ) AGAINST ( '东鹏' IN BOOLEAN MODE );

Here the keyword 东鹏 is matched against the three columns nsrmc, nsrsbh, and shxydm; a record is returned if any of these columns contains 东鹏.

8. Query execution plan
The query uses the newly created multi-column full-text index, and ref reaches the const level.

9. Optimization effect
Query performance improved by more than 100 times.

Pitfalls

So far everything looked very good, but the pitfalls soon appeared.
When the query keyword is long, an exception occurs.

Problem 1: FTS query exceeds result cache limit
When a relatively long keyword is used in the query, or even just in EXPLAIN for that query, an exception occurs:

188 - FTS query exceeds result cache limit

The exception is discussed on the MySQL bug tracker:
https://bugs.mysql.com/bug.php?id=86036

innodb_ft_result_cache_limit is the InnoDB full-text search query result cache limit, defined in bytes and applied per full-text search query or per thread. Intermediate and final InnoDB full-text search query results are handled in memory. The limit prevents excessive memory consumption when a full-text search produces a very large result (for example, millions or hundreds of millions of rows). If the limit is reached, an error is returned indicating that the query exceeds the maximum allowed memory.

Recommended solutions:
1. Increase the value of innodb_ft_result_cache_limit, for example to about 4 GB (the variable cannot exceed 4294967295 bytes):

SHOW VARIABLES LIKE 'innodb_ft_result_cache_limit%';
set global innodb_ft_result_cache_limit=4000000000;

2. Optimize the query statement: limit the number of records the query returns so that the intermediate results take up less of the cache. This is generally done by specifying an explicit LIMIT.

Problem 2: The query speed is very unstable
By raising innodb_ft_result_cache_limit, we got rid of the cache limit exception.
However, when we tried different query keywords, we found that query performance was very unstable.
Sometimes the query is very fast, and sometimes it is even slower than the LIKE '%keyword%' query.
The problem is especially obvious when the keyword is long; query performance is not guaranteed at all.

SELECT
	* 
FROM
	wxswj_nsrxx 
WHERE
	MATCH ( `nsrmc`, `nsrsbh`, `shxydm` ) AGAINST ( '中国航天工业科学技术咨询有限公司' IN BOOLEAN MODE );

Giving up

After going through all kinds of material, I did not find a better solution and finally, reluctantly, chose to give up.

Test statements

create table test(
id int(11) not null primary key auto_increment,
name varchar(100) not null comment '工商名',
brand varchar(100) default null comment '品牌名',
en varchar(100) default null comment '英文名',
fulltext key (name,brand,en) with parser ngram
)engine=innodb default charset=utf8;
insert into test (name,brand,en) values ('芜湖美的厨卫电气制造有限公司','aa','wh');
insert into test (name,brand,en) values ('北京凡客尚品电子商务有限公司','aa','ef');
insert into test (name,brand,en) values ('凡客诚品(北京)科技有限公司','aa','dfd');
insert into test (name,brand,en) values ('瞬联讯通科技(北京)有限公司','aa','sdfs');
insert into test (name,brand,en) values ('北京畅捷通讯有限公司','aa','wsdh');
insert into test (name,brand,en) values ('北京畅捷通支付技术有限公司','aa','df');
insert into test (name,brand,en) values ('畅捷通信息技术股份有限公司','aa','whdfgh');
insert into test (name,brand,en) values ('北京畅捷科技有限公司','aa','dgdf');
insert into test (name,brand,en) values ('中国航天工业科学技术咨询有限公司','aa','whffgh');
insert into test (name,brand,en) values ('北京·松下彩色显象管有限公司','aa','wfghfgh');
insert into test(name,brand,en) select name,brand,en from test;
insert into test(name,brand,en) select name,brand,en from test;
insert into test(name,brand,en) select name,brand,en from test;
insert into test(name,brand,en) select name,brand,en from test;
insert into test(name,brand,en) select name,brand,en from test;
insert into test(name,brand,en) select name,brand,en from test;

EXPLAIN  SELECT  *  from  test  where  match  (name,brand,en)  against  ('通讯录' IN BOOLEAN MODE) LIMIT 100;

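The total amount of test data is 655360 rows. Each INSERT ... SELECT above doubles the table, so reaching this size from the 10 seed rows takes 16 runs (10 × 2^16 = 655360); the six runs listed above give only 640 rows.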
select count(*) from test;

SELECT  *  from  test  where name like '%美的%' or brand like '%美的%' or en like '%美的%';
Elapsed time: 0.544

EXPLAIN  SELECT  *  from  test  where  match  (name,brand,en)  against  ('美的' IN BOOLEAN MODE) LIMIT 100;
Elapsed time: 0.150



SELECT  *  from  test  where name like '%芜湖美的厨卫电气制造有限公司%' or brand like '%芜湖美的厨卫电气制造有限公司%' or en like '%芜湖美的厨卫电气制造有限公司%';
Elapsed time: 0.679

EXPLAIN  SELECT  *  from  test  where  match  (name,brand,en)  against  ('芜湖美的厨卫电气制造有限公司' IN BOOLEAN MODE) LIMIT 100;
Elapsed time: 5.626

Wrapping the keyword in double quotation marks requests an exact phrase search, so the keyword is matched as a whole phrase rather than as separately segmented words. Let's test:

 SELECT  *  from  test  where  match  (name,brand,en)  against  ('"芜湖美的厨卫电气制造有限公司"' IN BOOLEAN MODE) LIMIT 100;
Elapsed time: 5.626

This made no difference to query performance.

The experiments show that the longer the keyword, the slower the query.
You can run the tests and see for yourself.
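
One plausible contributing factor, assuming the default ngram_token_size of 2: the MySQL manual states that in boolean mode a search term is converted into an ngram phrase search, so a longer keyword means more tokens to look up and combine. The sketch below is plain Python, not MySQL's implementation, and the helper name bigrams is made up. It counts the 2-character tokens in the two test keywords and checks how many of the ten seed names contain each token of the long keyword.

```python
def bigrams(text):
    """Overlapping 2-character tokens, mimicking ngram_token_size = 2."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The ten seed names from the test table above.
names = [
    "芜湖美的厨卫电气制造有限公司",
    "北京凡客尚品电子商务有限公司",
    "凡客诚品(北京)科技有限公司",
    "瞬联讯通科技(北京)有限公司",
    "北京畅捷通讯有限公司",
    "北京畅捷通支付技术有限公司",
    "畅捷通信息技术股份有限公司",
    "北京畅捷科技有限公司",
    "中国航天工业科学技术咨询有限公司",
    "北京·松下彩色显象管有限公司",
]

short_kw = "美的"
long_kw = "芜湖美的厨卫电气制造有限公司"

print(len(bigrams(short_kw)), "token(s) in the short keyword")
print(len(bigrams(long_kw)), "tokens in the long keyword")

# For each token of the long keyword, count the seed names containing it.
for token in bigrams(long_kw):
    hits = sum(token in name for name in names)
    print(token, hits)
```

The short keyword is a single token, while the long one splits into 13. In this data the tokens 有限, 限公, and 公司 occur in every seed name, so the candidate sets for those tokens are large even though only one name matches the whole keyword. This describes the test data only; it is not a measurement of what InnoDB does internally.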

Suggestions about using MySQL full-text search are welcome.

Conclusion

This experiment shows that MySQL's support for full-text search is limited: the restrictions are significant and query performance cannot be guaranteed. In many cases it may be no better than a plain LIKE query.
It is worth considering only for small tables with a few hundred thousand rows.
When a '%keyword%' fuzzy query is required on a large table, first discuss with the business side whether a prefix match ('keyword%') would be enough; second, add as many other query conditions as possible and cap the number of matched records with LIMIT.
If the queries are complex, '%keyword%' matching is a hard requirement, and the performance requirements are strict, Elasticsearch is recommended.
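
The prefix-match suggestion above works because an ordinary index keeps its keys sorted, so every key that starts with a given prefix sits in one contiguous range. The Python sketch below illustrates this with a sorted list standing in for the index. It is an analogy under stated assumptions, not MySQL code: the function names are made up, and Python compares strings by code point, whereas MySQL orders keys by the column's collation.

```python
import bisect


def prefix_range(sorted_keys, prefix):
    """Return the keys starting with prefix using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # U+10FFFF is the largest code point, so this bound sorts after
    # any ordinary key that begins with the prefix.
    hi = bisect.bisect_right(sorted_keys, prefix + "\U0010ffff")
    return sorted_keys[lo:hi]


def infix_scan(keys, keyword):
    """A '%keyword%' match has no such range and must check every key."""
    return [key for key in keys if keyword in key]


# A few names from the test data, sorted like index entries.
keys = sorted([
    "北京凡客尚品电子商务有限公司",
    "北京畅捷通讯有限公司",
    "北京畅捷通支付技术有限公司",
    "北京畅捷科技有限公司",
    "畅捷通信息技术股份有限公司",
    "芜湖美的厨卫电气制造有限公司",
])

print(prefix_range(keys, "北京畅捷"))  # found by binary search
print(infix_scan(keys, "畅捷"))        # found only by a full scan
```

The prefix lookup touches only the matching range, while the infix scan inspects every key, which is the same trade-off as LIKE 'keyword%' versus LIKE '%keyword%' on an indexed column.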


Origin blog.csdn.net/w1014074794/article/details/106746114