A Technical Solution for Fuzzy-Matching Full-Text Search Within a Site



Author: Heiyeluren (heiyeluren)

Date: 2021/1/13

 

Problem scenario for full-text retrieval / full-text search:

For example, consider the following scenario:

In a computer-course training system, we need to take keywords entered by the user, search course names and course introductions, compute how well each course matches, and return the results accordingly; the initial data volume is not large.

 

Roughly like this:

 

Usage scenario:

[Entering the keyword "mysql" should match the following]:

Introduction to the mysql engine innodb (exact match)

mysql introduction (exact match)

Using MySQL correctly (exact match)

sql optimization (partial match, recalled)

sq (discarded)

Full-text search technology has many application scenarios, for example searching for articles in WeChat official accounts.

 

 

Analyzing these scenarios, this is a typical problem: take a keyword entered by the user, fuzzy-match it against the relevant documents or text content, and output the results ranked by weight. This is a classic "full-text retrieval / full-text search" problem. In essence it is the same requirement as everyday in-site search or whole-web search, only the application scenario here is simpler: the data volume is small, so there are more solutions to choose from. Based on a technical analysis, the candidate solutions are shared below.

 

 

[Scheme 1: A simple full-text search implemented entirely by hand]

Recommendation rating: ☆☆☆☆

Programming language: Java/Golang/C++/Python

Applicable scenarios: the data volume is small (say, on the order of 100,000 records), plain string-matching algorithms such as KMP cannot meet the requirements, and you want a quick hands-on way to implement full-text retrieval.

 

a. Store the documents: build the (roughly 500) titles into a list of strings (it can be a container such as a map/set held in memory and traversed with an iterator).

b. Segment the documents and build an inverted index: segment the titles (with an open-source tokenizer) and store the result in an index structure, which can be an inverted list or simply a hashmap (what is stored is the association between each word and the strings above).

c. Handle user queries: after the user enters a query, segment it (with the same open-source tokenizer) and look each word up in the structure from step b; if found, sort the corresponding posting list by weight and map it back to the strings from step a. (This step can simply pull the matches directly, or score them with TF-IDF or BM25; if you want to keep things simple, nothing that elaborate is needed.)

d. Sort and output: sort the weighted result list above and return it to the calling application. (A minimal sketch of steps a-d follows.)
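A minimal Python sketch of steps a-d, assuming the jieba tokenizer recommended below; the titles and the simple TF-IDF-style scoring are illustrative only, not a fixed part of the scheme:

```python
# Minimal in-memory full-text search: tokenize titles with jieba,
# build an inverted index, score queries with a simple TF-IDF weight.
import math
from collections import defaultdict

import jieba  # open-source Chinese tokenizer (recommended below)

# a. Store documents: titles live in a plain in-memory list.
titles = ["mysql引擎innodb介绍", "mysql入门", "正确使用MySQL", "sql优化"]

# b. Segment each title and build an inverted index: word -> {doc_id: term_freq}.
inverted = defaultdict(lambda: defaultdict(int))
doc_lens = []
for doc_id, title in enumerate(titles):
    words = [w.lower() for w in jieba.cut_for_search(title)]
    doc_lens.append(len(words))
    for w in words:
        inverted[w][doc_id] += 1

def search(query, top_n=10):
    # c. Segment the query, look each word up, accumulate TF-IDF-style scores.
    scores = defaultdict(float)
    n_docs = len(titles)
    for w in jieba.cut_for_search(query.lower()):
        postings = inverted.get(w)
        if not postings:
            continue
        idf = math.log(n_docs / (1 + len(postings))) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += (tf / doc_lens[doc_id]) * idf
    # d. Sort by weight and map back to the original titles.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(titles[doc_id], score) for doc_id, score in ranked[:top_n]]

print(search("mysql"))
```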

 

Open-source recommendations:

jieba word segmentation (recommended): https://github.com/fxsjy/jieba

jieba word segmentation usage: https://www.jianshu.com/p/883c2171cdb5

LibMMSeg:https://www.oschina.net/p/libmmseg

SCWS word segmentation: http://www.xunsearch.com/scws/index.php

TF-IDF algorithm: https://blog.csdn.net/zhb_bupt/article/details/40985831

 

 

[Scheme 2: Using MySQL to implement a full-text search engine]

Recommendation rating: ☆☆☆

Programming language: any

Applicable scenarios: medium data volumes, say up to the tens of millions of records, with the data already stored in MySQL; the following approaches can be considered.

 

Implementation 1: MySQL 5.7 + Ngram tokenizer

The FULLTEXT index in older MySQL versions only handled English-style full-text retrieval and could not process Chinese. Newer versions add a built-in ngram full-text parser (available since MySQL 5.7.6) to do the word segmentation, and the full-text search itself is written with the SQL MATCH ... AGAINST syntax.

Disadvantages: the ngram tokenizer is purely mechanical and cannot handle semantics well, but it is good enough for scenarios where the search is done directly in MySQL with SQL (a minimal sketch follows the reference links below).

Reference document 1: https://www.cnblogs.com/xuey/p/11631102.html

Reference document 2: https://blog.csdn.net/weixin_51686373/article/details/109773911
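A minimal sketch of this approach, assuming a hypothetical course table with name and intro columns and the pymysql client; all identifiers are made up for illustration:

```python
# Sketch: FULLTEXT index with MySQL's built-in ngram parser, queried with
# MATCH ... AGAINST. Table and column names are illustrative only.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="***",
                       database="school", charset="utf8mb4")
with conn.cursor() as cur:
    # Build a FULLTEXT index over the course name and introduction,
    # using the ngram parser so that Chinese text gets segmented.
    cur.execute("""
        ALTER TABLE course
        ADD FULLTEXT INDEX ft_course (name, intro) WITH PARSER ngram
    """)
    # Natural-language full-text query, ranked by MySQL's relevance score.
    cur.execute("""
        SELECT name,
               MATCH(name, intro) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score
        FROM course
        WHERE MATCH(name, intro) AGAINST (%s IN NATURAL LANGUAGE MODE)
        ORDER BY score DESC
    """, ("mysql", "mysql"))
    for name, score in cur.fetchall():
        print(name, score)
conn.close()
```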

 

Implementation 2: MySQL + IK tokenizer

Before MySQL 5.6, only the MyISAM engine supported full-text search; from MySQL 5.6 onward the InnoDB engine supports it as well. The relevant columns need a FULLTEXT index.

Before MySQL 5.7.6, only English full-text indexing was supported, not Chinese, so Chinese paragraphs have to be split into words with a tokenizer such as IK before indexing. Compared with Implementation 1, the segmentation is more accurate, and there is no need to depend on an external system such as ES.

Disadvantages: performance is not high, so it is only suitable for small data volumes.
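A minimal sketch of the pre-segmentation idea, using jieba as a stand-in for the Java-based IK tokenizer: the Chinese text is segmented into space-separated words stored in an extra column that MySQL's default full-text parser can then index. The table, the columns, and the extra intro_seg column are assumptions for illustration:

```python
# Sketch: pre-segment Chinese text into space-separated words so that
# MySQL's default (whitespace-based) FULLTEXT parser can index it.
# jieba stands in for the Java-based IK tokenizer; names are illustrative.
import jieba
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="***",
                       database="school", charset="utf8mb4")
with conn.cursor() as cur:
    intro = "MySQL引擎InnoDB介绍与入门"
    # Store the segmented form alongside the original text.
    segmented = " ".join(jieba.cut_for_search(intro))
    cur.execute("INSERT INTO course (intro, intro_seg) VALUES (%s, %s)",
                (intro, segmented))
    conn.commit()
    # Query against the pre-segmented column (which carries a FULLTEXT index).
    q = " ".join(jieba.cut_for_search("mysql 入门"))
    cur.execute("SELECT intro FROM course WHERE MATCH(intro_seg) "
                "AGAINST (%s IN NATURAL LANGUAGE MODE)", (q,))
    print(cur.fetchall())
conn.close()
```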

 

Implementation 3: MySQL + Sphinx engine

This combines MySQL with an open-source search engine: MySQL acts as the data source, while Sphinx handles word segmentation, index storage, and the actual searching.

Disadvantages: installation and deployment are somewhat cumbersome, and the scheme depends on a third-party engine.

Reference document: https://blog.csdn.net/socho/article/details/52251177
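Assuming Sphinx has been configured with MySQL as its data source and searchd exposes its SphinxQL listener (port 9306 by default), queries can be sent over the MySQL wire protocol; the index name below is an assumption:

```python
# Sketch: query a Sphinx index over SphinxQL (the MySQL protocol exposed by
# searchd, typically on port 9306). The index name is illustrative only.
import pymysql

sphinx = pymysql.connect(host="127.0.0.1", port=9306, user="", charset="utf8mb4")
with sphinx.cursor() as cur:
    cur.execute("SELECT id FROM course_index WHERE MATCH(%s) LIMIT 10", ("mysql",))
    ids = [row[0] for row in cur.fetchall()]
sphinx.close()

# The matching ids are then used to fetch the full rows back from MySQL.
print(ids)
```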

 

 

[Scheme 3: Use MongoDB's full-text search engine]

Recommendation rating: ☆☆

Programming language: any

Applicable scenarios: the data volume is relatively small and the data is already stored in MongoDB, so no extra data export or processing is needed.

 

Implementation:

MongoDB 3.4 and later offer partial support for Chinese retrieval. The basic approach is to create a full-text (text) index on a document field and then query that index with find; the results can be sorted by similarity.

Concretely, you build the text index with db.create_index([("metaDataList.title", pymongo.TEXT)]), then query with db.find({ "$text": { "$search": "keywords" } }, { "score": { "$meta": "textScore" } }); the textScore gives a weight that can be used to sort the output, as in the sketch below.
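A minimal pymongo sketch of the above, assuming a hypothetical courses collection with a title field:

```python
# Sketch: MongoDB text index + $text search, sorted by textScore.
# Database, collection, and field names are illustrative only.
import pymongo

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
courses = client["school"]["courses"]

# Build a full-text (text) index on the title field.
courses.create_index([("title", pymongo.TEXT)])

# Search the text index and project the relevance score.
cursor = courses.find(
    {"$text": {"$search": "mysql"}},
    {"title": 1, "score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])

for doc in cursor:
    print(doc["title"], doc["score"])
```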

 

Disadvantages: Chinese support is mediocre, just barely usable; the data volume should not be too large, otherwise performance suffers (queries over tens of millions of records may take more than 10 seconds); and complex weight-based ranking is not possible.

 

Reference document 1: https://www.jianshu.com/p/a3d763b29553

Reference document 2: https://docs.mongoing.com/indexes/text-indexes

 

 

[Scheme 4: Using Lucene/Elasticsearch or Xunsearch/Solr, or a third-party service, for full-text search]

Recommendation rating: ☆☆☆☆

Programming language: any

Applicable scenarios: the data volume is large, say tens of millions to billions of records, and you do not want to do any development yourself, preferring to use something ready-made.

 

Implementation: Elasticsearch is a widely used open-source search engine for logs and other large data stores. Its underlying search functionality is built mainly on Lucene, so the two can also be used independently with similar results. Both can be set up locally, although installation and configuration are somewhat involved. If you want to save the trouble, a third-party service (such as Alibaba Cloud) lets you import the data directly and pull the results remotely. (The whole scheme suits scenarios with large data volumes.)

Disadvantages: installation and deployment are cumbersome, the data has to be loaded in before it can be queried, and the scheme depends on an external service.
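A minimal sketch against Elasticsearch's REST API; the index name and document fields are made up, and the plain requests library is used here instead of an official client:

```python
# Sketch: index a document into Elasticsearch and run a match query via
# the REST API. Index and field names are illustrative only.
import requests

ES = "http://127.0.0.1:9200"

# Index one course document; refresh=true makes it searchable immediately.
requests.put(f"{ES}/courses/_doc/1?refresh=true",
             json={"title": "mysql引擎innodb介绍", "intro": "InnoDB存储引擎入门"})

# Full-text match query on the title field, ranked by ES's relevance score.
resp = requests.post(f"{ES}/courses/_search",
                     json={"query": {"match": {"title": "mysql"}}})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```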

 

Lucene full-text search 1: https://blog.csdn.net/zhang18024666607/article/details/78216635

Lucene full-text search 2: https://zhuanlan.zhihu.com/p/73875797

ES introductory tutorial: http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

ES full-text search: https://www.cnblogs.com/softidea/p/6119362.html

Alibaba Cloud ES service: https://www.aliyun.com/product/bigdata/product/elasticsearch

Alibaba Cloud search service: https://www.aliyun.com/product/opensearch

Xunsearch search service: http://xunsearch.com

Solr learning and use: https://www.cnblogs.com/yanduanduan/p/7344667.html

 

 

[Scheme 5: Other open source full-text search engines or search libraries]

Recommendation rating: ☆☆

Programming language: Golang/C/C++/Java

Applicable scenarios: besides the well-known engines such as ES/Lucene/Solr, there are many other good open-source search engines worth studying and trying out, to see whether they fit your business scenario.

Disadvantages: some require secondary development or solid build-and-deployment skills, and for some engines you also have to consider ongoing updates and maintenance.

 

Classic full-text search engine: Xapian (C++, https://xapian.org )

Tencent WeChat full-text search engine: wwsearch (C/C++, https://github.com/Tencent/wwsearch )

Peking University search engine TSE: LBTSE (C/C++, https://gitee.com/lewsn2008/LBTSE )

Redis search engine: RediSearch (C, https://github.com/RediSearch/RediSearch )

Distributed full-text search engine: Riot (Go, https://gitee.com/veni0/riot )

Wukong search engine: Wukong (Go, https://github.com/huichen/wukong )

Standalone search engine: nsearch (Go, https://github.com/HughNian/nsearch )

Compressed inverted-index full-text search engine: Zettair (C, http://www.seg.rmit.edu.au/zettair )

 

Other search engine recommendations: https://blog.csdn.net/xum2008/article/details/8740063

 

 

[Scheme 6: Building a full-text search service entirely from scratch]

Recommendation rating: ☆☆

Programming language: Java/Python/Golang/C++

Applicable scenarios: the data volume depends on the scale of the business; roughly 100,000 to one million records is a reasonable range. If you want a self-developed full-text search engine tailored to your own business, consider this scheme.

Disadvantages: you have to write a great deal of code yourself, and performance, reliability, and so on depend entirely on your own implementation.

 

1. Main services of a full-text search engine

a. Data interface service: provides the interfaces for ingesting data and for querying, typically exposed over TCP or HTTP; it should handle large-scale concurrent requests and apply engineering-level optimizations such as caching query results so that cached hits can be returned directly without calling the retrieval service.

b. Data storage service: normalizes the incoming data, performs basic word segmentation and weight calculation, stores the index and its mapping to the original content in an inverted list / hashmap / tree structure, and finally produces the corresponding data files and index files.

c. Data retrieval service: segments the incoming query, computes relevance weights (TF-IDF / BM25), pulls the matching index entries and the final data, and returns results sorted by weight, including the core keywords and the corresponding text content.

d. Depending on your business, other services can be added as well, such as data crawling and data cleaning.

 

2. Main algorithms and techniques involved

 

a. Word segmentation algorithms: broadly divided into traditional mechanical segmentation and machine-learning-based segmentation. For simple application scenarios, traditional mechanical segmentation is simple and effective; more advanced natural language processing (NLP) techniques go deeper into understanding semantics. (Common segmentation algorithms include forward maximum matching, reverse maximum matching, n-gram matching, CRF conditional random fields, word-frequency statistics, and others, and many open-source libraries can be used directly; a forward-maximum-matching sketch follows the reference link below.)

Reference document: https://blog.csdn.net/fox_wayen/article/details/78416181
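A minimal sketch of forward maximum matching, one of the mechanical segmentation algorithms listed above; the tiny dictionary is made up for illustration:

```python
# Sketch: forward maximum matching (FMM) segmentation against a word dictionary.
# At each position, greedily take the longest dictionary word that matches.
def fmm_segment(text, dictionary, max_word_len=6):
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            matched = text[i]  # unknown character: emit it as a single-char word
        words.append(matched)
        i += len(matched)
    return words

# Tiny illustrative dictionary.
dictionary = {"mysql", "引擎", "innodb", "介绍", "入门", "优化"}
print(fmm_segment("mysql引擎innodb介绍", dictionary))
```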

 

b. Keyword-to-text association structure and algorithm: the inverted list (inverted index) is the usual approach; it is the technique that associates each search keyword with the texts that finally contain it. (Many open-source implementations can be used directly.)

Reference document: https://zhuanlan.zhihu.com/p/145934063

 

c. Dictionary storage structure and algorithm: the word dictionary is usually stored with a hash table or a trie (dictionary tree / prefix tree); many open-source implementations can be used directly. (A minimal trie sketch follows the reference links below.)

Reference document 1: https://zhuanlan.zhihu.com/p/34747612

Reference document 2: https://blog.csdn.net/youfefi/article/details/72886646
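A minimal trie (prefix tree) sketch for dictionary storage, as mentioned above:

```python
# Sketch: a minimal trie (prefix tree) for storing a word dictionary,
# supporting exact word lookup and prefix queries.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

trie = Trie()
for w in ("mysql", "sql", "sql优化"):
    trie.insert(w)
print(trie.contains("sql"), trie.starts_with("sq"))  # True True
```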

 

d. Keyword-to-text relevance weighting: the TF-IDF model and the BM25 algorithm, which compute how relevant a keyword is to a document mainly from word frequencies and related statistics. (This is crucial to result quality; there are many algorithms, but these are the most common and simplest ones. ES also supports the TF-IDF and BM25 algorithms; a minimal BM25 sketch follows the reference links below.)

Reference document 01: https://www.pianshen.com/article/3957200800

Reference document 02: https://www.cnblogs.com/jiangxinyang/p/10516302.html

Reference document 03: https://my.oschina.net/stanleysun/blog/1617727

Reference document 11: https://www.jianshu.com/p/f70d3dba74cc

Reference document 12: https://blog.csdn.net/zrc199021/article/details/53728499

Reference document 21: https://blog.csdn.net/Tink1995/article/details/104745144

Reference document 22: https://www.jianshu.com/p/53e379483f3e
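A minimal BM25 sketch over pre-tokenized documents; the parameter values k1 = 1.5 and b = 0.75 are common defaults, and the toy documents are made up:

```python
# Sketch: BM25 relevance scoring over pre-tokenized documents.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """docs is a list of token lists; returns one score per document."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency of each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [["mysql", "引擎", "innodb", "介绍"],
        ["mysql", "入门"],
        ["sql", "优化"]]
print(bm25_scores(["mysql"], docs))
```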

 

 

3. Recommended books on full-text search (if you want to study the topic in depth)

"Introduction to Information Retrieval": https://item.jd.com/12554083.html

"Search Engine - Principle Technology and System": https://item.jd.com/12496373.html

"This is the Search Engine: Detailed Explanation of Search Engine Technology": http://jxz1.j9p.com/pc/zjsssyq.zip (e-book)

"Homemade Search Engine": https://item.jd.com/11837411.html

"Writing Your Own Distributed Search Engine" https://item.jd.com/12202453.html

"In-depth understanding of ElasticSearch": https://item.jd.com/12617323.html

 

 


About the Author:

Heiyeluren (heiyeluren) currently works at Xueersi Online School, where he chairs the Technical Committee. He is a CSDN blog technical expert and an Internet back-end architect with many years of PHP/C/C++/Golang development experience, an enthusiast of the LNMP stack, distributed systems, high concurrency and related technologies, and an evangelist for the domestic open-source community and open-source technology.

 

Personal blog: http://blog.csdn.net/heiyeshuwu

Sina Weibo: http://weibo.com/heiyeluren

WeChat official account: Night Passerby

 

 

 


Source: https://blog.csdn.net/heiyeshuwu/article/details/112582583