Build a search engine of your own

I. Introduction

I have had quite a lot of contact with search engines. Let me briefly recall what I have done before. I kept no documentation of that earlier work and much of it has nearly been forgotten, which is a pity.

1. I played with Lucene more than ten years ago, when I was building a corporate website at Dongqi Software. At that time Chinese word segmentation was still very weak, and many words simply could not be found through search. If I remember correctly, the site was a corporate website for the Bank of China, so my work ended up being used by the bank. People look down on that kind of work, but although the people in the bank's technical department could build the bank's information systems, they worked in C or Delphi; they did not know JSP and could not build a website at all, haha.

2. I was exposed to three search engines at Alibaba. I integrated Taobao Education with VSearch and integrated classified information with Taobao's main search, and I also invited the Taobao final search team to share their technical architecture with our team. Both VSearch and final search were built on Lucene and Solr, with some extensions that provided a framework for importing full and incremental data. VSearch served Taobao's vertical market strategy at the time, because the integration cycle with the main search was extremely long and many customized needs could not be met. Final search was a competing product developed by another team; the name literally means "to unify all searches", and the two groups fought very hard back then. The main search was developed in C, with an initial architecture that supposedly came from Yahoo; Yitao and the searches on Taobao list pages (called the Hasper system, as I recall) were all built on the main search engine.

3. When I was working on stock software at Niubang, queries over new-stock data were served by ElasticSearch. I was responsible for writing Python crawlers to crawl new-stock data from several websites and import it into ElasticSearch, while the team lead was responsible for setting up ElasticSearch and providing the service interfaces.

4. When the company was in the milk business, I once asked a colleague to set up ES, mainly to handle queries over the delivery order change log. However, the business never picked up, and after the setup was finished we never switched over.

5. For the curtain system, we are now setting up ES to solve some problems encountered in practice, such as daily inventory value report snapshots, real-time inventory report exports, and various other reports. These currently depend entirely on the database, the burst load is fairly high, and some real-time data query and export requirements are hard to meet with a database alone.

II. MySQL vs ElasticSearch

The usual way to use ES is to synchronize a copy of the data stored in MySQL into ES and then use ES for real-time queries over massive data sets. Fuzzy search is only a small part of what it solves.

1. A relational database stores structured business data and is designed mainly to serve the business workflow. If the design focuses too heavily on serving queries, the whole schema becomes confusing and highly redundant. ES is designed for querying: data from multiple tables can be merged into a single schema (or type), which eliminates many expensive multi-table join queries (see the sketch after this list).

2. MySQL uses transactions to guarantee that no dirty data is produced, while ES has no transaction support, so MySQL is generally used to store the original data.

3. MySQL needs sharding (splitting databases and tables) to handle massive data, yet in the end a query still lands on one particular table in one particular database. ES, by contrast, is distributed by nature: data is stored in shards, and one node coordinates each query. When node A receives a request, it forwards it to the data nodes; each data node queries locally and returns the matching document IDs to node A; node A then sorts and pages all the results, fetches the original documents from the data nodes by ID, and returns them to the user.

4. Even for simple single-record queries, Lucene's inverted index in ES is faster than MySQL's B+TREE.

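To make point 1 concrete (and to show the paging described in point 3), here is a minimal sketch in Python. It assumes the official elasticsearch-py 8.x client and a local node at http://localhost:9200; the index name, field names, and sample rows are invented for illustration.

```python
from elasticsearch import Elasticsearch

# Hypothetical rows that would live in two separate MySQL tables.
order_row = {"order_id": 1001, "user_id": 7, "amount": 59.9, "status": "shipped"}
user_row = {"user_id": 7, "user_name": "qingcai", "city": "Hangzhou"}

# In ES the two rows are merged into one denormalized document (one schema/type),
# so a query that would need a JOIN in MySQL becomes a single-index search.
doc = {
    "order_id": order_row["order_id"],
    "amount": order_row["amount"],
    "status": order_row["status"],
    "user_name": user_row["user_name"],
    "user_city": user_row["city"],
}

# Assumes elasticsearch-py 8.x and a single local node.
es = Elasticsearch("http://localhost:9200")
es.index(index="orders", id=str(doc["order_id"]), document=doc)

# A paginated query: the coordinating node fans the request out to the shards,
# merges and sorts the shard results, then fetches and returns one page.
resp = es.search(
    index="orders",
    query={"match": {"user_city": "Hangzhou"}},
    from_=0,
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```
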
III. Inverted index

ES is a search service built on Lucene: it provides a distributed, multi-user full-text search engine, whereas Lucene itself is only a full-text search engine toolkit.

1. What is an inverted index?

The inverted index is a concept from full-text search. Since there is an inverted index, there is naturally also a forward index; both are full-text search concepts. MySQL's InnoDB uses a B+TREE index, which has nothing to do with forward or inverted indexes.

Forward index: start from a document (the full text) and find the words it contains; this is the full-text forward index.

Document--->Word 1, Word 2

The number of occurrences of Word 1 and the positions where Word 1 appears; the number of occurrences of Word 2 and the positions where Word 2 appears.

Inverted index: start from a word and find the documents (full texts) that contain it; this is the full-text inverted index.

Word 1--->Document 1, Document 2, Document 3

Word 2--->Document 1, Document 2

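As a minimal sketch of the two structures (just the idea in Python dictionaries, not Lucene's actual format), the forward index can be turned into the inverted one like this:

```python
# Forward index: each document lists the words it contains.
forward_index = {
    "Document 1": ["Word 1", "Word 2"],
    "Document 2": ["Word 1", "Word 2"],
    "Document 3": ["Word 1"],
}

# Inverted index: each word lists the documents that contain it.
inverted_index = {}
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index.setdefault(word, []).append(doc_id)

print(inverted_index)
# {'Word 1': ['Document 1', 'Document 2', 'Document 3'],
#  'Word 2': ['Document 1', 'Document 2']}
```
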
2. Example of inverted index generation

Article 1 content: qingcai lives in Hangzhou, I live in Hangzhou too.

Article 2 content: he once lived in Shangrao.

Step 1: Get the keywords

English text is relatively simple to tokenize: you can split directly on spaces. Chinese text needs a word segmenter. The analyzer also filters out meaningless words such as "in" and "too", strips punctuation marks, normalizes case, and reduces "lives" and "lived" to "live", so that a search for "live" will also match documents containing "lives" or "lived".

Article 1 Keywords: [qingcai][live][hangzhou][i][live][hangzhou]

Article 2 Keywords: [he][live][shangrao]

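A rough sketch of that analysis step in Python. The stop-word list and the hard-coded lives/lived -> live normalization are simplified stand-ins for what a real analyzer (for example Lucene's English analyzer, or IK for Chinese) would do; they are only meant to reproduce the keyword lists above.

```python
import re

# Ad hoc stop words chosen so the output matches the keyword lists above.
STOP_WORDS = {"in", "too", "once"}

def analyze(text: str) -> list[str]:
    # Lowercase, strip punctuation, and split into words.
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = []
    for token in tokens:
        # Crude normalization so that "lives" and "lived" both become "live".
        if token in ("lives", "lived"):
            token = "live"
        if token not in STOP_WORDS:
            keywords.append(token)
    return keywords

print(analyze("qingcai lives in Hangzhou, I live in Hangzhou too."))
# ['qingcai', 'live', 'hangzhou', 'i', 'live', 'hangzhou']
print(analyze("he once lived in Shangrao."))
# ['he', 'live', 'shangrao']
```
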
Step 2: Create the inverted index

The keywords are sorted in lexicographic order, so a binary search can quickly locate any keyword.

Lucene saves these three kinds of information (the keywords, their document frequencies, and their positions) in the following three files.

  • Dictionary file (Term Dictionary)

  • Frequency file (frequencies)

  • Position file (positions)

The Term Dictionary also stores pointers into the other two files, and finally Lucene compresses the index.

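Continuing the sketch, here is a toy in-memory version of those structures in Python: a sorted term dictionary whose entries hold per-document frequencies and positions. Real Lucene stores them as compressed on-disk files; this only mirrors the idea (analyze() is the same simplified function as in the Step 1 sketch).

```python
import re

STOP_WORDS = {"in", "too", "once"}

def analyze(text: str) -> list[str]:
    # Same simplified analyzer as in the Step 1 sketch.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ["live" if t in ("lives", "lived") else t
            for t in tokens if t not in STOP_WORDS]

docs = {
    1: "qingcai lives in Hangzhou, I live in Hangzhou too.",
    2: "he once lived in Shangrao.",
}

# term -> {article number: {"freq": occurrences, "positions": [word positions]}}
term_dictionary = {}
for doc_id, text in docs.items():
    for position, term in enumerate(analyze(text)):
        entry = term_dictionary.setdefault(term, {}).setdefault(
            doc_id, {"freq": 0, "positions": []})
        entry["freq"] += 1
        entry["positions"].append(position)

# The terms are kept in sorted order so that binary search is possible.
sorted_terms = sorted(term_dictionary)
print(sorted_terms)
# ['hangzhou', 'he', 'i', 'live', 'qingcai', 'shangrao']
print(term_dictionary["live"])
# {1: {'freq': 2, 'positions': [1, 4]}, 2: {'freq': 1, 'positions': [1]}}
```
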
3. Index query

To query the word "live", Lucene performs a binary search on the dictionary file (Term Dictionary), follows the stored pointers to read out all the article numbers, and then returns the result data. The dictionary file is generally very small, so the entire query process is fast and efficient.

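A sketch of that lookup over the toy index built above: a binary search on the sorted term list, then the matching entry gives the article numbers, frequencies, and positions. Again, this only illustrates the idea; Lucene's real term dictionary and postings are compressed on-disk structures.

```python
import bisect

# Toy index from the previous sketch, hard-coded so the example stands alone.
sorted_terms = ["hangzhou", "he", "i", "live", "qingcai", "shangrao"]
postings = {
    "live": {1: {"freq": 2, "positions": [1, 4]}, 2: {"freq": 1, "positions": [1]}},
    # entries for the other terms omitted for brevity
}

def lookup(term: str) -> dict:
    # Binary search over the sorted term dictionary.
    idx = bisect.bisect_left(sorted_terms, term)
    if idx == len(sorted_terms) or sorted_terms[idx] != term:
        return {}  # the term is not in the dictionary
    # In Lucene the dictionary entry holds pointers into the frequency and
    # position files; here the postings dict plays that role.
    return postings.get(term, {})

print(lookup("live"))
# {1: {'freq': 2, 'positions': [1, 4]}, 2: {'freq': 1, 'positions': [1]}}
```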