Thoughts on Elasticsearch query_string + match_phrase at the scale of hundreds of billions of documents

In public opinion analysis, the data scale is usually above 100 billion documents. Building a search engine on Elasticsearch and running analysis on it at this scale faces many challenges.
One constraint up front: in the public opinion scenario, we must use the match_phrase syntax to do exact sentence matching against articles.
In this article:
1. I will first describe the challenges we face;
2. then walk through the retrieval process of a match_phrase query, with questions in mind;
3. examine the underlying principles;
4. based on those principles, consider what optimizations are possible;
5. and share some of the optimization methods I applied to these challenges.

Goal

Explore performance optimization schemes for exact sentence matching in ES at the scale of hundreds of billions of documents. In a real-time, interactive setting, with this many searches, the goal is to return results within about 3 seconds. This article first describes the many challenges of using ES for retrieval in the public opinion analysis context, then dissects ES's retrieval mechanism in depth.

Challenges

  • The data scale is large. Public opinion analysis usually needs to search at least three months of data, drawn from essentially all the public media data on the internet: Weibo, Douyin, Facebook, Twitter, and so on. If the daily data volume stored in ES is N TB (in practice at least 2 TB), then three months of data is 90 × N TB, and some retrievals cover one or even two years, i.e. 365 × 2 × N TB. Real-time interaction at this scale, with results in single-digit seconds, requires enormous computing resources. In our experience a single ES instance runs efficiently with about 2 TB of data, and performance degrades beyond that; because we cannot afford more machines, each of our nodes carries 8 TB.

  • Exact sentence matching is required. Public opinion retrieval is usually not ES-style relevance retrieval but precise sentence matching, which forces us to use ES's match_phrase syntax. Anyone familiar with ES knows that ES is best at term queries, then match queries, and only then match_phrase, which is roughly equivalent to a `%China%`-style LIKE search.

  • The search conditions are complex, with many keywords. Queries typically use many must and must_not clauses, and the statement contains multiple operators, sub-clauses and filters; a single retrieval may involve 100+ search terms. This forces us to use the query_string syntax, with the matching mode set to the same logic as a phrase (i.e. match_phrase).

  • The full data set must be hit. A question-answering system like Baidu or Google only needs to return the answers most relevant to the question. In the public opinion scenario we must hit the full data set, because users want to know exactly how many results match the condition.

  • Heavy aggregation analysis. ES aggregation performance is not high, because it consumes a lot of CPU and disk IO; when the data scale far exceeds the machine scale, overall retrieval becomes very poor. I have considered OLAP-oriented databases such as ClickHouse (CK), but they are not good at sentence matching.

  • Searching unstructured data. Searches usually run over whole articles rather than a single structured field, which makes retrieval far harder; for aggregation analysis over such data, databases like ClickHouse are of no help at all.

  • Real-time interaction. Results should come back within 3–5 s whenever possible. All the problems above become far more severe under this requirement.
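To make the "complex conditions" point concrete, here is a minimal Python sketch of how such a condition could be assembled into a query_string body with phrase semantics. The field name `content` and the example phrases are hypothetical; quoting a term in query_string syntax is what turns it into a phrase match:

```python
import json

def build_opinion_query(must_phrases, must_not_phrases, field="content"):
    """Build a query_string body that ANDs quoted phrases (phrase matching)
    and excludes others, mirroring a many-keyword public-opinion condition."""
    positive = " AND ".join(f'"{p}"' for p in must_phrases)
    negative = " ".join(f'NOT "{p}"' for p in must_not_phrases)
    qs = f"({positive}) {negative}".strip()
    return {
        "query": {
            "query_string": {
                "query": qs,
                "default_field": field,
            }
        }
    }

body = build_opinion_query(["new energy vehicles", "battery recall"], ["used car"])
print(json.dumps(body, indent=2))
```

In a real condition the two lists would carry the 100+ terms mentioned above; the point is only that every quoted term inside the query_string executes with match_phrase logic.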

In short:

All of the above boils down to insufficient resources. Any of these problems can be solved by adding resources; the problem is that resources are very expensive, so what we have to do is optimize as far as possible within limited resources.

The retrieval principle of ES's match_phrase

Let's see why match_phrase is slow.

A match_phrase query is an Elasticsearch query type used to exactly match documents containing a specific sequence of words: it finds documents that contain the given phrase, with the words appearing in the correct order.
ES organizes data in only one way: text is segmented into terms and stored in inverted (posting) lists. In theory the best performance comes from querying a single, short term with a term query, because that only requires one lookup in the term dictionary and a read of its posting list; a phrase query must additionally read and verify the positions of every term.
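A toy model (pure Python, not ES internals) makes the cost difference visible: a term query is one dictionary lookup, while a phrase query must intersect posting lists and then verify that the terms sit at adjacent positions:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]} — the shape match_phrase relies on."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def term_query(index, term):
    # A term query is a single dictionary lookup: cheap.
    return set(index.get(term, {}))

def phrase_query(index, phrase):
    # A phrase query intersects doc IDs AND verifies consecutive positions: expensive.
    terms = phrase.split()
    candidates = set.intersection(*(term_query(index, t) for t in terms))
    hits = set()
    for doc in candidates:
        for p in index[terms[0]][doc]:
            if all(p + i in index[t][doc] for i, t in enumerate(terms)):
                hits.add(doc)
                break
    return hits

docs = ["quick brown fox", "brown quick fox", "the quick brown dog"]
index = build_index(docs)
print(phrase_query(index, "quick brown"))  # only docs where the terms are adjacent, in order
```

Document 1 contains both "quick" and "brown" and would match a plain boolean query, but the phrase check rejects it because the order is wrong — that extra position verification is exactly the work a term query never does.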

Process analysis of match_phrase query

  1. Query parsing (Query Parsing): the query string entered by the user is parsed into an internal query object (Query Object) and checked for syntax and semantics by the query parser (Query Parser). During parsing, Elasticsearch builds the query from the match type, search fields, match conditions and other information; for a match_phrase query it constructs a MatchPhraseQueryBuilder object with the query string as a parameter.

  1. Query optimization (Query Optimization): the internal query object is rewritten to improve performance — simplifying query logic, merging duplicate clauses, narrowing the query scope, and so on. In this phase Elasticsearch uses the query optimizer (Query Optimizer) to rewrite the query object.

  1. Query execution (Query Execution): the query object is turned into operations against the inverted index (Inverted Index) to find documents matching the conditions. In this phase, Elasticsearch (via Lucene) reads the following index files:

  • .tim and .tip files: the term dictionary and its index, used to locate a term and the pointers to its posting data.

  • .doc file: the posting lists — document IDs and term frequencies for each term, used to find which documents contain a term.

  • .pos and .pay files: term positions within documents, plus payloads and offsets; the position data is exactly what match_phrase uses to verify word order and adjacency.

  • .liv file: the live-documents bitset, indicating which documents still exist and which have been marked for deletion (older Lucene versions used .del files for this).

  • .fdt and .fdx files: stored field data and its index — the original document content that is returned in results.

  • .nvd and .nvm files: norms data and metadata (per-field document length normalization factors), used as weighting inputs when computing relevance scores.

  • .tvx and .tvd files: term vectors, providing quick per-document access to the terms of a field.

In the query execution phase, Elasticsearch uses these files to find the document IDs matching the query conditions, and applies further techniques to improve performance, such as Boolean-clause optimization and term-lookup optimization.

  1. Scoring: the matching documents are sorted by relevance score (Relevance Score) to produce the final search results. Elasticsearch computes scores with its similarity algorithm — BM25 by default, classic TF-IDF in older versions; these are statistical ranking functions, not machine learning models. The computation uses the norms stored in the .nvd and .nvm files (per-document field-length information) together with term statistics such as document frequency from the term dictionary. Documents with higher scores rank higher in the results.

  1. Deleted-document handling: Elasticsearch uses the live-docs information (the .liv file; .del files in older Lucene versions) to determine which documents exist (live) and which have been marked for deletion. When searching, documents marked as deleted are skipped so they never appear in the results.

  1. Cache handling: to improve search performance, Elasticsearch caches some results in memory to respond quickly to subsequent requests. The two main caches are the node query cache (also called the filter cache), which caches which documents match a filter clause, and the shard request cache, which caches the full result of a search request on a shard.

  1. Result return: finally, Elasticsearch returns the matching document IDs to the user and, if required, the document content, which is read from the .fdt stored-fields file via the .fdx index so that the corresponding field values can be fetched quickly.
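One of the "Boolean operation optimizations" mentioned in the execution phase is merging sorted posting lists efficiently. A simplified two-pointer intersection illustrates the idea; real Lucene additionally stores skip data so it can jump over whole blocks of doc IDs instead of stepping one at a time:

```python
def intersect_postings(a, b):
    """Intersect two sorted doc-ID lists with two pointers, O(len(a) + len(b)).
    Lucene's real implementation adds skip lists to leap over non-matching blocks."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_postings([1, 4, 7, 9, 20], [2, 4, 9, 21]))  # [4, 9]
```

For a phrase of k terms, k such lists must be merged before positions are even checked, which is why many-term phrase conditions get expensive quickly.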

Source code of the match_phrase query - in Elasticsearch

In Elasticsearch, the source code for processing match_phrase queries is mainly distributed in the following files:

  1. org.elasticsearch.index.query.MatchPhraseQueryBuilder: this class defines the structure of a match_phrase query and implements its parsing, construction, and conversion into a Lucene query.

  1. org.elasticsearch.index.query.MatchPhraseQueryParser: in older versions, this class parses a match_phrase query statement and produces a MatchPhraseQueryBuilder object; in newer versions this parsing is handled by the builder itself.

  1. org.elasticsearch.index.mapper.TextFieldMapper: this class defines the mapping rules for text fields, including how text is analyzed (word segmentation) and how inverted-index entries, with positions, are produced.

  1. org.apache.lucene.search.PhraseQuery: on the Lucene side, this is the class that ultimately executes phrase matching. It walks the position-aware posting lists of all the phrase's terms in parallel and verifies that they occur at consecutive positions, implementing both matching and scoring; Elasticsearch's MatchPhraseQueryBuilder ultimately produces a Lucene PhraseQuery.

The source code for these files can be found in the Elasticsearch GitHub repository.

Please note that the above refers to the Elasticsearch 7.x source tree. If you are using another version of Elasticsearch, the corresponding classes may live elsewhere in the repository for that version.

match_phrase query - related syntax and APIs

  1. Lucene Query Syntax: the match_phrase query builds on Lucene's query capabilities. Lucene query syntax supports operators such as AND, OR, NOT, * and ?, and uses parentheses and quotation marks to control precedence and phrase boundaries. This syntax is used not only in Elasticsearch but also in other Lucene-based search engines, such as Solr and Amazon CloudSearch.

  1. Elasticsearch REST API: Elasticsearch exposes a REST API for managing and querying the cluster. A match_phrase query can be executed by sending a JSON body containing the match_phrase condition via HTTP POST to /{index}/_search; the REST API also provides parameters and options for controlling query behavior and result format.

  1. Elasticsearch Java API: the officially provided Java client library offers a set of interfaces and classes for connecting to and operating the Elasticsearch cluster. A match_phrase query can be built with the MatchPhraseQueryBuilder class and executed via a SearchRequest, after which the returned results are processed.

These tools and libraries all relate to using and implementing match_phrase queries; Elasticsearch's official documentation and related tutorials cover how to use and optimize them in more depth.
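As a sketch of the REST usage described above, this is the kind of JSON body one would POST to /{index}/_search; the index name, field name, and phrase are hypothetical:

```python
import json

# Body for: POST /news/_search   ("news" is a hypothetical index name)
search_body = {
    "query": {
        "match_phrase": {
            "content": {
                "query": "supply chain disruption",
                "slop": 0  # 0 = terms must be strictly adjacent and in order
            }
        }
    },
    "size": 10
}
print(json.dumps(search_body, indent=2))
```

Raising `slop` relaxes the adjacency requirement (allowing terms to be a few positions apart), at additional matching cost.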

Optimizations I can think of

Building a decent-sized cluster requires sensible capacity planning; with sufficient resources, all of these problems can be solved.

Starting from the essence of the problems listed in the challenges section, let's try to attack them directly.

  • Data pruning. At this scale, with limited resources, we should start by reducing the amount and scope of data. The overall direction: store the raw documents heterogeneously and treat ES purely as an index engine, not as the place to fetch data from; reduce the data range each retrieval touches by adjusting the data organization, e.g. partitioning data by time; and streamline search conditions as much as possible to reduce the number of hits.

  • Segmented requests. The data scale is large and the interaction is real-time: users expect a response within 3–5 seconds. Besides shrinking the data itself (the data-pruning idea again), we can design the interaction to scan part of the data instead of all of it. A three-month retrieval over months A, B and C can be split so that month A is retrieved first and its results returned early, meeting the 3–5 s expectation as far as possible; if month A already satisfies the user's needs, B and C never have to be searched, and in the optimistic case a three-month search becomes a one-month search, touching only a third of the data. The problem with this idea is sorting: if results must be sorted globally, how do we do it? Time-based sorting is fine, but sorting by likes or reposts across the whole range does not work. A further benefit of segmented requests is segmented matching: long requests can be divided into several segments, matched separately, and merged at the end, which reduces memory consumption and allows multi-threaded concurrent processing to improve query performance.

  • Skip the scoring process and drop relevance computation where possible. In Elasticsearch you can bypass scoring to speed up queries: run the condition in filter context (a bool query's filter clause) or wrap it in a constant_score query, which assigns a fixed score to every matching document instead of computing relevance. This improves performance in scenarios that only need to filter out non-matching documents. Where scoring is still needed, the boost parameter can be used to influence the relevance score and thus the ranking of documents in the results.

  • For hitting the full data set, return the count separately and asynchronously. Many analysis requests can find a way out in an asynchronous or offline direction; heavy aggregation analysis consumes a lot of resources and rarely responds within 5 s.

  • Optimize the analyzer (tokenizer). For long-sentence matching, a suitable analyzer can be chosen — e.g. an N-gram or Edge N-gram tokenizer — to divide long sentences into short grams for matching, reducing memory consumption and query time.

  • Still to be worked out: I keep feeling that when matching many keywords, the merging of posting lists can be optimized. I will study this later and share when there are results.

  • Explore cache utilization: put the things queries need most into the cache.

  • Research on the underlying IO. Also in progress; I will add to and share this when there are results.
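The segmented-request idea above can be sketched as a client-side loop over time-partitioned indices. `search(index)` is a stand-in for a real ES client call, and the index names are hypothetical:

```python
def segmented_search(months, search, page_size=10):
    """Query time-partitioned indices newest-first and stop as soon as one
    page of results is filled, instead of scanning the full time range.
    `search(index)` stands in for a real ES client call."""
    results = []
    for index_name in months:          # e.g. ["news-2023-03", "news-2023-02", ...]
        results.extend(search(index_name))
        if len(results) >= page_size:  # enough to show the user; skip older months
            break
    return results[:page_size]

# Fake backend: month A alone fills the page, so months B and C are never scanned.
calls = []
def fake_search(index):
    calls.append(index)
    return [f"{index}-doc{i}" for i in range(12)]

hits = segmented_search(["m-A", "m-B", "m-C"], fake_search)
print(len(hits), calls)  # 10 ['m-A']
```

This only works out of the box when results are sorted by the partitioning key (time); global sorting by likes or reposts would still require touching every segment.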
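The scoring-skip suggestion can be sketched as wrapping the phrase in filter context; the sort field `publish_time` is a hypothetical replacement for `_score` ordering:

```python
def no_score_phrase(field, phrase):
    """Build a search body that runs a phrase match in filter context, so
    Elasticsearch skips BM25 relevance scoring for the matching documents."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match_phrase": {field: phrase}}
                ]
            }
        },
        "track_total_hits": True,  # still count every hit, just don't score them
        "sort": [{"publish_time": "desc"}]  # hypothetical time field replaces _score order
    }
```

Because nothing is sorted by relevance here, an explicit sort field is required; time ordering is the natural fit for the public opinion scenario.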
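The analyzer suggestion could look like the following illustrative index-settings fragment. The 2-gram sizes and all names here are example choices, and N-grams trade a larger index for faster phrase-style matching:

```python
# Illustrative index settings: a custom 2-gram analyzer applied to a text field.
# All names (bigram_tokenizer, bigram_analyzer, content) are example choices.
ngram_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "bigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 2,
                    "token_chars": ["letter", "digit"]
                }
            },
            "analyzer": {
                "bigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "bigram_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "bigram_analyzer"}
        }
    }
}
```

With every document indexed as overlapping 2-grams, a long phrase becomes a sequence of short, high-selectivity terms, which is the effect the bullet above is after.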


Origin blog.csdn.net/star1210644725/article/details/129647096