The theory behind Elasticsearch relevancy scoring

Description: how are Elasticsearch query results sorted? They are sorted by relevance score. This article explains the theory behind the Elasticsearch scoring mechanism.

Mainly translated from the official Elasticsearch documentation:

Theory behind relevance scoring: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

Boosting by popularity: https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html

These are the two chapters I found most interesting; the documentation contains other scoring chapters as well, which are not covered here.

1 Boolean model

The Boolean model simply applies the "and", "or", and "not" conditions expressed in the query to find all matching documents.

This process is simple and fast. It is used to exclude any documents that cannot possibly match the query: the Boolean model only judges whether a document matches at all, as shown in the sketch below.
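As a brief sketch of how these Boolean conditions map onto the Elasticsearch bool query (the field name and terms here are illustrative, not from the source):

GET /_search
{
  "query": {
    "bool": {
      "must":     { "match": { "text": "quick" }},
      "must_not": { "match": { "text": "lazy"  }},
      "should": [
        { "match": { "text": "brown" }},
        { "match": { "text": "fox"   }}
      ]
    }
  }
}

Here must behaves like "and", should like "or", and must_not like "not". The Boolean model uses these conditions only to decide which documents match, not how well they match.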

2 The Term Frequency / Inverse Document Frequency (TF/IDF) model

1. Term frequency calculation

How often does a term appear in a field? We expect a term that appears more often to have a higher weight: a field containing a term five times is more likely to be relevant than a field containing the same term just once. The term frequency is calculated as follows:

 

 

tf(t in d) = √frequency

 

Formula explanation: the term frequency of term t in document d is the square root of the number of times the term appears in the document. For example, a term that appears four times gets tf = √4 = 2, only twice the weight of a term that appears once.

 

If you don't care how often a term appears in a field, but only whether it appears at all, you can disable term frequencies in the field mapping. Disabling them also disables term positions, so you can no longer count how many times a term occurs, and phrase and proximity queries will not work on that field. not_analyzed string fields use this setting by default.
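A minimal mapping sketch, using the index_options setting of the string field type from the Elasticsearch versions this guide describes (my_index, doc, and text are placeholder names):

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs"
        }
      }
    }
  }
}

With "index_options": "docs", only the fact that a term occurs in the field is indexed; term frequencies and positions are discarded.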

 

 

2. Inverse document frequency calculation

 

The more often a term appears in all the documents of the collection, the lower its weight. Common terms like "and", "that", and "of" contribute little to relevance, since they appear in almost every document, while uncommon terms like "artemisinin" or "proportional" help us zero in on the documents we are most interested in. The inverse document frequency is calculated as follows:

 

 

idf(t) = 1 + log ( numDocs / (docFreq + 1))

 

Formula explanation: the inverse document frequency of term t is one plus the logarithm of the number of documents in the index divided by one plus the number of documents containing the term.
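As a worked example (Lucene uses the natural logarithm here): in the one-document index from the explain example later in this article, numDocs = 1 and docFreq = 1, so:

idf(fox) = 1 + ln(1 / (1 + 1)) = 1 + ln(0.5) ≈ 0.30685

which is exactly the idf value that appears in the explain output below.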

 

3. Field-length norm calculation

 

A shorter text field carries more weight: if a term appears in a short field such as a title, it is more likely that the field is about that term than if the same term appears in a much longer body field. The field-length norm is calculated as follows:

 

 

norm(d) = 1 / √numTerms

 

Formula explanation: the field-length norm is the inverse square root of the number of terms in the field. For example, a field with 10 terms gets norm = 1/√10 ≈ 0.316.

 

The field-length norm is important for full-text search fields, but many other fields don't need it. Norms consume roughly one byte per string field per document in the index, whether or not a document contains the field. not_analyzed string fields have norms disabled by default, and norms can also be disabled on analyzed fields in the mapping. For a field with norms disabled, field length plays no part in scoring: a long field and a short field are scored as if they were the same length. This suits logging use cases, for example, where all you care about is whether a field contains a particular error code or browser token; the length of the field does not affect the outcome. Disabling norms can also save a considerable amount of memory.
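A minimal mapping sketch for disabling norms on an analyzed field, again assuming the guide-era string mapping (names are placeholders):

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":  "string",
          "norms": { "enabled": false }
        }
      }
    }
  }
}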

 

 

4. Putting it all together

 

The three factors above (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time, and together they determine the weight of a single term in a particular document. When we talk about documents here, we really mean a field within a document: each field has its own inverted index, so for TF/IDF purposes the value of the field is effectively the value of the document. When we run a simple term query, the only factors involved in calculating the score are the ones explained in the preceding sections:

PUT /my_index/doc/1
{
  "text": "quick brown fox"
}

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}

The abbreviated explain output for the preceding request looks like this:

weight(text:fox in 0) [PerFieldSimilarity]: 0.15342641     (1)
result of:
    fieldWeight in 0 0.15342641
    product of:
        tf(freq=1.0), with freq of 1: 1.0                  (2)
        idf(docFreq=1, maxDocs=1): 0.30685282              (3)
        fieldNorm(doc=0): 0.5                              (4)

 

 

1 - The final score for the term "fox" in the text field of this document.
2 - The term "fox" appears once in the text field of this document.
3 - The inverse document frequency of "fox" across the text fields of all documents in the index. Note that the logarithm here is the natural logarithm (base e).
4 - The field-length normalization factor for this field. Only three terms are indexed here, so by the formula the norm should be 1/√3 ≈ 0.577, but the norm is stored in a single byte per field and loses precision, which is why it shows up as 0.5.

 

Queries usually consist of more than one term, so we need a way of combining the weights of multiple terms. For that, we turn to the vector space model.

 

 

5. Vector space model

The vector space model provides a way of comparing a multi-term query against a document. The output is a single score representing how well the document matches the query. To do this, both the document and the query are represented as vectors. A vector is really just a one-dimensional array of numbers, for example:

[1,2,5,22,3,8]

In the vector space model, each number in the vector is the weight of a term. The model calculates term weights with TF/IDF by default, but this is not the only option; Elasticsearch also supports other models such as Okapi BM25. TF/IDF is the default because it is a simple, efficient algorithm that produces high-quality search results and has stood the test of time. Imagine that we have a query for "happy hippopotamus". A common word like "happy" will have a low weight, while an uncommon word like "hippopotamus" will have a higher weight. Let's assume that "happy" has a weight of 2 and "hippopotamus" has a weight of 5. We can plot this simple two-dimensional vector [2,5] as a line starting at point (0,0) and ending at point (2,5).

 

 

Now, imagine that we have three documents:

1. I am happy in summer.
2. After Christmas I'm a hippopotamus.
3. The happy hippopotamus helped Harry.

 

We can create a similar vector for each document, consisting of the weight of each query term ("happy" and "hippopotamus") that appears in the document, and plot these vectors on the same chart.

 

The great thing about vectors is that they can be compared. By measuring the angle between the query vector and a document vector, it is possible to assign a relevance score to each document. The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning it is reasonably relevant, and document 3 is a perfect match. In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra, the branch of mathematics that deals with vectors, provides tools to compare the angle between multi-dimensional vectors, which means we can apply the same principles to queries containing many terms.
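The source does not name the comparison function, but the standard measure of the angle between two vectors is cosine similarity. A worked sketch with the assumed weights above (happy = 2, hippopotamus = 5), where each document vector holds the weight of each query term it contains:

cosine(q,d) = (q · d) / (|q| × |d|)

query "happy hippopotamus":   q  = [2,5],  |q| = √29 ≈ 5.39
document 1 (happy only):      d1 = [2,0],  cosine = 4 / (5.39 × 2) ≈ 0.37
document 2 (hippopotamus):    d2 = [0,5],  cosine = 25 / (5.39 × 5) ≈ 0.93
document 3 (both terms):      d3 = [2,5],  cosine = 29 / (√29 × √29) = 1.00

The numbers line up with the narrative: document 1 is barely relevant, document 2 is reasonably relevant, and document 3 is a perfect match.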

3 Lucene's practical scoring function

For multi-term queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them in a single efficient package that collects matching documents and scores them as it goes. A multi-term query is first broken down into individual terms, and the Boolean model is used to find the documents that match. Once a document matches, Lucene calculates its score by combining the scores of each matching term.

The practical scoring function looks like this:

score(q,d) =                 (1)
        queryNorm(q)         (2)
      · coord(q,d)           (3)
      · ∑ (                  (4)
            tf(t in d)       (5)
          · idf(t)²          (6)
          · t.getBoost()     (7)
          · norm(t,d)        (8)
        ) (t in q)           (9)

 

1 - score(q,d) is the relevance score of document d for query q.
2 - queryNorm(q) is the query normalization factor.
3 - coord(q,d) is the coordination factor.
4, 9 - The weights of each term t in the query q are summed for document d.
5 - tf(t in d) is the term frequency of term t in document d.
6 - idf(t) is the inverse document frequency of term t.
7 - t.getBoost() is the boost applied to the query.
8 - norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any.

 

1. The query normalization factor

 

The query normalization factor (queryNorm) is an attempt to normalize a query so that its results can be compared with the results of another query. That said, the sole purpose of relevance scoring is to sort the results of the current query in the correct order; you should not try to compare the relevance scores of different queries.

 

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is:

 

queryNorm = 1 / √sumOfSquaredWeights

 

sumOfSquaredWeights is calculated by adding together the squared idf of each term in the query.
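A worked sketch with hypothetical numbers: for a two-term query whose terms have idf values of 0.3 and 1.2,

sumOfSquaredWeights = 0.3² + 1.2² = 0.09 + 1.44 = 1.53
queryNorm = 1 / √1.53 ≈ 0.81

Because the same queryNorm is applied to every term in the query, it never changes the relative ordering of the results of a single query.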

 

2. The coordination factor

 

The coordination factor (coord) rewards documents that contain a higher percentage of the query terms: the more query terms that appear in a document, the more likely it is that the document is a good match for the query.

Imagine that we have a query for "quick brown fox" and that the weight of each term is 1.5. Without the coordination factor, the score would simply be the sum of the weights of the terms present in a document. For example:

Document with fox → score: 1.5
Document with quick fox → score: 3.0
Document with quick brown fox → score: 4.5

 

The coordination factor multiplies the score by the number of matching terms in the document, then divides it by the total number of terms in the query. With the coordination factor, the scores become:

Document with fox → score: 1.5 × 1/3 = 0.5
Document with quick fox → score: 3.0 × 2/3 = 2.0
Document with quick brown fox → score: 4.5 × 3/3 = 4.5

The coordination factor makes the document that contains all three terms much more relevant than the document that contains only two of them.

By default, all the should clauses of a bool query have coordination enabled, but you are allowed to disable it. Why would you? Well, usually you wouldn't: query coordination is usually a good thing, and when you use a bool query to wrap several high-level queries (such as match queries), leaving coordination enabled also makes sense, since the more clauses that match, the higher the degree of overlap between the search request and the returned document. However, in some advanced use cases disabling coordination can make sense. Imagine that you are looking for the synonyms jump, leap, and hop. You don't care how many of these synonyms are present, as they all represent the same concept; in fact, only one of them is likely to be present. This would be a good case for disabling the coordination factor:
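A minimal sketch using the bool query's disable_coord parameter, which exists in the Elasticsearch versions this guide covers (the field name is illustrative):

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

Disabling coordination removes the matched-terms/total-terms multiplier, so a document is not penalized for matching only one of the synonyms.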

 

3. Index-time field-level boosting

 

We will discuss boosting a field at query time (making it more important than other fields) below. It is also possible to boost a field at index time; in fact, this boost is applied to every term in the field, rather than to the field itself. To store this boost value in the index without using any more space, the field-level index-time boost is combined with the field-length norm and stored as a single byte. This is the value returned by norm(t,d) in the formula above.
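For reference, a sketch of what an index-time field-level boost looks like in a guide-era mapping (placeholder names; see the warning below before using this):

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type":  "string",
          "boost": 2
        }
      }
    }
  }
}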

 

Note: for the following reasons, we strongly recommend against using field-level index-time boosts:

1) Combining the boost with the field-length norm and storing both in a single byte means the field-length norm loses precision. As a result, Elasticsearch cannot distinguish between a field containing three words and a field containing five words.

2) To change an index-time boost, you must reindex all of your documents. A query-time boost, on the other hand, can be changed with every query.

3) If a field with an index-time boost has multiple values, the boost is multiplied by itself for every value, dramatically increasing the weight of that field.

 

4. Query-time boosting

 

While index-time boosting is not recommended, boosting at query time is far more useful.

If you don't specify a boost on a query clause, the default boost factor is 1.

The query-time boost parameter is used as the t.getBoost() element in the practical scoring function above.
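A minimal sketch of a query-time boost, raising the weight of matches in the title field relative to the content field (index and field names are illustrative):

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "quick brown fox",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "quick brown fox"
          }
        }
      ]
    }
  }
}

The boost of 2 is the value picked up by t.getBoost() for the terms of the title clause.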

 

In practice, reading the explain output is a bit more complex: you won't see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that queryNorm is the same for every term, you will find that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.

 

 

5. Boosting by popularity

 

Imagine that we have a blogging website that allows users to vote for their favorite blog posts. We would like more popular posts to appear higher in the results list, while still keeping full-text relevance as the main driver of the score. We can do this easily by storing the number of votes with each blog post.

 

At search time, we can use the function_score query with the field_value_factor function to combine the number of votes with the full-text relevance score:

GET /blogposts/post/_search
{
  "query": {
    "function_score": {            (1)
      "query": {                   (2)
        "multi_match": {
          "query":  "popularity",
          "fields": [ "title", "content" ]
        }
      },
      "field_value_factor": {      (3)
        "field": "votes"           (4)
      }
    }
  }
}

 

1 - The function_score query wraps the main query and the function we would like to apply.
2 - The main query is executed first.
3 - The field_value_factor function is applied to every document matching the main query.
4 - Every document must have a number in the votes field for function_score to work.

 

In the preceding example, the final score of each document is altered as follows:

new_score = old_score * number_of_votes

 

 

We would prefer the difference between 0 votes and 1 vote to be much bigger than the difference between 10 votes and 11 votes, so the formula is corrected with a logarithm:

new_score = old_score * log(1 + number_of_votes)

The log function smooths out the effect of the votes field.

The impact of the votes can be tuned further by adding a factor:

new_score = old_score * log(1 + factor * number_of_votes)

How the function result is combined with the relevance score can be changed with the boost_mode parameter, which supports the following modes:

multiply - the default: multiply the relevance score by the function result
sum - add the function result to the relevance score
min - take the lower of the relevance score and the function result
max - take the higher of the relevance score and the function result
replace - replace the relevance score with the function result

 

If boost_mode is set to sum with a factor of 0.1, the scoring formula becomes:

new_score = old_score + log(1 + 0.1 * number_of_votes)

Finally, we can cap the maximum effect of the function with the max_boost parameter: with "max_boost": 1.5, the result of the field_value_factor function will never be greater than 1.5, no matter how many votes a post has.
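Putting the pieces together, the full request looks like this; the "modifier": "log1p" setting corresponds to the log(1 + ...) form of the formula above:

GET /blogposts/post/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query":  "popularity",
          "fields": [ "title", "content" ]
        }
      },
      "field_value_factor": {
        "field":    "votes",
        "modifier": "log1p",
        "factor":   0.1
      },
      "boost_mode": "sum",
      "max_boost":  1.5
    }
  }
}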

 

Relevance is a fuzzy target: different people often have different opinions about how documents should be sorted, and without any way of measuring progress it is easy to fall into a cycle of endless tweaking.

 

We encourage you to avoid this (very attractive) behavior and instead measure your results properly: monitor how often your users click the top result, the top 10 results, and the first page; how often they run a secondary query without selecting a result first; how often they click a result and immediately return to the search page; and so on.

 


 

This chapter has provided an overview of tools, and nothing more. You have to use them appropriately to push your search results into the excellent category, and the only way to do that is by rigorously measuring user behavior.

 

 

Related articles: Using Logstash to synchronize MySQL data to Elasticsearch; IK and pinyin analysis; hot updates for custom and stop-word dictionaries

 

Technical cooperation:

            QQ: 281414283

            WeChat: so-so-life

 


Source: www.cnblogs.com/javato/p/11385362.html