[Elasticsearch] Relevance, synonym matching, error correction matching

Table of contents

Relevance

Boolean model

Term Frequency/Inverse Document Frequency (TF/IDF)

Term frequency

Inverse document frequency

Field-length norm

Putting it together

Vector space model

Lucene's practical scoring function

Synonym matching

How synonym queries work

Synonym filter

Error correction matching


Relevance

Lucene (and therefore Elasticsearch) uses a Boolean model to find matching documents and a formula called the practical scoring function to calculate relevance. The formula borrows from term frequency/inverse document frequency (TF/IDF) and the vector space model, and also adds more modern features such as the coordination factor, field-length normalization, and term or query boosting.

Boolean model

The Boolean model simply uses AND, OR, and NOT conditions in the query to find matching documents. For example, the query:

full AND text AND search AND (elasticsearch OR lucene)

will return, as the result set, all documents that contain the words full, text, and search, plus either elasticsearch or lucene.

The process is simple and fast, and it excludes all potentially mismatched documents.
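As a rough sketch (not the exact query Elasticsearch builds internally), the same Boolean condition could be expressed as a bool query against a hypothetical index with a text field:

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "text": "full" }},
        { "match": { "text": "text" }},
        { "match": { "text": "search" }}
      ],
      "should": [
        { "match": { "text": "elasticsearch" }},
        { "match": { "text": "lucene" }}
      ],
      "minimum_should_match": 1
    }
  }
}

The must clauses play the role of AND, while the should clauses with minimum_should_match set to 1 play the role of the (elasticsearch OR lucene) part.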

Term Frequency/Inverse Document Frequency (TF/IDF)

Once a set of documents has been matched, they need to be sorted by relevance. Not all documents contain every term, and some terms are more important than others. A document's relevance score depends, in part, on the weight of each query term that appears in that document.

The weight of a term is determined by three factors, which were introduced in What is Relevance. The formulas below are shown for the curious; you are not required to remember them.

Term frequency

How often does the term appear in the document? The more often, the higher the weight. A field containing five mentions of the same term is more relevant than a field containing just one mention. Term frequency is calculated as follows:

tf(t in d) = √frequency 

The term frequency (tf) of term t in document d is the square root of the number of times the term appears in the document.
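For example, a term that appears nine times in a field gets tf = √9 = 3, while a term that appears only once gets tf = √1 = 1.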

If you don't care how often a term appears in a field but only whether it appears at all, you can disable term frequencies in the field mapping:

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs" 
        }
      }
    }
  }
}

Setting index_options to docs disables term frequencies and term positions. A field with this mapping will not count how many times a term appears, and it cannot be used for phrase or proximity queries. not_analyzed string fields, which are used for exact matching, use this setting by default.

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, because they appear in most documents, while uncommon terms like elastic or hippopotamus help us quickly zoom in on the interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1)) 

The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index divided by the number of documents that contain the term.
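As a quick check (Lucene's classic similarity uses the natural logarithm here): with a single document in the index (numDocs = 1) and one document containing the term (docFreq = 1), idf = 1 + ln(1 / (1 + 1)) ≈ 0.3069, which matches the idf value reported in the explain output later in this article.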

Field-length norm

How long is the field? The shorter the field, the higher the weight. A term appearing in a short field, such as title, carries more weight than the same term appearing in a long field, such as body. The field-length norm is calculated as follows:

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.
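For example, a field containing four terms has a norm of 1/√4 = 0.5, while a field containing 100 terms has a norm of 1/√100 = 0.1. (In the index the norm is compressed into a single byte, so some precision is lost; that is why the explain output later reports a fieldNorm of 0.5 for a three-term field rather than 1/√3 ≈ 0.577.)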

The field-length norm matters for full-text search, but many other fields don't need norms. Norms consume roughly one byte per string field per document in the index, whether or not a document contains that field. Norms are disabled by default for not_analyzed string fields, and they can also be disabled for analyzed fields by changing the field mapping:

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "norms": { "enabled": false } 
        }
      }
    }
  }
}

This field will not take the field-length norm into account; long and short fields will be scored as if they had the same length.

For some application scenarios, such as logging, norms are not useful. All you care about is whether a field contains a particular error code or a particular browser identifier. The length of the field does not affect the outcome, and disabling norms can save a lot of memory.

Putting it together

The three factors above (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time. Together they are used to calculate the weight of a single term in a particular document.

The document referred to in the formulas above is actually a particular field within the document. Each field has its own inverted index, so for TF/IDF purposes the field's values are the document's values.

When we look at the explanation of a simple term query (see explain), we can see that the factors used to calculate the relevance score are exactly the ones introduced above:

PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}

The (abbreviated) explanation from the above request looks like this:

weight(text:fox in 0) [PerFieldSimilarity]:  0.15342641 
result of:
    fieldWeight in 0                         0.15342641
    product of:
        tf(freq=1.0), with freq of 1:        1.0 
        idf(docFreq=1, maxDocs=1):           0.30685282 
        fieldNorm(doc=0):                    0.5 

The final score for the term fox in the text field of the document with internal Lucene doc ID 0.

The term fox appears once in the text field of this document.

The inverse document frequency of fox in the text field across all documents in the index.

The field-length norm for this field.
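Multiplying these three factors together reproduces the reported score: 1.0 × 0.30685282 × 0.5 = 0.15342641.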

Of course, queries usually consist of more than one term, so we need a way of combining the weights of multiple terms: the vector space model.

Vector space model

The vector space model provides a way of comparing a multi-term query against a document: a single score represents how well the document matches the query. To do this, the model represents both the document and the query as vectors.

A vector is actually a one-dimensional array containing multiple numbers, for example:

[1,2,5,22,3,8]

In the vector space model, each number in the vector is the weight of a term, calculated in a way similar to term frequency/inverse document frequency.

While TF/IDF is the default way of calculating term weights for the vector space model, it is not the only way. Other models, such as Okapi BM25, also exist in Elasticsearch. TF/IDF is the default because it is a simple, efficient algorithm that has proven to produce high-quality search results.

Imagine a query for "happy hippopotamus". A common word like happy carries a low weight, while an uncommon word like hippopotamus carries a high weight. Suppose happy has a weight of 2 and hippopotamus has a weight of 5. This two-dimensional vector, [2,5], can be drawn as a line in a coordinate system starting at (0,0) and ending at (2,5), as shown in Figure 27, "A two-dimensional query vector for "happy hippopotamus"".

In practice, only two-dimensional vectors (queries with two terms) can be drawn on a plane. Fortunately, linear algebra, the branch of mathematics that deals with vectors, gives us tools to compare the angle between two multidimensional vectors, which means we can apply the same reasoning to queries with many terms.

See cosine similarity for more information about how to compare two vectors.
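For reference, the cosine similarity between a query vector Vq and a document vector Vd is the standard way of measuring that angle:

cosine(Vq, Vd) = (Vq · Vd) / (|Vq| × |Vd|)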

Lucene's practical scoring function

For multi-term queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them into a single efficient package that collects matching documents and scores them as it goes.
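As a reference sketch, this is the practical scoring function as usually documented for Lucene's classic TF/IDF similarity (newer Elasticsearch versions default to BM25 instead):

score(q,d) = queryNorm(q)
           · coord(q,d)
           · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )   for every term t in query q

Here queryNorm is a query normalization factor, coord rewards documents that contain more of the query terms, and t.getBoost() is any boost applied to the term or query.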

Synonym matching

Synonym matching in ES requires the user to provide a synonym table in the expected format; the table is placed in the index settings when the index is created.

The synonym table can either be written directly as strings inside settings or put into a text file that ES reads.

Synonym lists need to meet the following format requirements:

  1. The A => B,C format

    • At search time the term A is replaced with B and C; B and C themselves are not treated as synonyms of each other.
  2. The A,B,C,D format

How this format behaves depends on expand:

  • When expand == true, it is equivalent to A,B,C,D => A,B,C,D, i.e. A, B, C, and D are all synonyms of each other.

  • When expand == false, it is equivalent to A,B,C,D => A, i.e. all four terms are replaced with A at search time.

For example, the following index definition uses the second format:

PUT /fond_goods
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_whitespace":{
          "tokenizer":"whitespace",
          "filter": ["synonymous_filter"]
        }
      },
      "filter": {
        "synonymous_filter":{
          "type": "synonym",
          "expand": true
          "synonyms": [
            "A, B, C, D"
            ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code":{
        "type": "keyword"
      },
      "context":{
        "type": "text",
        "analyzer": "my_whitespace"
      },
      "color":{
        "type": "text",
        "analyzer": "my_whitespace"
      }
    }
  }
}
  • expand defaults to true.
  • lenient defaults to false. If lenient is true, ES ignores errors while parsing the synonym file; note that only exceptions raised when a synonym cannot be parsed are ignored.
  • synonyms is the synonym list, filled in according to the formats described at the beginning of this section.
  • synonyms can be replaced with synonyms_path, in which case you provide the path of an external file; the file can be a remote URL or a file stored locally.
  • When the format parameter is set to wordnet, synonyms from the WordNet English lexical database can be used.
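As a sketch of the synonyms_path variant (the index name fond_goods_file and the file path analysis/synonyms.txt here are hypothetical; the path is resolved relative to the Elasticsearch config directory), the same filter could be defined like this:

PUT /fond_goods_file
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_whitespace": {
          "tokenizer": "whitespace",
          "filter": ["synonymous_filter"]
        }
      },
      "filter": {
        "synonymous_filter": {
          "type": "synonym",
          "expand": true,
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}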

How synonym queries work

An ES analyzer is mainly composed of character filters, a tokenizer, and token filters.

  • Character Filter
    1. Receives the text as a character stream and can transform it by adding, deleting, or changing characters.
    2. An analyzer can have zero or more character filters.
  • Tokenizer
    1. Splits the text coming out of the character filters into tokens according to certain rules. An analyzer is allowed exactly one tokenizer.
  • Token Filter
    1. Filters the tokens produced by the tokenizer; it can add, delete, and modify tokens, and an analyzer can have multiple token filters.
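A minimal illustration of the three stages using the standalone _analyze API (the html_strip character filter, whitespace tokenizer, and lowercase token filter are just example choices):

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "<b>Quick</b> Brown FOX"
}

The character filter strips the HTML tags, the tokenizer splits on whitespace, and the token filter lowercases the tokens, producing quick, brown, and fox.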

Synonym filter

The key to synonym queries is actually a custom token filter. When the filter receives the tokens produced by the tokenizer, it first reads the synonym table supplied by the user and compares it against those tokens. When a synonym is found, the token filter picks the search terms to use according to the rules configured in the synonym table.

We can use the index created earlier as an experiment: it uses a custom analyzer, my_whitespace, whose tokenizer is the whitespace tokenizer and whose token filter is our custom synonym filter. In other words, the only difference between our custom analyzer and the built-in whitespace analyzer is the token filter, as the following _analyze request shows.
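A minimal check, assuming the fond_goods index above has been created:

GET /fond_goods/_analyze
{
  "analyzer": "my_whitespace",
  "text": "B"
}

With expand set to true, the single input token B should come back expanded to all four synonyms A, B, C, and D, which is exactly what the synonym token filter adds on top of plain whitespace tokenization.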

Error correction matching

Spelling correction in ES is done with the phrase suggester. The phrase suggester adds extra logic on top of the term suggester: instead of suggesting individual tokens, it selects entire corrected phrases, weighted with the help of ngram language models. In practice, the phrase suggester can make better choices based on information such as term co-occurrence and frequency.
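A minimal phrase-suggester sketch against the my_index example from earlier (the misspelling foks is just for illustration, and in practice the documentation recommends running the suggester on a dedicated shingle-analyzed field):

POST /my_index/_search
{
  "suggest": {
    "text": "quick brown foks",
    "simple_phrase": {
      "phrase": {
        "field": "text",
        "size": 1,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}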

Four Suggester types commonly used in ES: Term, Phrase, Completion, Context.

The term suggester, as its name implies, only provides suggestions for individual analyzed terms and does not consider the relationship between terms. The API caller has to pick a word from the options returned for each token, combine them, and send the result back to the front end. Is there a more direct way, where the API returns text similar to what the user typed? Yes, and that is where the phrase suggester comes in.
Building on the term suggester, the phrase suggester considers the relationship between terms, such as whether they occur together in the indexed text, how close to each other they are, and how frequent they are.
The main use case for the completion suggester is auto-completion. In this scenario, every character the user types triggers a query to the backend to find matching candidates, so when the user types quickly the latency requirements on the backend are strict. For this reason its implementation uses a different data structure from the previous two suggesters: instead of an inverted index, the analyzed data is encoded into an FST and stored together with the index. For an open index, ES loads the FST into memory, which makes prefix lookups extremely fast. However, an FST can only be used for prefix lookups, which is also the limitation of the completion suggester (a minimal sketch appears below).
The context suggester completes based on context. It produces better completions, but its performance is worse and it is not widely used. It is also considered an advanced form of spelling correction in ES.
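As referenced above, a minimal completion-suggester sketch (the products index, the suggest field, and the document are hypothetical):

PUT /products
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

PUT /products/_doc/1
{ "suggest": "elasticsearch" }

POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "elas",
      "completion": { "field": "suggest" }
    }
  }
}

The prefix elas is looked up in the FST built from the suggest field and returns elasticsearch as a completion.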


In Lucene, spelling correction is implemented through the SpellChecker class in the suggest module.

The source code defines two public static member variables. DEFAULT_ACCURACY is the default minimum score: SpellChecker scores the similarity between each word in the dictionary and the keyword typed by the user, the default is 0.5, and the score ranges from 0 to 1, where larger means more similar; anything below 0.5 is not considered a match. F_WORD is the default field name used when indexing each line of the dictionary file, and its default value is "word".

Several important APIs:

getAccuracy: returns the accuracy, i.e. the minimum score; the higher the score, the more similar a candidate is to the keyword the user typed.
suggestSimilar: decides which words are judged similar and returns them as an array; this is the core of SpellChecker.
setComparator: sets the comparator. Since similarity implies an ordering, a comparator is needed; it determines the order of the returned suggestions, because the most relevant ones usually need to be shown first.
setSpellIndex: sets the spell-check index directory.
setStringDistance: sets the string-distance (edit distance) implementation.
 
