Table of contents
Term Frequency/Inverse Document Frequency (TF/IDF)
Practical scoring functions for Lucene
Relevance
Lucene (and therefore Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows from term frequency/inverse document frequency and the vector space model, but adds more modern features such as the coordination factor, field-length normalization, and term or query boosting.
Boolean model

The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find matching documents. The following query:

full AND text AND search AND (elasticsearch OR lucene)

will include in the result set only documents that contain all of the words full, text, and search, plus either elasticsearch or lucene.

The process is simple and fast, and it excludes all documents that cannot possibly match.
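The Boolean model can be pictured as set algebra over an inverted index. Here is a toy sketch in Python (an illustration only, not Lucene's implementation; the document IDs are made up):

```python
# Each term maps to the set of document IDs that contain it.
index = {
    "full":          {1, 2, 3},
    "text":          {1, 2, 4},
    "search":        {1, 2, 5},
    "elasticsearch": {1, 6},
    "lucene":        {2, 7},
}

# full AND text AND search AND (elasticsearch OR lucene)
matches = (index["full"] & index["text"] & index["search"]
           & (index["elasticsearch"] | index["lucene"]))
print(sorted(matches))  # only docs 1 and 2 satisfy every condition
```

The intersections and union mirror the AND and OR operators directly, which is why this step is so cheap.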
Term Frequency/Inverse Document Frequency (TF/IDF)
Once a set of documents has been matched, they need to be ranked by relevance. Not all documents contain all the terms, and some terms are more important than others. A document's relevance score depends, in part, on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which were introduced in What is Relevance. The formulas are presented here for interest, but you are not required to remember them.
Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field mentioning the same term five times is more likely to be relevant than a field mentioning it only once. Term frequency is calculated as follows:

tf(t in d) = √frequency

The term frequency (tf) of term t in document d is the square root of the number of times the term appears in the document.
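The formula can be sketched directly in Python (a plain reimplementation of the formula above, not Lucene code):

```python
import math

def tf(frequency):
    """Term frequency: the square root of how often the term appears in the field."""
    return math.sqrt(frequency)

# A field mentioning a term 5 times scores higher than one mentioning it once,
# but sub-linearly: 5 mentions are not 5 times as relevant as 1.
print(tf(1))  # 1.0
print(tf(5))  # ~2.236
```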
If you don't care how often a term appears in a field, but only whether it appears at all, you can disable term frequencies in the field mapping:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs"
        }
      }
    }
  }
}
Setting the index_options parameter to docs disables term frequencies (and term positions); a field with this mapping only answers the question "does this term appear in this field?".
Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms such as and or the contribute little to relevance, because they appear in most documents, while uncommon terms such as elastic or hippopotamus help us quickly narrow the scope down to the interesting documents. Inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is 1 plus the logarithm of the number of documents in the index divided by the number of documents containing the term, plus one.
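Again as a plain-Python sketch of the formula above (illustrative, not Lucene code):

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: the rarer the term, the higher the weight."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# In a hypothetical 100-document index, a word found in 99 documents
# weighs far less than a word found in only 1:
print(idf(100, 99))  # common word, weight 1.0
print(idf(100, 1))   # rare word, much higher weight
```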
Field-length norm

How long is the field? The shorter the field, the higher the weight. A term appearing in a short field such as title carries more weight than the same term appearing in a long field such as body. The field-length norm is calculated as follows:

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.
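Sketching the norm formula in Python (illustrative only; real Lucene additionally compresses this value into a single byte, losing precision):

```python
import math

def field_norm(num_terms):
    """Field-length norm: shorter fields weigh more."""
    return 1 / math.sqrt(num_terms)

# A 3-word title field outweighs a 100-word body field:
print(field_norm(3))    # ~0.577
print(field_norm(100))  # 0.1
```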
The field-length norm is important for full-text search, but many other fields don't need norms. Whether or not a document contains the field, norms occupy about 1 byte per string field per document in the index. Norms are disabled by default for not_analyzed string fields, and can also be disabled for analyzed fields by altering the field mapping:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "norms": { "enabled": false }
        }
      }
    }
  }
}
This field will not take the field-length norm into account: long and short fields are scored as if they had the same length.
For some use cases, such as logging, norms are not useful: the only concern is whether a field contains a particular error code or a specific browser identifier. The length of the field has no bearing on the outcome, and disabling norms can save a significant amount of memory.
Putting it together
These three factors (term frequency, inverse document frequency, and field-length norm) are computed and stored at index time. Together they are used to calculate the weight of a single term in a particular document.
The document referred to in the formulas above is actually a single field within the document. Each field has its own inverted index, so for scoring purposes the TF/IDF of the field is effectively the TF/IDF of the document.
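The combination of the three factors can be sketched in Python. This is an illustrative reimplementation of the formulas above, not Lucene's code; note that real Lucene stores the norm in a single byte, so for a 3-term field like "quick brown fox" the stored norm is 0.5 rather than the exact 1/√3 ≈ 0.577:

```python
import math

def weight(freq, doc_freq, num_docs, num_terms):
    """tf * idf * norm: the per-term weight combining the three factors."""
    tf = math.sqrt(freq)
    idf = 1 + math.log(num_docs / (doc_freq + 1))
    norm = 1 / math.sqrt(num_terms)
    return tf * idf * norm

# Reproducing the "fox" explain example: freq=1, docFreq=1, maxDocs=1,
# and the byte-encoded norm 0.5 that Lucene actually stored:
tf_ = math.sqrt(1.0)
idf_ = 1 + math.log(1 / (1 + 1))
print(tf_ * idf_ * 0.5)  # ~0.15342641, matching the explain output
```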
When we run a simple term query with explain enabled (see explain), we can see that the factors involved in calculating the relevance score are exactly the ones introduced above:
PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}
The (abbreviated) explanation from the preceding request is as follows:
weight(text:fox in 0) [PerFieldSimilarity]:  0.15342641
result of:
    fieldWeight in 0                         0.15342641
    product of:
        tf(freq=1.0), with freq of 1:        1.0
        idf(docFreq=1, maxDocs=1):           0.30685282
        fieldNorm(doc=0):                    0.5
weight(text:fox in 0) is the final score of the term fox in the text field of the document with internal Lucene doc ID 0.

tf(freq=1.0): the term fox appears once in the text field of this document.

idf(docFreq=1, maxDocs=1): the inverse document frequency of fox, given that it appears in one document out of one in the index.

fieldNorm(doc=0): the field-length norm for this field.
Of course, queries usually contain more than one term, so we need a way of combining the weights of multiple terms: the vector space model.
vector space model
The vector space model provides a way to compare multi-word queries. A single score represents how well a document matches a query. To do this, the model represents both documents and queries in the form of vectors :
A vector is actually a one-dimensional array containing multiple numbers, for example:
[1,2,5,22,3,8]
In the vector space model, each number in the vector is the weight of a term, calculated in a way similar to term frequency/inverse document frequency.
Although TF/IDF is the default way for the vector space model to calculate term weights, it is not the only way. Elasticsearch supports other models, such as Okapi BM25. TF/IDF is the default because it is a simple, efficient, battle-tested algorithm that produces high-quality search results.
Imagine that you query for "happy hippopotamus". A common word like happy will have a low weight, while an uncommon word like hippopotamus will have a high weight. Suppose happy has a weight of 2 and hippopotamus has a weight of 5. This two-dimensional vector, [2,5], can be plotted as a line in a coordinate system starting at point (0,0) and ending at point (2,5), as shown in Figure 27, "A two-dimensional query vector for 'happy hippopotamus'".
In practice, only two-dimensional vectors (two-term queries) can be plotted on a plane. Fortunately, linear algebra, the branch of mathematics dealing with vectors, gives us tools to calculate the angle between two multidimensional vectors, which means multi-term queries can be treated in just the same way.
See cosine similarity for more information about comparing two vectors.
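Cosine similarity can be sketched in a few lines of Python (the document vectors here are made-up examples):

```python
import math

def cosine_similarity(a, b):
    """Compare two vectors by the angle between them: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag

query = [2, 5]      # weights for "happy hippopotamus"
doc_both = [2, 5]   # a document containing both words
doc_happy = [2, 0]  # a document containing only "happy"

print(cosine_similarity(query, doc_both))   # ~1.0: best match
print(cosine_similarity(query, doc_happy))  # lower: partial match
```

Because only the angle matters, a long document and a short document pointing in the same "direction" score identically, which is exactly why the separate field-length norm is needed.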
Practical scoring functions for Lucene
For multi-term queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them into a single efficient package that collects matching documents and scores them as it goes.
Synonym matching

Synonym search in ES requires the user to provide a synonym table in the appropriate format; the table is placed in the index settings when the index is created. The synonym table can either be written directly into settings as strings, or placed in a text file that ES reads.
Synonym lists need to meet the following format requirements:
- A => B,C format

  When searching, the term A is replaced by B and C; B and C are not treated as synonyms of each other.

- A,B,C,D format

  How this format scores depends on the expand setting:

  - When expand == true, the rule is equivalent to A,B,C,D => A,B,C,D, i.e. A, B, C, and D are all synonyms of each other.

  - When expand == false, the rule is equivalent to A,B,C,D => A, i.e. the four terms A, B, C, and D are all replaced by A when searching.
PUT /fond_goods
{
"settings": {
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_whitespace":{
"tokenizer":"whitespace",
"filter": ["synonymous_filter"]
}
},
"filter": {
"synonymous_filter":{
"type": "synonym",
"expand": true
"synonyms": [
"A, B, C, D"
]
}
}
}
},
"mappings": {
"properties": {
"code":{
"type": "keyword"
},
"context":{
"type": "text",
"analyzer": "my_whitespace"
},
"color":{
"type": "text",
"analyzer": "my_whitespace"
}
}
}
}
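The two behaviours of an "A,B,C,D" rule can be modelled with a toy sketch (plain Python, not ES code; the token lists are made-up examples):

```python
def apply_synonyms(tokens, rule, expand=True):
    """Toy model of the synonym token filter for an "A,B,C,D" rule.

    expand=True : each listed word expands to all of them (A,B,C,D => A,B,C,D)
    expand=False: each listed word is replaced by the first (A,B,C,D => A)
    """
    group = [w.strip() for w in rule.split(",")]
    out = []
    for tok in tokens:
        if tok in group:
            out.extend(group if expand else [group[0]])
        else:
            out.append(tok)
    return out

print(apply_synonyms(["red", "B"], "A, B, C, D", expand=True))
print(apply_synonyms(["red", "B"], "A, B, C, D", expand=False))
```

With expand=True a search for B matches documents containing any of A, B, C, or D; with expand=False both the indexed terms and the query collapse to A.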
- expand: defaults to true.

- lenient: defaults to false. If lenient is true, ES ignores errors while parsing the synonym configuration. Note that only exceptions raised because a synonym rule cannot be parsed are ignored.

- synonyms: the list of synonym rules, written according to the formats described at the beginning of this section. synonyms can be replaced by synonyms_path, in which case you supply the path of an external file, which may be a web resource or a file stored locally.

- format: when the parameter value is wordnet, synonyms can be supplied in the format of the WordNet lexical database.
Synonym query principle

An ES analyzer is composed of character filters, a tokenizer, and token filters.

- Character filter
  - Receives the text as a stream of characters and can transform it by adding, deleting, or changing characters.
  - An analyzer may have zero or more character filters.
- Tokenizer
  - Splits the character-filtered text into tokens according to certain rules. An analyzer has exactly one tokenizer.
- Token filter
  - Post-processes the tokens produced by the tokenizer; it can add, delete, or modify tokens, and an analyzer may have multiple token filters.
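The three stages above can be sketched as a simple pipeline (the specific filters here are illustrative stand-ins, not ES's built-in implementations):

```python
import re

def char_filter(text):
    # Character filter stage: transform the raw character stream.
    return text.replace("&", " and ")

def whitespace_tokenizer(text):
    # Tokenizer stage: split the filtered text into tokens.
    return [t for t in re.split(r"\s+", text) if t]

def lowercase_token_filter(tokens):
    # Token filter stage: post-process individual tokens.
    return [t.lower() for t in tokens]

def analyze(text):
    return lowercase_token_filter(whitespace_tokenizer(char_filter(text)))

print(analyze("Quick & Brown"))  # ['quick', 'and', 'brown']
```

A synonym filter would slot in as another token-filter stage, emitting extra or replacement tokens.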
Synonym filter

The key to synonym queries is the custom token filter. When the filter receives the token stream produced by the tokenizer, it first reads the user's stored synonym rules and compares them against the incoming tokens. When a synonym is found, the token filter emits the search terms dictated by the rules configured in the synonym list.

We can use the index above as an experiment: it uses a custom analyzer, my_whitespace, whose tokenizer is the whitespace tokenizer and whose token filter is our custom synonym filter. The only difference between this custom analyzer and the built-in whitespace analyzer is the token filter.
Error-correction matching

Spelling correction in ES uses the phrase suggester. The phrase suggester adds logic on top of the term suggester to select entire corrected phrases, rather than individual weighted tokens, based on ngram language models. In practice this lets the phrase suggester make better choices using information such as word co-occurrence frequencies.
ES commonly provides four suggester types: term, phrase, completion, and context.
The term suggester, as its name implies, provides suggestions based only on individual analyzed terms, without considering the relationships between them. The API caller has to pick a word from the options for each token, combine them, and return the result to the user's front end. Is there a more direct way, where the API itself returns text similar to what the user typed? Yes: this is where the phrase suggester comes in.
Building on the term suggester, the phrase suggester considers the relationships between terms, such as whether they co-occur in the indexed text, how close together they appear, and their frequencies.
The completion suggester's main use case is auto-completion. In this scenario, every character the user types sends a query to the backend to find matching items, so when users type quickly, the backend's response-time requirements are strict. For this reason its implementation uses a different data structure from the previous two suggesters: instead of relying on the inverted index, the analyzed data is encoded into an FST and stored alongside the index. For an open index, ES loads the FST into memory, making prefix lookups extremely fast. But an FST supports only prefix lookups, which is the completion suggester's main limitation.
The context suggester completes input based on context. It produces better completions but performs worse, and is not widely used. It can be considered an advanced form of spelling correction in ES.
Spelling correction is implemented by the SpellChecker class under Lucene's suggest module.

Its source code defines two public static member variables. DEFAULT_ACCURACY is the default minimum score: SpellChecker scores the similarity between each word in the dictionary and the user's search keyword, with scores ranging from 0 to 1 (larger means more similar), and with the default of 0.5, anything scoring below 0.5 is not considered a match. F_WORD is the default field name used when indexing each row of the dictionary file; its default value is "word".
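Lucene's default string distance is based on Levenshtein edit distance. Here is a simplified sketch of how a 0-to-1 similarity score and the 0.5 accuracy cutoff might interact (an illustrative model, not SpellChecker's exact scoring):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Score in [0, 1]; one common normalization of edit distance."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))

DEFAULT_ACCURACY = 0.5
# A transposed pair of letters scores well above the cutoff...
print(similarity("elasticsearch", "elasticsaerch"))
# ...while an unrelated word falls below it and is discarded.
print(similarity("fox", "elephant") >= DEFAULT_ACCURACY)
```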
Several important APIs:

- getAccuracy: the accuracy, meaning the minimum score; the higher a word's score, the more similar it is to the user's input.
- suggestSimilar: determines which words are judged similar and returns them as an array. This is the core of SpellChecker.
- setComparator: sets the comparator. Similarity implies an ordering, and an ordering needs a comparator: it determines the order of the returned suggestions, since the most relevant words should usually be shown first.
- setSpellIndex: sets the spell-check index directory.
- setStringDistance: sets the string-distance (edit-distance) measure.