[Backend Tutorial] What can I do when Elasticsearch retrieves a document but does not highlight it correctly?

1. The problem
A real-world question from the WeChat group:

Brothers, in ES:

A keyword-type field has the value 123asd456. When I query sd4, the highlighted result is <em>123asd456</em>.

Is there a way to highlight only the sd4 I queried?

I clearly query only part of the id, but the entire id string is highlighted. What should I do?

Deadpool Elasticsearch Technology WeChat Group

2. A demo that states the problem clearly
Note: the sample DSL in this article runs as-is on version 7.2; versions 6.X and earlier may require minor adjustments.

PUT findex
{
  "mappings": {
    "properties": {
      "aname": {
        "type": "text"
      },
      "acode": {
        "type": "keyword"
      }
    }
  }
}

POST findex/_bulk
{"index": {"_id": 1}}
{"acode": "160213.OF", "aname": "X Tainasdaq 100"}
{"index": {"_id": 2}}
{"acode": "160218.OF", "aname": "X Thailand Certificate Real Estate"}

POST findex/_search
{
  "highlight": {
    "fields": {
      "acode": {}
    }
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "acode": "*1602*"
          }
        }
      ]
    }
  }
}
The highlight section of the search result is:

"highlight" : {
  "acode" : [
    "<em>160213.OF</em>"
  ]
}
That is, the entire string is highlighted, which does not meet expectations.

Actual requirement: searching for 1602 should recall the relevant documents 160213.OF and 160218.OF, and only the searched substring 1602 should be highlighted.

3. Breaking down the problem
The wildcard query was chosen to solve substring matching; its behavior is similar to MySQL's like fuzzy matching.
Conventional text analyzers (the Chinese analyzer ik, the English analyzers english and standard, and so on) cannot solve this substring-matching problem.
The actual business requirements are twofold:

On the one hand: entering a substring must recall the documents containing the whole string;

On the other hand: only the searched substring must be highlighted.

Only one tokenizer fits both requirements: Ngram!

4. What is Ngram?
4.1 Definition of Ngram
Ngram is an algorithm based on a statistical language model.

The basic idea of Ngram is to slide a window of size N over the text, producing a sequence of fragments of length N. Each fragment is called a gram. The occurrence frequency of all grams is counted and filtered against a preset threshold to form a list of key grams, which constitutes the vector feature space of the text; each distinct gram in the list is one dimension of that feature vector.

The model rests on the assumption that the occurrence of the Nth word depends only on the preceding N-1 words and on nothing else, so the probability of a whole sentence is the product of the conditional probabilities of its words.

These probabilities can be obtained by directly counting how often N words occur together in a corpus. The most commonly used variants are the Bi-Gram (N=2) and the Tri-Gram (N=3).

4.2 Ngram example
Chinese sentence: "你今天吃饭了吗" ("Have you eaten today?"). Its Bi-Gram segmentation result is:

你今
今天
天吃
吃饭
饭了
了吗
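The sliding-window idea can be sketched in a few lines of Python (a minimal illustration of the gram-extraction step only, not Elasticsearch code):

```python
def ngrams(text, n=2):
    """Slide a window of size n over text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Bi-grams of a fund-code fragment work the same way as the Chinese example:
print(ngrams("1602", 2))  # ['16', '60', '02']
```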
4.3 Ngram application scenarios
Scenario 1: text compression, spell checking, accelerating string search, language identification of documents.
Scenario 2: automation applications in natural language processing, such as automatic classification, automatic indexing, automatic generation of hyperlinks, document retrieval, and segmentation of languages written without delimiters.
Scenario 3: automatic text classification. For Elasticsearch the application is clearer still: segmenting delimiter-free text to improve search efficiency (compared with wildcard and regexp queries).
5. Hands-on practice
PUT findex_ext
{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "aname": {
        "type": "text"
      },
      "acode": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

POST findex_ext/_bulk
{"index": {"_id": 1}}
{"acode": "160213.OF", "aname": "X Tainasdaq 100"}
{"index": {"_id": 2}}
{"acode": "160218.OF", "aname": "X Thailand Certificate Real Estate"}

View the word segmentation results:

POST findex_ext/_analyze
{
  "analyzer": "my_analyzer",
  "text": "160213.OF"
}
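For intuition, the tokenizer defined above can be approximated in Python (a sketch of the behavior, not the actual Lucene implementation, and the order in which Elasticsearch emits tokens may differ): the text is first split into runs of letters and digits, so "." acts as a separator, and each run is then expanded into all grams of length min_gram to max_gram.

```python
import re

def ngram_tokenize(text, min_gram=4, max_gram=10):
    """Approximate an ngram tokenizer with token_chars: letter, digit.
    Split on any other character, then from each run emit every gram
    whose length lies between min_gram and max_gram."""
    tokens = []
    for run in re.findall(r"[A-Za-z0-9]+", text):
        for n in range(min_gram, max_gram + 1):
            for i in range(len(run) - n + 1):
                tokens.append(run[i:i + n])
    return tokens

print(ngram_tokenize("160213.OF"))
# ['1602', '6021', '0213', '16021', '60213', '160213']
# "OF" is shorter than min_gram (4), so it produces no tokens.
```

Because "1602" is one of the indexed grams, a match_phrase query for "1602" hits the document, and only that gram is highlighted.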

POST findex_ext/_search
{
  "highlight": {
    "fields": {
      "acode": {}
    }
  },
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "acode": {
              "query": "1602"
            }
          }
        }
      ]
    }
  }
}
Note the three core parameters:

min_gram: the minimum gram length, default 1.
max_gram: the maximum gram length, default 2.
token_chars: the character classes kept in the generated tokens; by default all characters are kept. In the example above, digits and letters are kept. If only "letter" were specified, digits would be treated as separators and only the letter runs, such as "OF", would remain in the tokenization result.
The snippet of the returned result is as follows:

"highlight" : {
      "acode" : [
        "<em>1602</em>13.OF"
      ]
    }

This meets both requirements: retrieval and highlighting.
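To see the effect of token_chars concretely, the run-splitting step can be illustrated in Python (an illustration of the splitting rule only; the regex character classes below mirror, but are not, the Elasticsearch settings):

```python
import re

def char_runs(text, pattern):
    """Split text into runs of allowed characters; everything else separates."""
    return re.findall(pattern, text)

# token_chars ["letter", "digit"]: only '.' acts as a separator
print(char_runs("160213.OF", r"[A-Za-z0-9]+"))  # ['160213', 'OF']

# token_chars ["letter"] only: digits also separate, leaving just the letters
print(char_runs("160213.OF", r"[A-Za-z]+"))     # ['OF']
```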

6. How to choose
Bear in mind the essence of Ngram: it trades space for time. Matching works only because the grams were already generated, according to min_gram and max_gram, at write time.
If the data volume is small and substring highlighting is not required, a keyword field may be enough.
If the data volume is large and substring highlighting is required, the recommendation is Ngram tokenization combined with match or match_phrase retrieval.
With large data volumes, never use wildcard queries with a leading wildcard!
Reason: the DFA (Deterministic Finite Automaton) built from a pattern containing wildcards can be complex and expensive to evaluate, and can even bring down a production environment.
Uncle Wood has also stressed this many times: avoid a leading wildcard in wildcard queries; if you really must use one, limit the length of the user-supplied string.
7. Summary
Starting from a problem encountered online, this article worked through the principles and usage of Ngram and pointed out the business scenarios where wildcard and Ngram each apply. I hope it inspires and helps you in practice!

Do you encounter substring matching and highlighting in your business? How do you segment and search? Welcome to leave a message to discuss.


Origin blog.csdn.net/weixin_47143210/article/details/105628365