What to do when Elasticsearch 8.X can't handle a complex tokenization requirement?

1. The practical problem

Question from a community member: I want to drop all purely numeric tokens from the analysis results. The method from the official documentation doesn't seem to work with the ik tokenizer!

[Screenshot: the approach from the official documentation]

Is there another way? ChatGPT suggested matching the tokens with a regular expression, but my test didn't seem to work. My Elasticsearch version is 8.5.3.

2. A more precise description of the problem after further discussion

The text to be analyzed might be an address like "北京市海淀区清华园10栋105" ("105, Building 10, Tsinghua Garden, Haidian District, Beijing"), and the ik_smart tokenization result is: "Beijing", "Haidian District", "Tsinghua Garden", "Building 10", and "105".

[Screenshot: ik_smart tokenization result]

What the user expects: simply exclude the purely numeric tokens after tokenization. In other words, the final result should be: "Beijing", "Haidian District", "Tsinghua Garden", and "Building 10".

To put it more precisely: "Building 10" (10栋) is a meaningful token that the user wants to keep searchable, while "105" on its own carries little meaning, so purely numeric tokens like "105" should be removed at the analysis stage.
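To reproduce the default behavior for comparison, here is a minimal sketch using the _analyze API (it assumes the ik analysis plugin is installed; the exact tokens may vary slightly with the ik version and dictionary):

GET /_analyze
{
  "tokenizer": "ik_smart",
  "text": "北京市海淀区清华园10栋105"
}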

3. Discussion on solutions

Is there an existing analyzer that meets this need out of the box? So far, no!

So what can we do? The only option is a custom analyzer. As discussed before, a custom analyzer is composed of the three parts shown in the figure below.

[Figure: the three components of a custom analyzer: Character Filter, Tokenizer, Token Filter]

The meaning of the three parts is as follows; reading it alongside the figure above makes it easier to understand.

  • Character Filter: processes the raw text before tokenization, e.g., removing HTML tags or replacing specific characters.

  • Tokenizer: defines how the text is split into terms or tokens, e.g., splitting on spaces or punctuation.

  • Token Filter: further processes the tokens emitted by the Tokenizer, e.g., lowercasing, removing stop words, or adding synonyms.
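For a concrete feel of how the three parts chain together, here is a minimal sketch that combines three built-in components in a single _analyze call: the html_strip character filter, the standard tokenizer, and the lowercase token filter (the sample text is made up for illustration):

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>Hello Elasticsearch</p>"
}

The character filter strips the <p> tags first, the tokenizer then splits the remaining text into words, and the token filter finally lowercases each token.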

The difference between Character Filter and Token Filter is as follows:

Both are text-processing components in Elasticsearch, but they differ in when they run and what they act on:

  • Processing time: the Character Filter runs before the Tokenizer; the Token Filter runs after it.

  • Object acted on: the Character Filter works on the raw character sequence; the Token Filter works on terms (tokens).

  • Main function: the Character Filter preprocesses the text, e.g., removing HTML or converting specific characters; the Token Filter processes terms, e.g., lowercasing, removing stop words, applying synonyms, or stemming.

  • Output: the Character Filter emits a modified character sequence; the Token Filter emits a processed list of terms.

Essential difference: the Character Filter works at the raw character level, while the Token Filter works at the token level, after tokenization.

Returning to the user's need: the goal is to remove the purely numeric tokens after tokenization. That means post-tokenization processing with a Token Filter is the better solution.

Which Token Filter should handle it? Consider treating the numbers uniformly with a regular expression; the regular expression for a purely numeric token is: "^[0-9]+$".

^[0-9]+$ can be broken down into several parts for interpretation:

  • ^: This symbol indicates the starting position of the match. That is, the matched content must start from the beginning of the target string.

  • [0-9]: This is a character class. It matches any one digit character from 0 to 9.

  • +: This is a quantifier. It indicates that the preceding content (in this case the [0-9] character class) must occur one or more times.

  • $: This symbol indicates the end position of the match. That is, the matched content must reach the end of the target string.

So, overall, the meaning of this regular expression is: the string contains only one or more numeric characters from the beginning to the end, and no other characters.

For example:

  • "123" matches this regex.

  • "0123" also matches.

  • Neither "abc", "123a", or "a123" match.

In short, this regular expression meets the user's need.

During implementation we found the matching building block: the pattern_replace token filter. It performs replacement within each token, so we can replace a token that matches the regex with a chosen string, for example the empty string "".
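To see the pattern_replace token filter in isolation, here is a minimal sketch using the built-in whitespace tokenizer (the sample text is contrived so that each part becomes its own token). The purely numeric token matches the regex and is replaced with an empty string, while mixed tokens pass through untouched:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "^[0-9]+$",
      "replacement": ""
    }
  ],
  "text": "北京市 10栋 105"
}

Note that the match is replaced, not removed, so a zero-length token remains in the stream.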

However, the requirement is still not fully met: the user does not want those leftover empty tokens either. So the next question is how to remove the "" empty tokens.

Checking the official token filter documentation, there is a length token filter, which can drop the zero-length (empty) tokens by setting the minimum length to 1.
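As a quick illustration of the length token filter on its own, here is a hedged sketch with the built-in whitespace tokenizer and a made-up English sample sentence; setting min to 3 drops every token shorter than three characters, and in our case min set to 1 drops only the zero-length tokens:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "length",
      "min": 3
    }
  ],
  "text": "es is a search engine"
}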

With that, the plan takes shape.

4. Finalizing and initially verifying the solution

Based on the discussion above, we have a three-step strategy.

  • Step 0: Keep ik_smart as the tokenizer, since it closely matches the user's needs.

  • Step 1: Catch the purely numeric tokens with the pattern_replace filter. ==> Tokens matching the regular expression ^[0-9]+$ are replaced with the empty string "".

  • Step 2: Remove the empty tokens with the length filter. ==> minimum length of 1

Verify with a quick test:

GET /_analyze
{
  "tokenizer": "ik_smart",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "^[0-9]+$",
      "replacement": ""
    },
    {
      "type": "length",
      "min": 1
    }
  ],
  "text": "11111111北京市10522222海淀区1053333清华园10栋105"
}

Even with this deliberately messier input text, the tokenization result still meets expectations.

[Screenshot: _analyze result for the test above]

5. Implementing the custom analyzer in practice

With the preliminary verification above, defining the custom analyzer becomes easy.

DELETE my-index-20230811-000002
PUT my-index-20230811-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "regex_process",
            "remove_length_lower_1"
          ]
        }
      },
      "filter": {
        "regex_process": {
          "type": "pattern_replace",
          "pattern": "^[0-9]+$",
          "replacement": ""
        },
        "remove_length_lower_1": {
          "type": "length",
          "min": 1
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address":{
        "type":"text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my-index-20230811-000002/_analyze
{
  "text": [
    "1111北京市3333海淀区444444清华园10栋105"
  ],
  "analyzer": "my_custom_analyzer"
}

The index definition is interpreted as follows:

  • settings > analysis > analyzer > my_custom_analyzer: uses the ik_smart tokenizer and the filters regex_process and remove_length_lower_1.

  • settings > analysis > filter > regex_process: type pattern_replace; matches tokens consisting only of digits and replaces them with an empty string.

  • settings > analysis > filter > remove_length_lower_1: type length; ensures that only terms with a length of at least 1 are kept.

  • mappings > properties > address: type text; analyzed with my_custom_analyzer.

The purpose of this configuration is a custom analyzer that handles Chinese text, replaces purely numeric tokens with empty tokens, and then keeps those empty tokens out of the analysis results.

The final result is as follows, and the expected effect is achieved.

[Screenshot: tokenization result with my_custom_analyzer]
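To double-check behavior at query time as well, here is a hedged end-to-end sketch (the document id and query strings are made up for illustration): index one address into the index created above and run two match queries against the address field. The purely numeric query is analyzed into nothing and therefore matches no documents, while the query containing "清华园" matches the document.

PUT my-index-20230811-000002/_doc/1?refresh=true
{
  "address": "北京市海淀区清华园10栋105"
}

POST my-index-20230811-000002/_search
{
  "query": {
    "match": {
      "address": "105"
    }
  }
}

POST my-index-20230811-000002/_search
{
  "query": {
    "match": {
      "address": "清华园"
    }
  }
}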

6. Summary

When the default analyzers cannot meet a specific, complex need, remember there is another trick: a custom analyzer.

Keep the three parts of a custom analyzer in mind, break the complex requirement down into those parts, and the problem gets solved.

A video walkthrough is also available.

Feel free to follow my video channel, where I share advanced Elasticsearch content from time to time!

7. References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern_replace-tokenfilter.html

Recommended reading

  1. First release on the entire web! From 0 to 1: an Elasticsearch 8.X walkthrough video

  2. Heavyweight | A die-hard Elasticsearch 8.X methodology checklist

  3. How to systematically learn Elasticsearch?

  4. 2023, do something

  5. Elasticsearch custom word segmentation, start with a question

  6. The details of the synonym step in Elasticsearch custom analysis are not easy to understand...


Get more practical content in less time!

Improve together with 2000+ Elastic enthusiasts around the world!


In the era of large models, pick up advanced know-how one step ahead!


Original post: https://blog.csdn.net/wojiushiwo987/article/details/132255664