A problem encountered with IK word segmentation search in Elasticsearch

scenario description

Product search feature of an e-commerce application

Elasticsearch version: 7.1.17

Product data is stored in an index named goods

The goods_name field used for searching has type text, with the IK tokenizer configured in ik_smart mode:

{
  "mappings": {
    "properties": {
      "goods_name": {
        "type": "text",
        "analyzer": "ik_smart"
      }
    }
  }
}

The custom IK dictionary contains the following entries:

科颜氏
科颜氏高保湿

The goods index currently contains the following two documents:

goods_id  goods_name
1         科颜氏面霜 (Kiehl's face cream)
2         科颜氏高保湿精粹爽肤水 (Kiehl's High Moisturizing Essence Toner)

When the program searches with the keyword 科颜氏, ES hits only the document with goods_id 1, whose goods_name is 科颜氏面霜 (Kiehl's face cream).

At this point, both documents, 科颜氏面霜 and 科颜氏高保湿精粹爽肤水, are expected to be hit, but only one row is actually hit.

When the program searches with the keyword 科颜氏高保湿, ES hits the document with goods_id 2, whose goods_name is 科颜氏高保湿精粹爽肤水 (Kiehl's High Moisturizing Essence Toner).
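The article does not show the query the program issues; a minimal match query against this mapping might look like the following sketch:

```json
GET /goods/_search
{
  "query": {
    "match": {
      "goods_name": "科颜氏"
    }
  }
}
```

With a match query, the search text is analyzed first, and documents are hit only if their indexed tokens overlap with the analyzed search tokens — which is exactly where the problem below arises.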

problem analysis

Since the analyzer configured for goods_name is ik_smart, and the two entries 科颜氏 and 科颜氏高保湿 are both configured in the IK dictionary, the ES tokenization result for 科颜氏高保湿精粹爽肤水 is as follows:

[image: ik_smart tokenization result for 科颜氏高保湿精粹爽肤水]

As the result shows, because the dictionary contains the standalone entry 科颜氏高保湿, the token 科颜氏 does not appear in the ik_smart segmentation result; only the longer, more precise 科颜氏高保湿 does. As a consequence, a query for 科颜氏 cannot hit this document.
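This tokenization can be reproduced with Elasticsearch's _analyze API (a sketch; the actual tokens depend on the IK dictionary configured on your cluster):

```json
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "科颜氏高保湿精粹爽肤水"
}
```

With the dictionary above, ik_smart keeps the longest match 科颜氏高保湿 and does not emit the shorter overlapping token 科颜氏, which is why the query misses.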

solution

First, let's clarify the requirement: a broader keyword should hit as many documents as possible, while a narrower keyword should hit more precise documents. So how do we solve this problem?

IK provides two segmentation strategies, ik_smart and ik_max_word. Let's look at the segmentation effect of each:

ik_smart

Minimal splitting (coarsest granularity)

[image: ik_smart tokenization result]

ik_max_word

Maximum-granularity splitting

[image: ik_max_word tokenization result]

As we can see, with the ik_smart strategy the segmentation result contains as few tokens as possible, while with ik_max_word we get maximum granularity, i.e. as many token splits as possible.

Back to our requirement: we want the search term to hit more documents, so we need to do two things:

  1. Split the search term at minimum granularity, i.e. use the ik_smart strategy, so that 科颜氏 is tokenized as 科颜氏
  2. Split the indexed data at maximum granularity, i.e. use the ik_max_word strategy, so that the tokenization of 科颜氏高保湿精粹爽肤水 contains both 科颜氏 and 科颜氏高保湿, making this document hittable in as many cases as possible
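These two steps can be combined in a single mapping by setting ik_max_word as the index-time analyzer and ik_smart as the search-time analyzer, via the search_analyzer mapping parameter (a sketch based on the mapping shown earlier):

```json
PUT /goods
{
  "mappings": {
    "properties": {
      "goods_name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

Note that analyzer settings on an existing text field cannot be changed in place; in practice this usually means creating a new index with this mapping and reindexing the data into it.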

follow-up improvements

Because the indexed data is split at maximum granularity under the ik_max_word strategy, some low-relevance results may appear when a search term is not configured in the dictionary; for example, a document whose name merely contains a single character from the search term may also be returned.

To address this, one option is to maintain as complete a dictionary as possible; another is to exclude low-scoring documents. This part remains for further study.
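One simple way to exclude low-scoring documents is Elasticsearch's min_score search parameter, which drops hits below a score threshold (the value 0.5 here is purely illustrative and would need tuning against real data):

```json
GET /goods/_search
{
  "min_score": 0.5,
  "query": {
    "match": {
      "goods_name": "科颜氏"
    }
  }
}
```

Because BM25 scores are not normalized, a fixed threshold is fragile across queries; tuning it per use case, or improving the dictionary, is usually the more robust path.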


Origin: juejin.im/post/7229258224910204989