Scenario
Product search in an e-commerce application.
Elasticsearch version: 7.1.17
Product data is stored in an index named goods.
The goods_name field used for searching is of type text, configured with the ik analyzer in ik_smart mode:
{
  "mappings": {
    "properties": {
      "goods_name": {
        "type": "text",
        "analyzer": "ik_smart"
      }
    }
  }
}
The ik custom dictionary contains the following entries:
科颜氏
科颜氏高保湿
The goods index currently contains the following two documents:

| goods_id | goods_name |
|---|---|
| 1 | 科颜氏面霜 (Kiehl's face cream) |
| 2 | 科颜氏高保湿精粹爽肤水 (Kiehl's High Moisturizing Essence Toner) |
Searching with the keyword 科颜氏 hits only the document with goods_id 1 (goods_name 科颜氏面霜). The expectation was to hit both the 科颜氏面霜 and 科颜氏高保湿精粹爽肤水 documents, but only one row is actually returned.
Searching with the keyword 科颜氏高保湿 hits the document with goods_id 2 (goods_name 科颜氏高保湿精粹爽肤水).
Problem analysis
Since goods_name is configured with the ik_smart segmentation mode, and both 科颜氏 and 科颜氏高保湿 are configured in the ik dictionary, the ik_smart segmentation result for 科颜氏高保湿精粹爽肤水 contains 科颜氏高保湿 but not 科颜氏.
Because the dictionary configures 科颜氏高保湿 as a standalone entry, the coarse-grained ik_smart mode keeps only this more precise token; 科颜氏 never appears in the segmentation result, so a query for 科颜氏 cannot hit this document.
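The lopsided segmentation can be verified with Elasticsearch's _analyze API (a sketch; the exact token list depends on your dictionary and ik version):

```json
GET /goods/_analyze
{
  "analyzer": "ik_smart",
  "text": "科颜氏高保湿精粹爽肤水"
}
```

In the response, the tokens array contains 科颜氏高保湿 but no standalone 科颜氏 token; running the same request with "analyzer": "ik_max_word" returns both.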
Solution
First, let's clarify the requirement: a broader keyword should hit as many documents as possible, while a narrower keyword should hit more precise documents. So how do we solve this?
ik offers two segmentation strategies: ik_smart and ik_max_word. Comparing their effects:

- ik_smart: minimal segmentation (fewest splits)
- ik_max_word: maximum-granularity segmentation

As we can see, the ik_smart strategy produces as few tokens as possible, while ik_max_word gives the maximum-granularity result, i.e. as many token splits as possible.
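To make the difference concrete, here is a toy Python sketch of the two strategies as dictionary-based matching. This is a simplification, not ik's actual algorithm, and the entries 精粹 and 爽肤水 are assumed for illustration:

```python
# Toy dictionary; 科颜氏 / 科颜氏高保湿 come from the ik config above,
# 精粹 / 爽肤水 are assumed entries for illustration.
DICT = {"科颜氏", "科颜氏高保湿", "精粹", "爽肤水"}

def smart_segment(text, dictionary):
    """Greedy longest-match, left to right: a rough stand-in for ik_smart."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no dictionary match: emit a single character
            i += 1
    return tokens

def max_word_segment(text, dictionary):
    """Collect every dictionary match: a rough stand-in for ik_max_word."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in dictionary]

phrase = "科颜氏高保湿精粹爽肤水"
print(smart_segment(phrase, DICT))     # the longer entry wins; no 科颜氏 token
print(max_word_segment(phrase, DICT))  # contains both 科颜氏 and 科颜氏高保湿
```

With greedy longest-match, 科颜氏高保湿 swallows 科颜氏, which is exactly why the document cannot be found; collecting all matches keeps both tokens.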
Back to our requirement: we want the search term to hit more results, which means doing two things:

- Split the search term at minimum granularity, i.e. use the ik_smart strategy, ensuring that 科颜氏 is segmented as the single token 科颜氏.
- Split the data in the index at maximum granularity, i.e. use the ik_max_word strategy, so that the segmentation of 科颜氏高保湿精粹爽肤水 contains both 科颜氏 and 科颜氏高保湿, letting this document be hit as much as possible.
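The two points above map directly onto Elasticsearch's per-field analyzer settings: analyzer is applied at index time and search_analyzer at query time. A sketch of the adjusted mapping (note that changing the analyzer of an existing field requires reindexing the data):

```json
{
  "mappings": {
    "properties": {
      "goods_name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```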
Further improvements
Since data in the index is split at maximum granularity under the ik_max_word strategy, some low-relevance results may show up when a search term is not configured in the dictionary; for example, a document whose name merely contains a single character from the search term may also be returned.
To address this, we can keep the dictionary as complete as possible, or exclude low-scoring documents; I will study this part further in a follow-up.
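As one starting point for excluding low-scoring documents, Elasticsearch supports a min_score parameter in the search body; the threshold 0.5 below is an arbitrary illustration, since _score scales vary with the query and corpus:

```json
GET /goods/_search
{
  "min_score": 0.5,
  "query": {
    "match": { "goods_name": "科颜氏" }
  }
}
```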