Contents

1. Overview
2. Term Suggester
3. Phrase Suggester
4. Completion Suggester

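1. Overview

What a suggester is for: given the text a user has typed, it returns terms that are likely to be what the user meant. This is similar to the search suggestions that Baidu shows as you type.
ES version used in this article: 7.7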

2. Term Suggester

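2.1 Definition

The term suggester first analyzes the search text into individual terms, then compares each term with the data in the specified index field, computes the edit distance, and returns suggested terms.

Edit distance: this is based on the Levenshtein edit distance algorithm. The core idea is to count how many single-character changes are needed to turn one word into another. For example, to get elasticsearch from elasticseach you have to insert one letter, r, which is one change, so the edit distance between the two words is 1.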
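
To make the edit distance idea concrete, here is a minimal Python sketch of the classic Levenshtein distance (insertions, deletions and substitutions each cost 1). The function name `levenshtein` is just an illustrative helper, and this is not the code Elasticsearch itself runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


print(levenshtein("elasticseach", "elasticsearch"))  # 1
```

The single missing r accounts for the distance of 1 described above.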

2.2 Example

PUT my_index/_doc/_bulk?refresh
{"index":{}}
{"message":"what is the weather like today"}
{"index":{}}
{"message":"what to eat today"}
{"index":{}}
{"message":"i like eating apples"}
{"index":{}}
{"message":"an apple"}
{"index":{}}
{"message":"how about apples"}

Use the term suggester to get suggestions:

GET my_index/_search
{
  "suggest": {
    "my-suggest": {
      "text": "eati apple",
      "term": {
        "field": "message",
        "suggest_mode": "missing"
      }
    }
  }
}

Result:

"suggest" : {
  "my-suggest" : [
    {
      "text" : "eati",
      "offset" : 0,
      "length" : 4,
      "options" : [
        {
          "text" : "eat",
          "score" : 0.6666666,
          "freq" : 1
        },
        {
          "text" : "eating",
          "score" : 0.5,
          "freq" : 1
        }
      ]
    },
    {
      "text" : "apple",
      "offset" : 5,
      "length" : 5,
      "options" : [ ]
    }
  ]
}

Summary:

- eati got 2 suggestions: eat and eating
- apple got no suggestions at all
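
As a side note, the two scores in the result above happen to match the expression 1 - distance / min(len(input), len(candidate)). The Python sketch below only checks that arithmetic. The formula is an assumption inferred from these two numbers, not a statement of how Elasticsearch computes scores, and `similarity` and `levenshtein` are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(text: str, candidate: str) -> float:
    # ASSUMPTION: normalise the distance by the shorter of the two words.
    return 1 - levenshtein(text, candidate) / min(len(text), len(candidate))


print(similarity("eati", "eat"))     # distance 1, shorter word has 3 letters
print(similarity("eati", "eating"))  # distance 2, shorter word has 4 letters
```

With these inputs the sketch gives roughly 0.667 and exactly 0.5, in line with the scores returned for eat and eating.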

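2.3 Parameters

suggest_mode has 3 possible values
- missing: the default. Suggestions are returned only for terms of the suggest text that are not in the index. In the example above, apple is already an indexed term, so nothing is returned for it.
- popular: suggestions are returned only when they have a higher document frequency than the original input term. For example, after switching to popular, apple in the example above gets the suggestion apples, because apples occurs in 2 documents while apple occurs in only 1.
- always: return any suggestion that could match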
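
The three modes can be pictured as a filter on candidate terms. The Python sketch below is a simplified model of that filter, not Elasticsearch code: `doc_freq` holds the document frequencies of the sample data from section 2.2, while `candidates` and `keep_candidate` are hypothetical names, and the candidate lists are assumed to have been found already by edit distance.

```python
# Document frequencies in the sample data from section 2.2.
doc_freq = {"apple": 1, "apples": 2, "eat": 1, "eating": 1}

# ASSUMPTION: candidates already selected by edit distance.
candidates = {"eati": ["eat", "eating"], "apple": ["apples"]}


def keep_candidate(mode: str, input_freq: int, candidate_freq: int) -> bool:
    if mode == "missing":
        # Only suggest for input terms that are not in the index.
        return input_freq == 0
    if mode == "popular":
        # Only suggest terms that occur in more documents than the input.
        return candidate_freq > input_freq
    if mode == "always":
        return True
    raise ValueError(f"unknown suggest_mode: {mode}")


def suggest(mode: str, term: str) -> list:
    input_freq = doc_freq.get(term, 0)
    return [c for c in candidates.get(term, [])
            if keep_candidate(mode, input_freq, doc_freq[c])]


for mode in ("missing", "popular", "always"):
    print(mode, suggest(mode, "eati"), suggest(mode, "apple"))
```

In this model, apple gets no suggestion under missing and gets apples under popular and always, which is the behaviour described in the list above.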

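analyzer: the analyzer used to split the suggest text into terms. Defaults to the search analyzer of the suggest field.

size: the maximum number of entries returned in each options array

sort: how the options array is sorted. There are 2 choices
- score: the default. Sort by score first, then by document frequency, then by the term itself
- frequency: sort by document frequency first, then by score, then by the term itself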
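
The two sort orders can be written as multi-key sorts. This Python sketch reuses the two options from the result in section 2.2 and adds one made-up entry (east, marked hypothetical in the code) so that the tie-breaking is visible. `sort_options` is an illustrative helper, not an Elasticsearch API.

```python
options = [
    {"text": "eating", "score": 0.5, "freq": 1},
    {"text": "eat", "score": 0.6666666, "freq": 1},
    {"text": "east", "score": 0.5, "freq": 3},  # hypothetical extra option
]


def by_score(option):
    # Higher score first, then higher frequency, then the term itself.
    return (-option["score"], -option["freq"], option["text"])


def by_frequency(option):
    # Higher frequency first, then higher score, then the term itself.
    return (-option["freq"], -option["score"], option["text"])


def sort_options(opts, sort="score"):
    keys = {"score": by_score, "frequency": by_frequency}
    if sort not in keys:
        raise ValueError(f"unknown sort: {sort}")
    return [o["text"] for o in sorted(opts, key=keys[sort])]


print(sort_options(options, "score"))      # ['eat', 'east', 'eating']
print(sort_options(options, "frequency"))  # ['east', 'eat', 'eating']
```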

The remaining parameters are not listed one by one here; see the official documentation.

3. Phrase Suggester

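3.1 Definition

I could not find much material online about the phrase suggester, and some articles disagree with my own test results. So let's go straight to an example and let the facts speak.

3.2 Example

Index 2 documents. In the second one, lucenee and elasticsearcc are misspelled on purpose; the test relies on them.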

PUT my_index/_doc/_bulk?refresh
{"index":{}}
{"message":"search fast with lucene elasticsearch"}
{"index":{}}
{"message":"he like lucenee,but i like elasticsearcc"}

Send a phrase suggest request:

GET my_index/_search
{
  "suggest": {
    "my-suggest": {
      "text": "use the lucen elasticsear",
      "phrase": {
        "field": "message",
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

The result:

"suggest" : {
  "my-suggest" : [
    {
      "text" : "use the lucen elasticsear",
      "offset" : 0,
      "length" : 25,
      "options" : [
        {
          "text" : "use the lucene elasticsearcc",
          "highlighted" : "use the <em>lucene elasticsearcc</em>",
          "score" : 0.0062606935
        },
        {
          "text" : "use the lucene elasticsearch",
          "highlighted" : "use the <em>lucene elasticsearch</em>",
          "score" : 0.0062606935
        },
        {
          "text" : "use the lucenee elasticsearcc",
          "highlighted" : "use the <em>lucenee elasticsearcc</em>",
          "score" : 0.005525381
        },
        {
          "text" : "use the lucenee elasticsearch",
          "highlighted" : "use the <em>lucenee elasticsearch</em>",
          "score" : 0.005525381
        },
        {
          "text" : "use the lucen elasticsearcc",
          "highlighted" : "use the lucen <em>elasticsearcc</em>",
          "score" : 0.004992289
        }
      ]
    }
  ]
}

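3.3 Analysis of the result

Results 1 and 2 have the same score; both adopt lucene. Why does elasticsearcc come before elasticsearch? Because when scores are equal, the results are ordered alphabetically (c sorts before h).
Results 3 and 4 have the same score; both adopt lucenee. Why does adopting lucenee score lower than adopting lucene? Look at the table below: it comes down to edit distance. A smaller edit distance means the candidate is closer to the search term, so its score is higher.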

Search term    Target term    Edit distance
lucen          lucene         1
lucen          lucenee        2

Result 5 corrects only elasticsear. Why are there only 5 results? Because the phrase suggester returns 5 by default; this can be changed with the size parameter.

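In addition:

- The phrase suggester returns corrected versions of the whole original text, rather than splitting it into terms and returning suggestions for each one as the term suggester does
- It supports the highlight setting; the corrected parts of each returned text are wrapped in the highlight tags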
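
The highlighted values in the result above wrap each run of consecutive corrected words in one pair of tags. The Python sketch below reproduces that observed output by comparing the original and corrected text word by word. It only imitates the visible behaviour, assuming whitespace tokenisation and an unchanged word count, and `highlight` is a hypothetical helper, not the Elasticsearch implementation.

```python
def highlight(original: str, corrected: str,
              pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    if len(orig_tokens) != len(corr_tokens):
        raise ValueError("sketch assumes the same number of words")

    out = []  # finished pieces of the highlighted text
    run = []  # current run of consecutive corrected words
    for before, after in zip(orig_tokens, corr_tokens):
        if before != after:
            run.append(after)
            continue
        if run:  # an unchanged word closes the current run
            out.append(pre_tag + " ".join(run) + post_tag)
            run = []
        out.append(after)
    if run:  # close a run that reaches the end of the text
        out.append(pre_tag + " ".join(run) + post_tag)
    return " ".join(out)


text = "use the lucen elasticsear"
print(highlight(text, "use the lucene elasticsearch"))
# use the <em>lucene elasticsearch</em>
print(highlight(text, "use the lucen elasticsearcc"))
# use the lucen <em>elasticsearcc</em>
```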

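4. Completion Suggester

4.1 Definition

It is mainly used to implement auto-complete. Ideally, auto-complete should be as fast as the user types, so that it gives instant feedback on what has been entered so far. The completion suggester is therefore optimized for speed: it uses data structures that support fast lookups, but they are costly to build and are stored in memory.

Note: unlike the other 2 suggesters, the completion suggester does not provide spelling correction.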
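
To give a feel for what a fast prefix lookup means, here is a small Python sketch that keeps the inputs in a sorted list and uses binary search to jump to the first entry matching a prefix. This is a deliberately simpler technique swapped in for illustration; it is not the data structure Elasticsearch uses internally, and `PrefixIndex` and its `size` default are hypothetical.

```python
import bisect


class PrefixIndex:
    """Toy prefix lookup over a sorted list of inputs."""

    def __init__(self, inputs):
        # Sorting up front is the "build cost"; lookups are then cheap.
        self._sorted = sorted(inputs)

    def suggest(self, prefix: str, size: int = 5) -> list:
        # Binary search for the first entry that is >= prefix.
        start = bisect.bisect_left(self._sorted, prefix)
        results = []
        for entry in self._sorted[start:]:
            if not entry.startswith(prefix) or len(results) == size:
                break  # entries sharing a prefix are contiguous
            results.append(entry)
        return results


index = PrefixIndex(["elasticsearch is very fast", "elastane what"])
print(index.suggest("elast"))
# ['elastane what', 'elasticsearch is very fast']
print(index.suggest("elasti"))
# ['elasticsearch is very fast']
```

Like the note above says for the real suggester, this lookup does no correction: a misspelled prefix simply matches nothing.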

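4.2 Example

Create the index and the data (if my_index still exists from the previous sections, delete it first with DELETE my_index, since creating an index that already exists fails):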

PUT my_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "completion"
      }
    }
  }
}

PUT my_index/_doc/_bulk?refresh
{"index":{}}
{"message":"elasticsearch is very fast"}
{"index":{}}
{"message":"elastane what"}

Send a completion suggest request:

GET my_index/_search
{
  "suggest": {
    "my-suggest": {
      "prefix": "elast",
      "completion": {
        "field": "message"
      }
    }
  }
}