Big Data Study Notes [15]: Synonyms in Elasticsearch

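Prerequisites: Elasticsearch 5.6.1 is installed, along with the IK analysis plugin; the plugin version must match the ES version. For installation details, see: http://blog.csdn.net/ld326/article/details/78057145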

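Pay close attention to the ES version you are running. From 5.x on, ES rejects index-level settings in the node configuration with the error "node settings must not contain any index level settings". Much of the material online describes the older way of configuring IK, which no longer applies; I followed those instructions at first and ran into all sorts of errors.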

1. A small synonym example

We define a token filter of type synonym and add it to the analyzer's filter chain:

PUT /my_index01
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "中国,华夏,中华人民共和国",
            "婴儿,新生儿"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "ik_smart",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  }
}

Test it:

GET /my_index01/_analyze
{  
  "analyzer":"my_synonyms",  
  "text":"我是中国人，你是华夏人；他生了一个孩子，我也生了一个婴儿。"  
}

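The result is shown below. Note that ik_smart emits 中国人 as a single token, so the 中国 rule does not fire there, while 华夏 and 婴儿 are expanded: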

{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "中国人",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "你",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "是",
      "start_offset": 7,
      "end_offset": 8,
      "type": "CN_CHAR",
      "position": 4
    },
    {
      "token": "华夏",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "中国",
      "start_offset": 8,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 5
    },
    {
      "token": "中华人民共和国",
      "start_offset": 8,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 5
    },
    {
      "token": "人",
      "start_offset": 10,
      "end_offset": 11,
      "type": "CN_CHAR",
      "position": 6
    },
    {
      "token": "他",
      "start_offset": 12,
      "end_offset": 13,
      "type": "CN_CHAR",
      "position": 7
    },
    {
      "token": "生了",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 8
    },
    {
      "token": "一个",
      "start_offset": 15,
      "end_offset": 17,
      "type": "CN_WORD",
      "position": 9
    },
    {
      "token": "孩子",
      "start_offset": 17,
      "end_offset": 19,
      "type": "CN_WORD",
      "position": 10
    },
    {
      "token": "我",
      "start_offset": 20,
      "end_offset": 21,
      "type": "CN_CHAR",
      "position": 11
    },
    {
      "token": "也",
      "start_offset": 21,
      "end_offset": 22,
      "type": "CN_CHAR",
      "position": 12
    },
    {
      "token": "生了",
      "start_offset": 22,
      "end_offset": 24,
      "type": "CN_WORD",
      "position": 13
    },
    {
      "token": "一个",
      "start_offset": 24,
      "end_offset": 26,
      "type": "CN_WORD",
      "position": 14
    },
    {
      "token": "婴儿",
      "start_offset": 26,
      "end_offset": 28,
      "type": "CN_WORD",
      "position": 15
    },
    {
      "token": "新生儿",
      "start_offset": 26,
      "end_offset": 28,
      "type": "SYNONYM",
      "position": 15
    }
  ]
}
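
To make the output above easier to read, here is a minimal Python sketch of what an equivalent-synonym rule does to a token stream: the original token is kept, and the other members of its group are added at the same position with type SYNONYM. This is a simplified model, not Elasticsearch's implementation. The helper name expand_synonyms, the pre-tokenized input, and the ORIGINAL type label are assumptions for illustration, and only single-token synonyms are handled.

```python
# Simplified model of an equivalent-synonym token filter (single-token rules only).
# expand_synonyms is a hypothetical helper, not part of any Elasticsearch client.

def expand_synonyms(tokens, rules):
    """Return (token, position, type) triples with synonyms stacked on one position."""
    # Map every term to the full group it belongs to, preserving rule order.
    groups = {}
    for rule in rules:
        terms = [t.strip() for t in rule.split(",")]
        for term in terms:
            groups[term] = terms

    out = []
    for position, token in enumerate(tokens):
        out.append((token, position, "ORIGINAL"))
        for synonym in groups.get(token, []):
            if synonym != token:
                out.append((synonym, position, "SYNONYM"))
    return out


rules = ["中国,华夏,中华人民共和国", "婴儿,新生儿"]
# Tokens as ik_smart produced them for the first sentence in the output above.
tokens = ["我", "是", "中国人", "你", "是", "华夏", "人"]

result = expand_synonyms(tokens, rules)
for token, position, kind in result:
    print(position, token, kind)
```

Because the lookup is by whole token, 中国人 passes through untouched, which mirrors what the real filter did in the output above.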

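Instead of listing synonyms inline, you can keep them in a file (or, with a plugin such as the dynamic-synonym one linked at the end of this post, at a remote location):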

 "synonyms_path" : "synonyms.txt",

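If both synonyms and synonyms_path are set, one overrides the other, so set only one of them. The synonyms.txt file can contain entries such as: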

高脂血症,高血脂症,高甘油三酯血症,混合性高脂血症,血脂异常
婴儿,新生儿,宝宝,小宝宝,婴幼儿
中国,华夏,中华人民共和国
猫,小花猫,小猫,花猫 => 猫
航海 =>航海,轮船,小船

Place the synonyms.txt file in the config directory.
Then recreate the index and run the following request:

GET /my_index01/_analyze
{  
  "analyzer":"my_synonyms",  
  "text":["我来自中国，你来自华夏",
          "他生了一个小宝宝，我也生了一个婴儿。",
          "小明家有一个小花猫，小花家有一个小猫",
          "我们开着小船与轮船去航海"
  ]  
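}

The result is (the four strings are analyzed one after another, which is why position jumps by roughly 100 between them):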

{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "来自",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中国",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华夏",
      "start_offset": 3,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "中华人民共和国",
      "start_offset": 3,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "你",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "来自",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "华夏",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "中国",
      "start_offset": 9,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 5
    },
    {
      "token": "中华人民共和国",
      "start_offset": 9,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 5
    },
    {
      "token": "他",
      "start_offset": 12,
      "end_offset": 13,
      "type": "CN_CHAR",
      "position": 106
    },
    {
      "token": "生了",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 107
    },
    {
      "token": "一个",
      "start_offset": 15,
      "end_offset": 17,
      "type": "CN_WORD",
      "position": 108
    },
    {
      "token": "小宝宝",
      "start_offset": 17,
      "end_offset": 20,
      "type": "CN_WORD",
      "position": 109
    },
    {
      "token": "婴儿",
      "start_offset": 17,
      "end_offset": 20,
      "type": "SYNONYM",
      "position": 109
    },
    {
      "token": "新生儿",
      "start_offset": 17,
      "end_offset": 20,
      "type": "SYNONYM",
      "position": 109
    },
    {
      "token": "宝宝",
      "start_offset": 17,
      "end_offset": 20,
      "type": "SYNONYM",
      "position": 109
    },
    {
      "token": "婴幼儿",
      "start_offset": 17,
      "end_offset": 20,
      "type": "SYNONYM",
      "position": 109
    },
    {
      "token": "我",
      "start_offset": 21,
      "end_offset": 22,
      "type": "CN_CHAR",
      "position": 110
    },
    {
      "token": "也",
      "start_offset": 22,
      "end_offset": 23,
      "type": "CN_CHAR",
      "position": 111
    },
    {
      "token": "生了",
      "start_offset": 23,
      "end_offset": 25,
      "type": "CN_WORD",
      "position": 112
    },
    {
      "token": "一个",
      "start_offset": 25,
      "end_offset": 27,
      "type": "CN_WORD",
      "position": 113
    },
    {
      "token": "婴儿",
      "start_offset": 27,
      "end_offset": 29,
      "type": "CN_WORD",
      "position": 114
    },
    {
      "token": "新生儿",
      "start_offset": 27,
      "end_offset": 29,
      "type": "SYNONYM",
      "position": 114
    },
    {
      "token": "宝宝",
      "start_offset": 27,
      "end_offset": 29,
      "type": "SYNONYM",
      "position": 114
    },
    {
      "token": "小宝宝",
      "start_offset": 27,
      "end_offset": 29,
      "type": "SYNONYM",
      "position": 114
    },
    {
      "token": "婴幼儿",
      "start_offset": 27,
      "end_offset": 29,
      "type": "SYNONYM",
      "position": 114
    },
    {
      "token": "小明",
      "start_offset": 30,
      "end_offset": 32,
      "type": "CN_WORD",
      "position": 215
    },
    {
      "token": "家有",
      "start_offset": 32,
      "end_offset": 34,
      "type": "CN_WORD",
      "position": 216
    },
    {
      "token": "一个",
      "start_offset": 34,
      "end_offset": 36,
      "type": "CN_WORD",
      "position": 217
    },
    {
      "token": "猫",
      "start_offset": 36,
      "end_offset": 39,
      "type": "SYNONYM",
      "position": 218
    },
    {
      "token": "小花",
      "start_offset": 40,
      "end_offset": 42,
      "type": "CN_WORD",
      "position": 219
    },
    {
      "token": "家有",
      "start_offset": 42,
      "end_offset": 44,
      "type": "CN_WORD",
      "position": 220
    },
    {
      "token": "一个",
      "start_offset": 44,
      "end_offset": 46,
      "type": "CN_WORD",
      "position": 221
    },
    {
      "token": "猫",
      "start_offset": 46,
      "end_offset": 48,
      "type": "SYNONYM",
      "position": 222
    },
    {
      "token": "我们",
      "start_offset": 49,
      "end_offset": 51,
      "type": "CN_WORD",
      "position": 323
    },
    {
      "token": "开着",
      "start_offset": 51,
      "end_offset": 53,
      "type": "CN_WORD",
      "position": 324
    },
    {
      "token": "小船",
      "start_offset": 53,
      "end_offset": 55,
      "type": "CN_WORD",
      "position": 325
    },
    {
      "token": "与",
      "start_offset": 55,
      "end_offset": 56,
      "type": "CN_CHAR",
      "position": 326
    },
    {
      "token": "轮船",
      "start_offset": 56,
      "end_offset": 58,
      "type": "CN_WORD",
      "position": 327
    },
    {
      "token": "去",
      "start_offset": 58,
      "end_offset": 59,
      "type": "CN_CHAR",
      "position": 328
    },
    {
      "token": "航海",
      "start_offset": 59,
      "end_offset": 61,
      "type": "SYNONYM",
      "position": 329
    },
    {
      "token": "轮船",
      "start_offset": 59,
      "end_offset": 61,
      "type": "SYNONYM",
      "position": 329
    },
    {
      "token": "小船",
      "start_offset": 59,
      "end_offset": 61,
      "type": "SYNONYM",
      "position": 329
    }
  ]
}

2. Synonym formats

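Format 1: comma-separated terms, as in the examples above.
Format 2: the => syntax, with a list of terms to match on the left and one or more replacements on the right.
For example:
"united states => usa",
"united states of america => usa"
If several rules are specified for the same synonym, they are merged. The order of the rules does not matter; instead, the longest match wins.
If rule order decided the outcome, the two rules above could turn United States of America into the terms (usa),(of),(america). Because the longest sequence wins, the result is the single term (usa).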
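
The two formats and the longest-match behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the parser Elasticsearch uses: parse_rules and apply_rules are hypothetical helpers, input is split on whitespace, and a comma rule is modelled as every term mapping to the whole group.

```python
# Sketch of the two rule formats plus longest-match application.
# parse_rules and apply_rules are hypothetical helpers for illustration only.

def parse_rules(lines):
    """Map each input phrase (tuple of words) to the list of phrases that replace it."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            left, right = line.split("=>", 1)
            inputs = [p.strip() for p in left.split(",")]
            outputs = [p.strip() for p in right.split(",")]
        else:
            # Comma format: every term maps to the whole group.
            inputs = outputs = [p.strip() for p in line.split(",")]
        for phrase in inputs:
            merged = mapping.setdefault(tuple(phrase.split()), [])
            for out in outputs:
                if out not in merged:  # rules for the same input are merged
                    merged.append(out)
    return mapping


def apply_rules(words, mapping):
    """Scan left to right, always taking the longest phrase that has a rule."""
    longest = max((len(key) for key in mapping), default=0)
    result, i = [], 0
    while i < len(words):
        for size in range(min(longest, len(words) - i), 0, -1):
            key = tuple(words[i:i + size])
            if key in mapping:
                result.append(mapping[key])
                i += size
                break
        else:
            result.append([words[i]])  # no rule matched; keep the word
            i += 1
    return result


mapping = parse_rules([
    "united states => usa",
    "united states of america => usa",
])
print(apply_rules("united states of america".split(), mapping))
print(apply_rules("the united states".split(), mapping))
```

Trying the four-word phrase before the two-word one is what keeps "of america" from leaking through as separate terms.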

3. Types of synonyms

[Figure: types of synonyms (image not preserved)]

4. Applying synonyms in an analyzer for production use

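4.1 Step 1: define a synonym filter, k_synonym_filter.
4.2 Step 2: define an analyzer based on ik_smart (k_analyzer). Its html_strip char_filter removes HTML tags first, ik_smart then tokenizes the text, and finally the tokens are lowercased and passed through the synonym filter.
4.3 Step 3: define the analyzer k_tag_analyzer, which is mainly for tags separated by whitespace.
4.4 Step 4: apply these analyzers to the fields in the mapping.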

PUT info_index
{
    "settings": {
        "analysis": {
            "filter": {
                "k_synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                }
            },
            "analyzer": {
                "k_analyzer": {
                    "char_filter": "html_strip",
                    "tokenizer": "ik_smart",
                    "filter": [
                        "lowercase",
                        "k_synonym_filter"
                    ]
                },
                "k_tag_analyzer": {
                    "char_filter": "html_strip",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "k_synonym_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "articles": {
            "properties": {
                "article_crawtime": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||yyyy-MM-dd'T'HH:mm:ss"
                },
                "keywords01": {
                    "type": "text",
                    "analyzer": "k_tag_analyzer",
                    "fields": {
                        "raw": {
                            "type": "keyword"
                        }
                    }
                },
                "keywords02": {
                    "type": "nested",
                    "properties": {
                        "keyword": {
                            "type": "keyword"
                        },
                        "weight": {
                            "type": "double"
                        }
                    }
                },
                "title": {
                    "type": "text",
                    "analyzer": "k_analyzer",
                    "store": true
                },
                "article_url": {
                    "type": "keyword"
                }
            }
        }
    }
}
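
A misspelled analyzer or filter name in a body like this only surfaces when the index is created. As a small aid, the following Python sketch checks such a body for dangling references before it is sent. The helper check_references and the trimmed-down body are assumptions for illustration; it only knows the custom analyzers defined in the body and the one built-in filter used in this post, so a built-in analyzer such as standard would be reported as unknown.

```python
# Hypothetical pre-flight check for an index-creation body (illustration only).

BUILTIN_FILTERS = {"lowercase"}  # only the built-in filter used in this post


def check_references(body):
    """Return a list of problems: undefined filters or analyzers referenced in the body."""
    analysis = body.get("settings", {}).get("analysis", {})
    filters = set(analysis.get("filter", {})) | BUILTIN_FILTERS
    analyzers = analysis.get("analyzer", {})
    problems = []

    # Every filter named by an analyzer must be defined (assumes "filter" is a list).
    for name, spec in analyzers.items():
        for flt in spec.get("filter", []):
            if flt not in filters:
                problems.append(f"analyzer {name}: unknown filter {flt}")

    # Every analyzer named by a field (including nested and multi-fields) must be defined.
    def walk(properties, path=""):
        for field, spec in properties.items():
            used = spec.get("analyzer")
            if used is not None and used not in analyzers:
                problems.append(f"field {path}{field}: unknown analyzer {used}")
            walk(spec.get("properties", {}), f"{path}{field}.")
            walk(spec.get("fields", {}), f"{path}{field}.")

    for mapping in body.get("mappings", {}).values():
        walk(mapping.get("properties", {}))
    return problems


body = {
    "settings": {"analysis": {
        "filter": {"k_synonym_filter": {"type": "synonym", "synonyms_path": "synonyms.txt"}},
        "analyzer": {
            "k_analyzer": {"char_filter": ["html_strip"], "tokenizer": "ik_smart",
                           "filter": ["lowercase", "k_synonym_filter"]},
            "k_tag_analyzer": {"char_filter": ["html_strip"], "tokenizer": "whitespace",
                               "filter": ["lowercase", "k_synonym_filter"]},
        },
    }},
    "mappings": {"articles": {"properties": {
        "keywords01": {"type": "text", "analyzer": "k_tag_analyzer",
                       "fields": {"raw": {"type": "keyword"}}},
        "title": {"type": "text", "analyzer": "k_analyzer", "store": True},
    }}},
}

print(check_references(body))
```

An empty list means every reference resolves; each string in a non-empty list names the analyzer or field that points at something undefined.
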

5. Testing the analyzer with the synonym filter

GET  info_index/_analyze
{
  "analyzer":"k_tag_analyzer",
  "text":"医改 医院 中国 你好的"
}

The result is:

{
  "tokens": [
    {
      "token": "医改",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "医院",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "中国",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "华夏",
      "start_offset": 6,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "中华人民共和国",
      "start_offset": 6,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "你好的",
      "start_offset": 9,
      "end_offset": 12,
      "type": "word",
      "position": 3
    }
  ]
}
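
The _analyze requests in this post can also be issued from a script. The sketch below only builds the HTTP request with Python's standard library and does not send it. The base URL http://localhost:9200, the helper name build_analyze_request, and the sample text are assumptions; POST is used in place of GET so the JSON body is sent without surprises.

```python
# Builds (but does not send) an _analyze request with the standard library.
# The base URL http://localhost:9200 is an assumption; adjust it to your cluster.
import json
import urllib.request


def build_analyze_request(index, analyzer, text, base_url="http://localhost:9200"):
    """Return a urllib Request for POST /<index>/_analyze with a JSON body."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("info_index", "k_tag_analyzer", "中国 婴儿")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen sends it; the response body should be a JSON token list in the same shape as the outputs shown in this post.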

Further reading

Elasticsearch: The Definitive Guide, chapter on synonyms:
https://www.elastic.co/guide/cn/elasticsearch/guide/current/synonyms.html
Updating synonyms dynamically:
https://github.com/bells/elasticsearch-analysis-dynamic-synonym
A post describing three ways to implement synonyms:
http://blog.csdn.net/fighting_one_piece/article/details/77800921

happyprince , http://blog.csdn.net/ld326/article/details/79235303