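Prerequisites: Elasticsearch 5.6.1 is installed, along with the IK analysis plugin; the plugin version must match the Elasticsearch version. For installation details, see: http://blog.csdn.net/ld326/article/details/78057145
Pay attention to the Elasticsearch version; it matters. From 5.x onward, Elasticsearch rejects index-level settings in the node configuration with the error "node settings must not contain any index level settings". Much of the material online describes the older IK configuration method, which no longer applies; I followed those instructions at first and ran into all sorts of errors.
1. A small synonym example
We define a token filter of type synonym and add it to the analyzer's filter chain: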
PUT /my_index01
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"中国,华夏,中华人民共和国",
"婴儿,新生儿"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "ik_smart",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Test it:
GET /my_index01/_analyze
{
"analyzer":"my_synonyms",
"text":"我是中国人,你是华夏人;他生了一个孩子,我也生了一个婴儿。"
}
The result is:
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "中国人",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "你",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 3
},
{
"token": "是",
"start_offset": 7,
"end_offset": 8,
"type": "CN_CHAR",
"position": 4
},
{
"token": "华夏",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 5
},
{
"token": "中国",
"start_offset": 8,
"end_offset": 10,
"type": "SYNONYM",
"position": 5
},
{
"token": "中华人民共和国",
"start_offset": 8,
"end_offset": 10,
"type": "SYNONYM",
"position": 5
},
{
"token": "人",
"start_offset": 10,
"end_offset": 11,
"type": "CN_CHAR",
"position": 6
},
{
"token": "他",
"start_offset": 12,
"end_offset": 13,
"type": "CN_CHAR",
"position": 7
},
{
"token": "生了",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 8
},
{
"token": "一个",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 9
},
{
"token": "孩子",
"start_offset": 17,
"end_offset": 19,
"type": "CN_WORD",
"position": 10
},
{
"token": "我",
"start_offset": 20,
"end_offset": 21,
"type": "CN_CHAR",
"position": 11
},
{
"token": "也",
"start_offset": 21,
"end_offset": 22,
"type": "CN_CHAR",
"position": 12
},
{
"token": "生了",
"start_offset": 22,
"end_offset": 24,
"type": "CN_WORD",
"position": 13
},
{
"token": "一个",
"start_offset": 24,
"end_offset": 26,
"type": "CN_WORD",
"position": 14
},
{
"token": "婴儿",
"start_offset": 26,
"end_offset": 28,
"type": "CN_WORD",
"position": 15
},
{
"token": "新生儿",
"start_offset": 26,
"end_offset": 28,
"type": "SYNONYM",
"position": 15
}
]
}
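Note that 中国人 in the first clause was not expanded, while 华夏 was. The synonym filter compares whole tokens emitted by the tokenizer, and ik_smart emitted 中国人 as a single token, which is not in the synonym list. The following is a minimal Python sketch of that token-level matching, not Elasticsearch's actual implementation; the function name expand_tokens, the "ORIGINAL" label, and the hard-coded token list (copied from the output above) are assumptions made for illustration.

```python
# Synonym groups from the index settings above; every word in a group
# is treated as equivalent to the others.
SYNONYM_GROUPS = [
    ["中国", "华夏", "中华人民共和国"],
    ["婴儿", "新生儿"],
]


def expand_tokens(tokens, groups):
    """Return (token, kind) pairs, adding same-group synonyms after each match."""
    lookup = {word: group for group in groups for word in group}
    expanded = []
    for token in tokens:
        expanded.append((token, "ORIGINAL"))
        # Only an exact, whole-token match triggers expansion.
        for synonym in lookup.get(token, []):
            if synonym != token:
                expanded.append((synonym, "SYNONYM"))
    return expanded


# Tokens as ik_smart produced them in the output above (first two clauses).
ik_tokens = ["我", "是", "中国人", "你", "是", "华夏", "人"]
result = expand_tokens(ik_tokens, SYNONYM_GROUPS)
for token, kind in result:
    print(token, kind)
```

If 中国人 should also match, it would have to be added to the synonym list itself, or the text would have to be tokenized so that 中国 comes out as its own token.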
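When configuring synonyms, the list can also be kept in a file instead of inline. The built-in filter reads a local file; loading from a remote location needs a plugin such as the dynamic synonym plugin linked at the end.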
"synonyms_path" : "synonyms.txt",
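If both synonyms and synonyms_path are set, one of them overrides the other, so set only one. Entries such as the following can go into synonyms.txt: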
高脂血症,高血脂症,高甘油三酯血症,混合性高脂血症,血脂异常
婴儿,新生儿,宝宝,小宝宝,婴幼儿
中国,华夏,中华人民共和国
猫,小花猫,小猫,花猫 => 猫
航海 =>航海,轮船,小船
Then put synonyms.txt in the Elasticsearch config directory.
Recreate the index, then run the following request:
GET /my_index01/_analyze
{
"analyzer":"my_synonyms",
"text":["我来自中国,你来自华夏",
"他生了一个小宝宝,我也生了一个婴儿。",
"小明家有一个小花猫,小花家有一个小猫",
"我们开着小船与轮船去航海"
]
}
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "来自",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "中国",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "华夏",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "中华人民共和国",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "你",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 3
},
{
"token": "来自",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 4
},
{
"token": "华夏",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 5
},
{
"token": "中国",
"start_offset": 9,
"end_offset": 11,
"type": "SYNONYM",
"position": 5
},
{
"token": "中华人民共和国",
"start_offset": 9,
"end_offset": 11,
"type": "SYNONYM",
"position": 5
},
{
"token": "他",
"start_offset": 12,
"end_offset": 13,
"type": "CN_CHAR",
"position": 106
},
{
"token": "生了",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 107
},
{
"token": "一个",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 108
},
{
"token": "小宝宝",
"start_offset": 17,
"end_offset": 20,
"type": "CN_WORD",
"position": 109
},
{
"token": "婴儿",
"start_offset": 17,
"end_offset": 20,
"type": "SYNONYM",
"position": 109
},
{
"token": "新生儿",
"start_offset": 17,
"end_offset": 20,
"type": "SYNONYM",
"position": 109
},
{
"token": "宝宝",
"start_offset": 17,
"end_offset": 20,
"type": "SYNONYM",
"position": 109
},
{
"token": "婴幼儿",
"start_offset": 17,
"end_offset": 20,
"type": "SYNONYM",
"position": 109
},
{
"token": "我",
"start_offset": 21,
"end_offset": 22,
"type": "CN_CHAR",
"position": 110
},
{
"token": "也",
"start_offset": 22,
"end_offset": 23,
"type": "CN_CHAR",
"position": 111
},
{
"token": "生了",
"start_offset": 23,
"end_offset": 25,
"type": "CN_WORD",
"position": 112
},
{
"token": "一个",
"start_offset": 25,
"end_offset": 27,
"type": "CN_WORD",
"position": 113
},
{
"token": "婴儿",
"start_offset": 27,
"end_offset": 29,
"type": "CN_WORD",
"position": 114
},
{
"token": "新生儿",
"start_offset": 27,
"end_offset": 29,
"type": "SYNONYM",
"position": 114
},
{
"token": "宝宝",
"start_offset": 27,
"end_offset": 29,
"type": "SYNONYM",
"position": 114
},
{
"token": "小宝宝",
"start_offset": 27,
"end_offset": 29,
"type": "SYNONYM",
"position": 114
},
{
"token": "婴幼儿",
"start_offset": 27,
"end_offset": 29,
"type": "SYNONYM",
"position": 114
},
{
"token": "小明",
"start_offset": 30,
"end_offset": 32,
"type": "CN_WORD",
"position": 215
},
{
"token": "家有",
"start_offset": 32,
"end_offset": 34,
"type": "CN_WORD",
"position": 216
},
{
"token": "一个",
"start_offset": 34,
"end_offset": 36,
"type": "CN_WORD",
"position": 217
},
{
"token": "猫",
"start_offset": 36,
"end_offset": 39,
"type": "SYNONYM",
"position": 218
},
{
"token": "小花",
"start_offset": 40,
"end_offset": 42,
"type": "CN_WORD",
"position": 219
},
{
"token": "家有",
"start_offset": 42,
"end_offset": 44,
"type": "CN_WORD",
"position": 220
},
{
"token": "一个",
"start_offset": 44,
"end_offset": 46,
"type": "CN_WORD",
"position": 221
},
{
"token": "猫",
"start_offset": 46,
"end_offset": 48,
"type": "SYNONYM",
"position": 222
},
{
"token": "我们",
"start_offset": 49,
"end_offset": 51,
"type": "CN_WORD",
"position": 323
},
{
"token": "开着",
"start_offset": 51,
"end_offset": 53,
"type": "CN_WORD",
"position": 324
},
{
"token": "小船",
"start_offset": 53,
"end_offset": 55,
"type": "CN_WORD",
"position": 325
},
{
"token": "与",
"start_offset": 55,
"end_offset": 56,
"type": "CN_CHAR",
"position": 326
},
{
"token": "轮船",
"start_offset": 56,
"end_offset": 58,
"type": "CN_WORD",
"position": 327
},
{
"token": "去",
"start_offset": 58,
"end_offset": 59,
"type": "CN_CHAR",
"position": 328
},
{
"token": "航海",
"start_offset": 59,
"end_offset": 61,
"type": "SYNONYM",
"position": 329
},
{
"token": "轮船",
"start_offset": 59,
"end_offset": 61,
"type": "SYNONYM",
"position": 329
},
{
"token": "小船",
"start_offset": 59,
"end_offset": 61,
"type": "SYNONYM",
"position": 329
}
]
}
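In the output above, positions jump from 5 to 106, from 114 to 215, and from 222 to 323. When text is an array, each element is analyzed as a separate value and a position gap is inserted between consecutive values. The numbers above are consistent with a gap of 100, which matches the documented default of position_increment_gap for text fields. A small arithmetic sketch under that assumption (the function name first_positions is hypothetical):

```python
def first_positions(position_counts, gap=100):
    """First token position of each value, given how many positions each value used."""
    starts = []
    next_start = 0
    for count in position_counts:
        starts.append(next_start)
        # The next value starts after this value's positions plus the gap.
        next_start += count + gap
    return starts


# Distinct positions used by the four strings in the output above:
# 0..5, 106..114, 215..222, 323..329.
counts = [6, 9, 8, 7]
starts = first_positions(counts)
print(starts)
```

The gap keeps phrase queries from matching across the boundary between two separate values.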
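2. Synonym formats
Format 1: comma-separated, as in the examples above.
Format 2: the => syntax, with a list of terms to match on the left and one or more replacements on the right.
For example:
"united states => usa",
"united states of america => usa"
If several rules are given for the same synonyms, they are merged. Rule order is not respected; the longest matching rule wins.
If the two rules above conflicted, Elasticsearch would turn United States of America into the terms (usa),(of),(america). Instead, the longest sequence wins, and the result is the single term (usa).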
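The two formats and the longest-match behaviour can be sketched in Python. This is a simplified model for illustration, not the Lucene/Solr synonym parser: parse_rule and apply_rules are hypothetical names, tokens are assumed to be already lowercased and split on whitespace, and each replacement is kept as a single output term.

```python
def parse_rule(line):
    """Parse one synonym line into {input phrase (tuple of words): [output terms]}."""
    if "=>" in line:
        left, right = line.split("=>", 1)
        inputs = [p.strip() for p in left.split(",") if p.strip()]
        outputs = [p.strip() for p in right.split(",") if p.strip()]
    else:
        # Comma-separated format: every entry maps to the whole group.
        inputs = outputs = [p.strip() for p in line.split(",") if p.strip()]
    return {tuple(phrase.split()): list(outputs) for phrase in inputs}


def apply_rules(tokens, rules):
    """Rewrite tokens, preferring the longest input phrase at each position."""
    longest = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in rules:
                out.extend(rules[key])
                i += size
                break
        else:
            # No rule starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


rules = {}
for line in ["united states => usa", "united states of america => usa"]:
    rules.update(parse_rule(line))

print(apply_rules("united states of america".split(), rules))
print(apply_rules("the united states".split(), rules))
```

Because the four-word phrase is tried before the two-word one, "united states of america" collapses to a single usa term rather than usa followed by of and america.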
3. Synonym types
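4. Applying synonyms in an analyzer for production use
4.1 Step 1: define a synonym filter, k_synonym_filter.
4.2 Step 2: define an analyzer, k_analyzer, which strips HTML tags with the html_strip char_filter, tokenizes with ik_smart, and finally lowercases all letters and applies the synonym lookup.
4.3 Step 3: define an analyzer, k_tag_analyzer, mainly for tags separated by whitespace.
4.4 Step 4: apply these analyzers to fields in the mappings.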
PUT info_index
{
"settings": {
"analysis": {
"filter": {
"k_synonym_filter": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
},
"analyzer": {
"k_analyzer": {
"char_filter": "html_strip",
"tokenizer": "ik_smart",
"filter": [
"lowercase",
"k_synonym_filter"
]
},
"k_tag_analyzer": {
"char_filter": "html_strip",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"k_synonym_filter"
]
}
}
}
},
"mappings": {
"articles": {
"properties": {
"article_crawtime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||yyyy-MM-dd'T'HH:mm:ss"
},
"keywords01": {
"type": "text",
"analyzer": "k_tag_analyzer",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"keywords02": {
"type": "nested",
"properties": {
"keyword": {
"type": "keyword"
},
"weight": {
"type": "double"
}
}
},
"title": {
"type": "text",
"analyzer": "k_analyzer",
"store": true
},
"article_url": {
"type": "keyword"
}
}
}
}
}
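The order of stages in k_tag_analyzer from steps 4.1 to 4.3 (char filter, then tokenizer, then token filters) can be mimicked in plain Python to make the flow concrete. This is only a rough model: the regular expression is a crude stand-in for html_strip, str.split stands in for the whitespace tokenizer, and the one-group synonym table is assumed to come from synonyms.txt. None of it is Elasticsearch code, and the name tag_analyzer is hypothetical.

```python
import re

# One equivalent group taken from synonyms.txt above.
SYNONYMS = {
    "中国": ["华夏", "中华人民共和国"],
    "华夏": ["中国", "中华人民共和国"],
    "中华人民共和国": ["中国", "华夏"],
}


def tag_analyzer(text):
    """Rough model of k_tag_analyzer: strip tags, split, lowercase, expand."""
    # 1. char_filter: remove anything that looks like an HTML tag.
    stripped = re.sub(r"<[^>]+>", " ", text)
    # 2. tokenizer: split on whitespace.
    tokens = stripped.split()
    # 3. token filters, in the configured order: lowercase, then synonyms.
    result = []
    for token in tokens:
        token = token.lower()
        result.append(token)
        result.extend(SYNONYMS.get(token, []))
    return result


print(tag_analyzer("<b>Elasticsearch</b> 医改 中国"))
```

Putting lowercase before the synonym filter means the entries in synonyms.txt only need to be written in lowercase.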
5. Testing the analyzer with the synonym filter
GET info_index/_analyze
{
"analyzer":"k_tag_analyzer",
"text":"医改 医院 中国 你好的"
}
{
"tokens": [
{
"token": "医改",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "医院",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "中国",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 2
},
{
"token": "华夏",
"start_offset": 6,
"end_offset": 8,
"type": "SYNONYM",
"position": 2
},
{
"token": "中华人民共和国",
"start_offset": 6,
"end_offset": 8,
"type": "SYNONYM",
"position": 2
},
{
"token": "你好的",
"start_offset": 9,
"end_offset": 12,
"type": "word",
"position": 3
}
]
}
Further reading:
Elasticsearch: The Definitive Guide, chapter on synonyms:
https://www.elastic.co/guide/cn/elasticsearch/guide/current/synonyms.html
Updating synonyms dynamically:
https://github.com/bells/elasticsearch-analysis-dynamic-synonym
A post describing three ways to implement synonyms:
http://blog.csdn.net/fighting_one_piece/article/details/77800921
happyprince , http://blog.csdn.net/ld326/article/details/79235303