Elasticsearch Series, Part 5: Using the IK Chinese Analyzer with ES

I. The built-in analyzers

Example: Set the shape to semi-transparent by calling set_trans(5)
standard analyzer (default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (english): set, shape, semi, transpar, call, set_tran, 5
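
To make the difference between two of these analyzers concrete, here is a rough Python approximation of the whitespace and simple analyzers applied to the example sentence. It is only a sketch: the function names are made up for illustration, and the regular expression covers ASCII letters only, whereas the real simple analyzer splits on any non-letter character.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_tokens(text):
    # Split on runs of whitespace only; case and punctuation are kept.
    return text.split()

def simple_tokens(text):
    # Keep maximal runs of letters and lowercase them;
    # digits, "_", "-" and brackets all act as separators.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(whitespace_tokens(TEXT))
print(simple_tokens(TEXT))
```

The two printed lists should line up with the whitespace and simple rows above. The standard and language analyzers involve Unicode segmentation rules and stemming, so they are not imitated here.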

II. Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
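
The same request can be issued from a script. The sketch below builds it with Python's standard library only; the node address http://localhost:9200 and the helper names build_analyze_request and token_strings are assumptions for illustration. The actual send is left commented out because it needs a running node.

```python
import json
import urllib.request

def build_analyze_request(base_url, analyzer, text, index=None):
    # /_analyze works with built-in analyzers; /<index>/_analyze can also
    # resolve analyzers configured on that index.
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_strings(response_json):
    # The response has the shape {"tokens": [{"token": "...", ...}, ...]}.
    return [t["token"] for t in response_json["tokens"]]

# Assumed local node address; adjust to your cluster.
req = build_analyze_request("http://localhost:9200", "standard", "Text to analyze")
print(req.get_method(), req.full_url)

# To actually send it (requires a running node):
# with urllib.request.urlopen(req) as resp:
#     print(token_strings(json.load(resp)))
```

Passing index="my_index" targets /my_index/_analyze instead, which is the form used later for the IK analyzer.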

III. The IK Chinese analyzer

1. Steps:
git clone https://github.com/medcl/elasticsearch-analysis-ik
cd elasticsearch-analysis-ik
mvn package
Copy target/releases/elasticsearch-analysis-ik-*.*.*.zip to the es/plugins/ik directory
Unzip elasticsearch-analysis-ik-*.*.*.zip inside es/plugins/ik
Restart ES

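2. Two analyzers
ik_max_word: splits the text at the finest granularity, emitting as many dictionary words as it can find (tokens may overlap)
ik_smart: splits the text at the coarsest granularity
3. Usage (ES 6.x mapping syntax; from 7.0 on, mappings no longer take the my_type level)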
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}
4. Test
GET /my_index/_analyze
{
  "text": " 对于你，我始终只能以陌生人的身份去怀念。",
  "analyzer": "ik_max_word"
}
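5. Configuration files
IKAnalyzer.cfg.xml: configures custom dictionaries
main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed here is kept together as a single token
quantifier.dic: measure words and units
suffix.dic: common suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words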
6. Adding a custom dictionary
IKAnalyzer.cfg.xml: set the ext_dict entry to custom/mydict.dic
Adding a custom stop-word dictionary
IKAnalyzer.cfg.xml: set the ext_stopwords entry to custom/ext_stopword.dic
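
As a rough illustration of what these two settings look like on disk, the Python sketch below generates a config file and a dictionary file body. It assumes the Java-properties XML layout (one entry element per key) used by the IKAnalyzer.cfg.xml bundled with the plugin and a one-term-per-line UTF-8 dictionary format; the helper names are hypothetical, so compare the output against the file shipped with your IK version before using it.

```python
import xml.etree.ElementTree as ET

def build_ik_config(ext_dict, ext_stopwords):
    # Java-properties XML: one <entry key="..."> element per setting.
    root = ET.Element("properties")
    ET.SubElement(root, "comment").text = "IK Analyzer extension configuration"
    ET.SubElement(root, "entry", key="ext_dict").text = ext_dict
    ET.SubElement(root, "entry", key="ext_stopwords").text = ext_stopwords
    header = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">\n'
    )
    return header + ET.tostring(root, encoding="unicode")

def dictionary_text(words):
    # Assumed dictionary format: one term per line, saved as UTF-8.
    return "\n".join(words) + "\n"

# Paths are the ones named above, relative to the IK config directory.
print(build_ik_config("custom/mydict.dic", "custom/ext_stopword.dic"))
```

After changing either file, ES has to be restarted for the static dictionaries to be picked up, which is the limitation the hot-reload options address.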
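7. Hot-reload options

Option 1: modify the IK analyzer source to add support for automatically reloading new dictionary entries from MySQL at a fixed interval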

① Download the source
https://github.com/medcl/elasticsearch-analysis-ik/tree/v6.2.4
② Modify the source
Dictionary class, line 169: the initialization method of the Dictionary singleton; create and start our custom thread here
HotDictReloadThread class: an infinite loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionary
Dictionary class, line 389: this.loadMySQLExtDict();
Dictionary class, line 683: this.loadMySQLStopwordDict();
③ Build the package with mvn package
target\releases\elasticsearch-analysis-ik-6.2.4.zip
④ Unzip the IK package
⑤ Put the MySQL driver jar into the IK directory
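⑥ Edit the JDBC settings
⑦ Restart ES and watch the logs
⑧ Add dictionary words and stop words in MySQL
⑨ Run an analysis test to verify that the hot reload works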
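
The reload loop at the heart of this option can be sketched independently of IK. The Python below is not the modified plugin code (that is Java); it only illustrates the pattern of a background thread that periodically re-reads a word source and swaps in the fresh set. The names HotDict and start_reload_thread are hypothetical, and an in-memory list stands in for the MySQL table.

```python
import threading

class HotDict:
    """Holds the current word set and replaces it as a whole on reload."""

    def __init__(self, load_words):
        self._load_words = load_words  # callable returning an iterable of words
        self._lock = threading.Lock()
        self._words = frozenset(load_words())

    def reload(self):
        fresh = frozenset(self._load_words())  # build the new set outside the lock
        with self._lock:
            self._words = fresh

    def contains(self, word):
        with self._lock:
            return word in self._words

def start_reload_thread(hot_dict, interval_seconds, stop_event):
    def loop():
        # Event.wait returns False on timeout and True once stop_event is set,
        # so the loop reloads every interval until asked to stop.
        while not stop_event.wait(interval_seconds):
            hot_dict.reload()

    worker = threading.Thread(target=loop, name="hot-dict-reload", daemon=True)
    worker.start()
    return worker

# Stand-in for the MySQL table: a plain list (sample words are made up).
source_rows = ["网红"]
hot = HotDict(lambda: list(source_rows))
stop = threading.Event()
worker = start_reload_thread(hot, interval_seconds=1.0, stop_event=stop)

source_rows.append("蓝瘦香菇")
hot.reload()  # force one reload here instead of waiting for the timer
print(hot.contains("蓝瘦香菇"))

stop.set()
worker.join()
```

Unlike a bare infinite loop, this version waits on an event between reloads, so the thread can be stopped cleanly and does not spin.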

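Option 2: use the hot-reload mechanism that IK supports natively: deploy a web server that exposes an HTTP endpoint, and signal dictionary updates through the Last-Modified and ETag HTTP response headers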
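
A minimal sketch of such an endpoint, using only Python's standard library, is shown below. It assumes the plugin polls the URL, compares the Last-Modified and ETag headers with the previous values, and fetches the body (one term per line, UTF-8) when either changes. The word list, the handler name, and the port in the final comment are made up for illustration.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["网红", "蓝瘦香菇"]             # hypothetical hot words
LAST_MODIFIED = formatdate(usegmt=True)  # should be refreshed whenever WORDS changes

def dict_body():
    # One term per line, UTF-8 encoded.
    return ("\n".join(WORDS) + "\n").encode("utf-8")

def etag_for(body):
    # Content hash as ETag: it changes exactly when the word list changes.
    return '"' + hashlib.sha256(body).hexdigest() + '"'

class DictHandler(BaseHTTPRequestHandler):
    def _send_headers(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("ETag", etag_for(body))
        self.end_headers()

    def do_HEAD(self):
        # Headers only, so a poller can check for changes cheaply.
        self._send_headers(dict_body())

    def do_GET(self):
        body = dict_body()
        self._send_headers(body)
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def make_server(port=0):
    # Port 0 lets the OS pick a free port.
    return HTTPServer(("127.0.0.1", port), DictHandler)

# To run it: make_server(8000).serve_forever()
```

In the plugin configuration this URL would go into the remote_ext_dict entry of IKAnalyzer.cfg.xml; check the README of your IK version for the exact key names and polling behavior.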

Note: Option 1 is recommended. Option 2 is not recommended even by the IK community on GitHub, which considers it not stable enough.