Elasticsearch IK Chinese Analyzer and Exact-Match Queries

Installing ElasticSearch 6.6.2 on Mac

Installing the ElasticSearch Head Plugin on Mac

Building on those two articles, let's take a look at IK.

IKAnalyzer: a free, open-source Java tokenizer and currently one of the most popular Chinese analyzers. It is simple and stable; to get particularly good results you have to maintain the vocabulary yourself, and custom dictionaries are supported.

I. Installing the IK analyzer plugin

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.4.2 (pick the release that matches your Elasticsearch version).

Then create an ik folder under Elasticsearch's plugins directory and unzip the downloaded archive into it.
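
For example, assuming the release zip is in the current directory and ES_HOME points at your Elasticsearch install directory (both are assumptions; adjust the paths to your setup), the steps look roughly like this:

# create the plugin folder and unpack the IK release into it (paths are assumptions)
mkdir -p "$ES_HOME/plugins/ik"
unzip elasticsearch-analysis-ik-6.3.0.zip -d "$ES_HOME/plugins/ik"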

Edit the plugin-descriptor.properties file in the ik folder and set the version numbers (here I am running Elasticsearch 6.2.2 with IK 6.3.0; per the project, any release within the same major version works, and the major version here is 6), as follows:

description=IK Analyzer for Elasticsearch
#
# 'version': plugin's version
version=6.3.0
#
# 'name': the plugin name
name=analysis-ik
#
# 'classname': the name of the class to load, fully-qualified.
classname=org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin
#
# 'java.version' version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=1.8
#
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=6.2.2

Then restart Elasticsearch: switch to the bin directory and run the elasticsearch command.

On a successful start, the console log reports that the node has started.

II. Testing common ES APIs

1. Create an index

http://localhost:9200/es  PUT

Result:
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "es"
}
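
If you prefer the command line over a REST client, an equivalent curl call (assuming Elasticsearch is listening on localhost:9200) is:

curl -X PUT "http://localhost:9200/es"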

2. Create a mapping and configure the analyzers

http://localhost:9200/es/_mapping/doc  POST
Header: Content-Type: application/json
Body:
{
  "properties":{
    "content":{
      "type":"text",
      "analyzer":"ik_max_word",
      "search_analyzer":"ik_smart"
    }
  }
}

Result:
{
    "acknowledged": true
}
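
The same request as a curl sketch (assuming the es index from step 1 exists):

curl -X POST "http://localhost:9200/es/_mapping/doc" \
  -H 'Content-Type: application/json' \
  -d '{
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }'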

3. Add data: index four test documents

http://localhost:9200/es/doc/1  POST
Header: Content-Type: application/json
Body:
{
  "content":"美国留给伊拉克的是个烂摊子吗"
}

http://localhost:9200/es/doc/2  POST
Header: Content-Type: application/json
{
  "content":"公安部:各地校车将享最高路权"
}

http://localhost:9200/es/doc/3  POST
Header: Content-Type: application/json
{
  "content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}

http://localhost:9200/es/doc/4  POST
Header: Content-Type: application/json
{
  "content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
Result (shown for the last document):
{
    "_index": "es",
    "_type": "doc",
    "_id": "4",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}
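
A curl sketch for the first document (repeat with ids 2-4 and the remaining texts; localhost:9200 is assumed):

curl -X POST "http://localhost:9200/es/doc/1" \
  -H 'Content-Type: application/json' \
  -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'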

4. Tokenized search: the query below finds 2 documents [a match query; note that a match query targets a single field]

http://localhost:9200/es/_search  POST
Body:
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

Result:
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.6489038,
        "hits": [
            {
                "_index": "es",
                "_type": "doc",
                "_id": "4",
                "_score": 0.6489038,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "3",
                "_score": 0.2876821,
                "_source": {
                    "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
                }
            }
        ]
    }
}
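
The same search as a curl sketch:

curl -X POST "http://localhost:9200/es/_search" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"match":{"content":"中国"}}}'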

5. Delete an index

http://localhost:9200/{index}  DELETE
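
For example, to drop the es index created above (destructive, so double-check the index name):

curl -X DELETE "http://localhost:9200/es"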

6. Analyze tokenization with _analyze

http://localhost:9200/_analyze  POST
Body:
{
  "analyzer": "ik_max_word",
  "text": "中国人"
}

Result:
{
    "tokens": [
        {
            "token": "中国人",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中国",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "国人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
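
The same analysis request as a curl sketch:

curl -X POST "http://localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer":"ik_max_word","text":"中国人"}'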

III. Exact compound queries (and, or)

1. The full data set is now as follows (note that the test data has changed since section II: documents 5 and 6 were added, and document 1 was re-indexed as "美国人111" for the experiments below):

http://localhost:9200/es/_search  GET, returns all documents:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 6,
        "max_score": 1,
        "hits": [
            {
                "_index": "es",
                "_type": "doc",
                "_id": "5",
                "_score": 1,
                "_source": {
                    "content": "美国人特朗普垃圾"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "4",
                "_score": 1,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "content": "公安部:各地校车将享最高路权"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "6",
                "_score": 1,
                "_source": {
                    "content": "美航空母舰国人"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "content": "美国人111"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "3",
                "_score": 1,
                "_source": {
                    "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
                }
            }
        ]
    }
}

2. or query (the default)

http://localhost:9200/es/_search  POST
Body:
{
  "query": {
    "match": {
      "content": "美国人1"   
    }
  }
}

From the result (two hits, not reproduced here) we can see why: the analyzer splits "美国人111" into the two terms "美国人" and "111", and the default operator for match is or, so both documents that contain either term ("美国人111" and "美国人特朗普垃圾") are returned.
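
To make the default explicit, the same query can be written with the operator spelled out; this curl sketch should behave identically to the plain match form above:

curl -X POST "http://localhost:9200/es/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "match": {
        "content": { "query": "美国人111", "operator": "or" }
      }
    }
  }'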

3. Using an and query for exact matching

http://localhost:9200/es/_search?pretty=true  POST

{
  "query": {
    "match": {
      "content": {
        "query": "美国人111",
        "operator": "and"
      }
    }
  }
}

As the result shows, the query now only matches records whose content contains both "美国人" and "111"; with and, exactly one document is found.

4. Tokenization of "美国人111": the string is split into several words (originally shown in a screenshot); the _analyze sketch after this paragraph reproduces it:
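
With the default IK dictionary one would expect tokens along the lines of 美国人, 美国, 国人, and 111 (an assumption; the exact list depends on your dictionary):

curl -X POST "http://localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer":"ik_max_word","text":"美国人111"}'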

ik_max_word: performs the finest-grained segmentation, e.g. splitting "中华人民共和国国歌" into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting the possible combinations; suitable for term queries.

ik_smart: performs the coarsest-grained segmentation, e.g. splitting "中华人民共和国国歌" into "中华人民共和国, 国歌"; suitable for phrase queries.


References:

https://www.jianshu.com/p/362f85ebf383
https://www.cnblogs.com/cjsblog/p/9910788.html
