Installing the IK Chinese Analyzer for ElasticSearch

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/paditang/article/details/78930870

Preface

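When you use ElasticSearch for search, the inverted index built from your text is critical. For Chinese passages, that means segmenting the text into words correctly before indexing is the top priority. This post describes how to install the IK Chinese analyzer in ElasticSearch (abbreviated as ES from here on).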

Main text

Installation steps

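  1. Download the plugin:

    • Source project

      Follow the link to the IK project's release page, which hosts prebuilt packages. Pick the IK release that matches the ES version installed on your server (ES generally refuses to load a plugin built for a different version), and download it.

      (Screenshot: downloading the package)

    • Download link

      If GitHub is hard to reach, you can download the IK 5.6.0 package that I keep in cloud storage.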

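  2. Unpack and configure

    • Create a folder named ik under ES_HOME/plugins/

    • Unpack the contents of the archive into ik

    • Resulting file layout

      (Screenshot: project file layout)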

  3. Start ES

    When ES starts now, the startup log should show that the IK analyzer has been loaded.

    (Screenshot: analyzer being loaded)
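
Step 2 can also be scripted. The following is a minimal Python sketch, assuming the release archive keeps the plugin files at its top level. The function name install_ik and the paths in the usage note are hypothetical, introduced here only for illustration; they are not part of IK or ES.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Unpack an IK release archive into ES_HOME/plugins/ik.

    Assumption: the plugin files sit at the top level of the archive.
    If your archive wraps them in a folder, move them up afterwards.
    """
    target = Path(es_home) / "plugins" / "ik"
    # Create ES_HOME/plugins/ik (no error if it already exists).
    target.mkdir(parents=True, exist_ok=True)
    # Unpack the archive contents into the ik folder.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the top-level entries so the layout can be checked by eye.
    return sorted(entry.name for entry in target.iterdir())
```

For example, install_ik("elasticsearch-analysis-ik-5.6.0.zip", "/opt/elasticsearch") would unpack into /opt/elasticsearch/plugins/ik; both paths are placeholders for your own download and ES_HOME. ES still has to be restarted afterwards, as in step 3.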

Testing the segmentation results

Ordinary segmentation (built-in english analyzer)
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"english",
  "text":"使用搜索引擎"
}
Segmentation result:
{
    "tokens": [
        {
            "token": "使",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "用",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "搜",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "索",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "引",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "擎",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}
ik_smart segmentation
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"ik_smart",
  "text":"使用搜索引擎"
}
{
    "tokens": [
        {
            "token": "使用",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "搜索引擎",
            "start_offset": 2,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
ik_max_word segmentation
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"ik_max_word",
  "text":"使用搜索引擎"
}
{
    "tokens": [
        {
            "token": "使用",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "搜索引擎",
            "start_offset": 2,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "搜索",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "索引",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "引擎",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}
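
The three requests above differ only in the analyzer name, so they are easy to script. Below is a rough Python sketch that uses only the standard library. The helper names are made up for this post, and http://localhost:9200 in the usage note is an assumed address standing in for {{host}}:{{port}}.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build a POST request for the _analyze endpoint (nothing is sent yet)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Pull just the token strings out of a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request to a running node and return the token strings."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as reply:
        return token_texts(json.load(reply))
```

With a node running, something like analyze("http://localhost:9200", "ik_smart", "使用搜索引擎") would return the token strings from the response, which makes it easy to compare english, ik_smart and ik_max_word on the same text.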

Search test with segmentation

// Create the index
PUT {{host}}:{{port}}/news  
// Create the mapping and set the analyzer
POST {{host}}:{{port}}/news/sports/_mapping
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"ik_max_word",
            "index":true
        }
    }
}
Import the data ....
Data in the search engine:
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgyE7pGEKcCwwZuUe6",
                "_score": 1,
                "_source": {
                    "content": "热火形势一片大好"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgx7fpGEKcCwwZuUe5",
                "_score": 1,
                "_source": {
                    "content": "火箭98-99不敌凯尔特人,惨遭四连败"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgyOLYGEKcCwwZuUe7",
                "_score": 1,
                "_source": {
                    "content": "曼城18连胜,英超无人能挡"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgxyxXGEKcCwwZuUe4",
                "_score": 1,
                "_source": {
                    "content": "巴萨3-0击败皇马赢下国家德比,梅西一球一助再获满分"
                }
            }
        ]
    }
}
POST {{host}}:{{port}}/news/sports/_search
{
    "query":{
        "match":{
            "content":"火箭队新闻"
        }
    }   
}
{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.6099695,
        "hits": [
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgx7fpGEKcCwwZuUe5",
                "_score": 0.6099695,
                "_source": {
                    "content": "火箭98-99不敌凯尔特人,惨遭四连败"
                }
            }
        ]
    }
}
POST {{host}}:{{port}}/news/sports/_search
{
    "query":{
        "match":{
            "content":"火焰"
        }
    }   
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

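The segmentation tests show that the Chinese analyzer splits the text to be searched into terms that carry meaning in Chinese, instead of splitting it into individual characters.

The search tests show that relevant results are kept and irrelevant ones are filtered out, which makes search smarter: searching for 火箭队新闻 returns only the Rockets headline, while 火焰 returns nothing even though the character 火 appears in two of the headlines.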
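
To make the reasoning concrete, here is a toy inverted index in Python. It is only a sketch of the idea, not how ES is implemented: match simply returns every document that shares at least one term with the query. The term lists are hand-written assumptions for two of the sample headlines, not actual IK output.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def match(index, query_terms):
    """Return ids of documents sharing at least one term with the query."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return sorted(hits)


# Hand-written word-level terms (illustrative assumptions, not IK output).
word_docs = {
    "heat": ["热火", "形势", "一片", "大好"],
    "rockets": ["火箭", "不敌", "凯尔特人", "四连败"],
}
# The same headlines split into one term per character.
char_docs = {doc_id: list("".join(terms)) for doc_id, terms in word_docs.items()}

word_index = build_index(word_docs)
char_index = build_index(char_docs)

print(match(word_index, ["火焰"]))          # prints []
print(match(char_index, ["火", "焰"]))      # prints ['heat', 'rockets']
print(match(word_index, ["火箭", "新闻"]))  # prints ['rockets']
```

With one term per character, the query 火焰 would hit both headlines through the shared character 火. With word-level terms it hits neither, which mirrors the empty result in the test above.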

References

The articles below explain segmentation in more depth. Consult them if you want more detail; this post does not go further.

How to install Chinese analyzers in Elasticsearch (IK + pinyin)

IK segmentation details

Reposted from blog.csdn.net/paditang/article/details/78930870