[Full-Text Search Engines] Elasticsearch Analyzers

Analyzers

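An analyzer is a tool that breaks a piece of text into individual terms (tokens) according to a set of rules.
- For example: 华为手机 -> 华为, 手, 手机

Elasticsearch built-in analyzers
- Standard Analyzer - the default analyzer; splits text on word boundaries and lowercases
- Simple Analyzer - splits on non-letter characters (symbols are dropped) and lowercases
- Stop Analyzer - lowercases and filters out stop words (the, a, is)
- Whitespace Analyzer - splits on whitespace; does not lowercase
- Keyword Analyzer - no tokenization; the input is emitted unchanged as a single token
- Pattern Analyzer - splits by regular expression; the default pattern is \W+ (split on non-word characters)
- Language Analyzers - analyzers for more than 30 common languages

The built-in analyzers handle Chinese poorly: the text is split into one token per character.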
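To make the differences between the built-in analyzers concrete, here is a rough Python sketch that imitates a few of them with plain string operations. It is not how Elasticsearch implements them: the function names are made up for this illustration, the letter test is ASCII-only, and `per_character` only mimics the one-token-per-character result described above for Chinese text.

```python
import re

def whitespace_analyzer(text):
    # Split on whitespace only; case is preserved.
    return text.split()

def simple_analyzer(text):
    # Split on anything that is not an ASCII letter (assumption: ASCII only),
    # then lowercase each piece.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def stop_analyzer(text, stop_words=("the", "a", "is")):
    # Same splitting as simple_analyzer, then drop the stop words.
    return [t for t in simple_analyzer(text) if t not in stop_words]

def keyword_analyzer(text):
    # No tokenization: the whole input is one token.
    return [text]

def per_character(text):
    # Imitates the one-character-per-token result for Chinese input.
    return list(text)

sample = "The QUICK brown-fox is 2 fast"
print(whitespace_analyzer(sample))
print(simple_analyzer(sample))
print(stop_analyzer(sample))
print(keyword_analyzer(sample))
print(per_character("华为手机"))
```

Note how the whitespace version keeps `brown-fox` and `2` intact, while the simple version drops the digit and splits on the hyphen.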

The IK Analyzer

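IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.
- It is a Maven-based project
- It is reported to process about 600,000 characters per second
- It supports user-defined dictionary extensions

Download: https://github.com/medcl/elasticsearch-analysis-ik/archive/v7.4.0.zip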

Using the IK Analyzer

The IK analyzer has two segmentation modes: ik_max_word and ik_smart.

ik_max_word

Splits the text at the finest granularity. For example, “乒乓球明年总冠军” is split into 乒乓球, 乒乓, 球, 明年, 总冠军, and 冠军.

    # Mode 1: ik_max_word
    GET /_analyze
    {
      "analyzer": "ik_max_word",
      "text": "乒乓球明年总冠军"
    }

The ik_max_word analyzer returns:

    {
      "tokens" : [
        {
          "token" : "乒乓球",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "乒乓",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "球",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "CN_CHAR",
          "position" : 2
        },
        {
          "token" : "明年",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 3
        },
        {
          "token" : "总冠军",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "CN_WORD",
          "position" : 4
        },
        {
          "token" : "冠军",
          "start_offset" : 6,
          "end_offset" : 8,
          "type" : "CN_WORD",
          "position" : 5
        }
      ]
    }
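The same request can also be sent from a script instead of the console. The sketch below uses only the Python standard library and assumes an unsecured node at `http://localhost:9200` with the IK plugin installed; the function names and the base URL are assumptions for illustration. `token_strings` and `token_spans` simply reduce a response like the one above to the fields that matter when comparing analyzers.

```python
import json
import urllib.request

def analyze(analyzer, text, base_url="http://localhost:9200"):
    # Assumption: a local node without authentication.
    # Sends the same JSON body as the console request to /_analyze.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def token_strings(response):
    # Keep only the token text, in the order returned.
    return [t["token"] for t in response["tokens"]]

def token_spans(response):
    # (token, start_offset, end_offset) makes overlapping tokens easy to spot.
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in response["tokens"]]

# Usage against a running node (not executed here):
# print(token_strings(analyze("ik_max_word", "乒乓球明年总冠军")))
```

In the spans, tokens such as 乒乓球 (0 to 3) and 乒乓 (0 to 2) cover the same characters, which is what "finest granularity" means in practice.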

ik_smart

Splits the text at the coarsest granularity. For example, “乒乓球明年总冠军” is split into 乒乓球, 明年, and 总冠军.

    # Mode 2: ik_smart
    GET /_analyze
    {
      "analyzer": "ik_smart",
      "text": "乒乓球明年总冠军"
    }

The ik_smart analyzer returns:

    {
      "tokens" : [
        {
          "token" : "乒乓球",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "明年",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "总冠军",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "CN_WORD",
          "position" : 2
        }
      ]
    }
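The difference between the two responses above can be imitated with a toy dictionary-based segmenter. This is only an illustration and not IK's actual algorithm: the dictionary is a hand-picked assumption that happens to cover the sample sentence, the fine-grained function lists every dictionary word found at any offset, and the coarse function uses plain forward maximum matching as a stand-in.

```python
# Hypothetical toy dictionary; IK ships a far larger one.
DICTIONARY = {"乒乓球", "乒乓", "球", "明年", "总冠军", "冠军"}

def finest_grained(text, dictionary=DICTIONARY):
    # Emit every dictionary word at every start offset, longest first.
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens

def coarsest_grained(text, dictionary=DICTIONARY):
    # Forward maximum matching: take the longest dictionary word at the
    # current offset, or a single character if nothing matches.
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in dictionary:
                break
        else:
            end = i + 1
        tokens.append(text[i:end])
        i = end
    return tokens

text = "乒乓球明年总冠军"
print(finest_grained(text))    # overlapping words, like ik_max_word
print(coarsest_grained(text))  # non-overlapping words, like ik_smart
```

A fine-grained token list suits indexing, since more terms can match a query, while the coarse list produces fewer and longer terms.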

挽远
