ElasticSearch从入门到精通(持续更新....)—分词器

这是我参与11月更文挑战的第12天，活动详情查看：2021最后一次更文挑战。

(持续更新....)

什么是分词器

分词器 接受一个字符串作为输入，将这个字符串拆分成独立的词或 语汇单元（token） （可能会丢弃一些标点符号等字符），然后输出一个 语汇单元流（token stream） 。

全文搜索引擎会用某种算法对要建索引的文档进行分析，从文档中提取出若干Token(词元)，这些算法称为Tokenizer(分词器)，这些Token会被进一步处理，比如转成小写等，这些处理算法被称为Token Filter(词元处理器), 被处理后的结果被称为Term(词)，文档中包含了几个这样的Term被称为Frequency(词频)。引擎会建立Term和原文档的Inverted Index(倒排索引)，这样就能根据Term很快到找到源文档了。文本被Tokenizer处理前可能要做一些预处理，比如去掉里面的HTML标记，这些处理的算法被称为Character Filter(字符过滤器)，这整个的分析算法被称为Analyzer(分析器)

Analysis

Analysis：文本分析是把全文本转换一系列单词(term/token)的过程，也叫分词。Analysis是通过Analyzer来实现的。

当一个文档被索引时，每个Field都可能会创建一个倒排索引（Mapping可以设置不索引该Field）。

倒排索引的过程就是将文档通过Analyzer分成一个一个的Term,每一个Term都指向包含这个Term的文档集合。

当查询query时，Elasticsearch会根据搜索类型决定是否对query进行analyze，然后和倒排索引中的term进行相关性查询，匹配相应的文档。

Analyzer

分析器（analyzer）都由三种构件块组成的：characterfilters ， tokenizers ， token filters。

character filter 字符过滤器；在一段文本进行分词之前，先进行预处理，比如过滤html标签。

tokenizers 分词器；英文分词可以根据空格将单词分开,中文分词比较复杂,可以采用机器学习算法来分词。

Token filters Token过滤器；将切分的单词进行加工。大小写转换（例将“Quick”转为小写），去掉词（例如停用词像“a”、“and”、“the”等等），或者增加词（例如同义词像“jump”和“leap”）。

三者顺序：Character Filters--->Tokenizer--->Token Filter

三者个数：analyzer = CharFilters（0个或多个） + Tokenizer(恰好一个) + TokenFilters(0个或多个)

二、ES内置分词器

ES内置了一堆分词器，如下：

Standard Analyzer - 默认分词器，按词切分，小写处理
Simple Analyzer - 按照非字母切分(符号被过滤), 小写处理
Stop Analyzer - 小写处理，停用词过滤(the,a,is)
Whitespace Analyzer - 按照空格切分，不转小写
Keyword Analyzer - 不分词，直接将输入当作输出
Patter Analyzer - 正则表达式，默认\W+(非字符分割)
Language - 提供了30多种常见语言的分词器
Customer Analyzer 自定义分词器

这里主要说下标准Standard分词器，Simple，whitespace

`Standard Analyzer`

standard 是默认的分析器。 #标准分词器

POST _analyze
{
  "analyzer": "standard",
  "text":     "Like X 国庆放假的"
}
复制代码

结果：

{
  "tokens" : [
    {
      "token" : "like",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "x",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "国",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "庆",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "放",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "假",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "的",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}
复制代码

`Simple Analyzer`

simple 分析器当它遇到只要不是字母的字符，就将文本解析成term，而且所有的term都是小写的。

#Simple

POST _analyze
{
  "analyzer": "simple",
  "text":     "Like X 国庆放假 的"
}
复制代码

结果：

{
  "tokens" : [
    {
      "token" : "like",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "x",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "国庆放假",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 3
    }
  ]
}
复制代码

`whitespace Analyzer`

按照空格分词，英文不区分大小写，中文不分词。

#whitespace

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "Like X 国庆放假 的"
}
复制代码

结果：

{
  "tokens" : [
    {
      "token" : "Like",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "X",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "国",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "庆放假",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 4
    }
  ]
}
复制代码

三、IK中文分词器

中文的分词器现在大家比较推荐的就是 IK分词器，当然也有些其它的比如 smartCN、HanLP。

IK分词器安装

开源分词器 Ik 的github：github.com/medcl/elast…

注意 IK分词器的版本要你安装ES的版本一致，我这边是7.9.0那么就在github找到对应版本，然后重启ES。

IK有两种颗粒度的拆分：

1、ik_smart: 会做最粗粒度的拆分

POST /_analyze
{
  "text":"Howill的中华人民共和国国徽",
  "analyzer":"ik_smart"
}
复制代码

结果：

{
  "tokens" : [
    {
      "token" : "howill",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "国徽",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
复制代码

2、ik_max_word: 会将文本做最细粒度的拆分

POST _analyze
{
  "analyzer": "ik_max_word",
  "text":     "Howill的中华人民共和国国徽"
}
复制代码

结果：

{
  "tokens" : [
    {
      "token" : "howill",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中华人民",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "中华",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "华人",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "人民共和国",
      "start_offset" : 9,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "人民",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "共和国",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "共和",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "国",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "CN_CHAR",
      "position" : 10
    },
    {
      "token" : "国徽",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 11
    }
  ]
}

复制代码

不管是拼音分词器还是IK分词器，当深入搜索一条数据是时，必须是通过分词器分析的数据，才能被搜索到，否则搜索不到。

四、配置分词器

创建指定分词器的索引，索引创建之后就可以使用ik进行分词了，当你使用ES搜索的时候也会使用ik对搜索语句进行分词，进行匹配。

创建索引时设置分词器

PUT book_wills
{
  "settings":{
    "number_of_shards": "6",
    "number_of_replicas": "1",  
     //指定分词器  
    "analysis":{   
      "analyzer":{
        "ik":{
          "tokenizer":"ik_max_word"
        }
      }
    }
  },
  "mappings":{
    "novel":{
      "properties":{
        "author":{
          "type":"text"
        },
        "wordCount":{
          "type":"integer"
        },
        "publishDate":{
          "type":"date",
          "format":"yyyy-MM-dd HH:mm:ss || yyyy-MM-dd"
        },
        "briefIntroduction":{
          "type":"text"
        },
        "bookName":{
          "type":"text"
        }
      }
    }
  }
}
复制代码

使用

GET book_will/_analyze
{
  "text":"刘德华在香港唱歌",
  "analyzer": "ik_max_word"
}

{
  "tokens" : [
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "在",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "香港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "唱歌",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

复制代码

IK分词和拼音分词的组合使用

当我们创建索引时可以自定义分词器，通过指定映射去匹配自定义分词器。

PUT /my_index
{
  "settings": {
        "analysis": {
            "analyzer": {
                "ik_smart_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                },
                "ik_max_word_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : true,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true 
                }
            }
        }
  }
}
复制代码

映射analyzer属性

当我们建type时，需要在字段的analyzer属性填写自己的映射

PUT /my_index/my_type/_mapping
{
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart_pinyin"
        }
      }
    }
}
复制代码