Elasticsearch Analyzers

1. Concepts

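In Elasticsearch, the index analysis module is configured by registering analyzers. When a document is indexed, the analyzer extracts a number of tokens from it, and these tokens support both index storage and search. Elasticsearch ships with many built-in analyzers, tokenizers, and token filters.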

The index analysis module includes:

Analyzers

Tokenizers

Token filters

2. Analyzers (tokenizer plus token filters)

An analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters.

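①. What the tokenizer stage does:

First the text is preprocessed (for example, HTML markup is stripped). This is the job of the character filters; an analyzer may have several of them, and they run before the tokenizer.

The tokenizer then splits the string into a sequence of tokens.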

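②. What token filters do:

They further process the tokens produced by the tokenizer, for example by changing case or adding synonyms. The output of this processing is the index terms; the number of times a term occurs in a document is its term frequency.

The engine then builds an inverted index that maps each term to the documents containing it.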
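The flow described above (character filter, tokenizer, token filter, then inverted index) can be sketched in a few lines of plain Python. This is only a toy model, not how Elasticsearch is implemented: the regex-based `char_filter` and `tokenize` functions are rough stand-ins chosen for illustration, and all function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def char_filter(text):
    # Crude stand-in for an HTML-stripping character filter:
    # replace anything that looks like a tag with a space.
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    # Rough approximation of a word tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


def token_filters(tokens):
    # A single token filter here: lowercase every token.
    return [t.lower() for t in tokens]


def analyze(text):
    # Character filters run first, then the tokenizer, then token filters.
    return token_filters(tokenize(char_filter(text)))


def build_inverted_index(docs):
    # Map each term to {doc_id: term frequency in that document}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return dict(index)


docs = {1: "<p>This is my Elasticsearch</p>", 2: "my my index"}
index = build_inverted_index(docs)
print(analyze(docs[1]))  # ['this', 'is', 'my', 'elasticsearch']
print(index["my"])       # {1: 1, 2: 2}
```

Looking up `index["my"]` returns both documents together with the term frequency in each, which is the kind of information an inverted index provides at search time.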

Example:

POST http://host:9200/_analyze
{
  "analyzer": "standard",
  "text": ["this is my elasticsearch"]
}

The same request can also be written with the tokenizer and token filters spelled out:

POST http://host:9200/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["this is my elasticsearch"]
}
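For readers who prefer to call the `_analyze` endpoint from a script, here is a minimal sketch using only the Python standard library. The base URL `http://localhost:9200` is an assumption (a local node without authentication), and the helper names `build_analyze_request`, `extract_tokens` and `run` are hypothetical. The only assumption made about the response is that it carries a `tokens` array whose entries have a `token` field.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no authentication


def build_analyze_request(body, index=None):
    # Build (but do not send) a POST to /_analyze or /<index>/_analyze.
    path = f"/{index}/_analyze" if index else "/_analyze"
    return urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_json):
    # Keep only the token strings from an _analyze response.
    return [t["token"] for t in response_json["tokens"]]


def run(request):
    # Send the request; this needs a reachable Elasticsearch node.
    with urllib.request.urlopen(request) as resp:
        return extract_tokens(json.load(resp))


text = ["this is my elasticsearch"]

# Form 1: name a complete analyzer.
by_analyzer = build_analyze_request({"analyzer": "standard", "text": text})

# Form 2: spell out the tokenizer and the token filters.
by_parts = build_analyze_request(
    {"tokenizer": "standard", "filter": ["lowercase"], "text": text}
)
```

With a node running, `run(by_analyzer)` and `run(by_parts)` can be compared directly to check that both forms yield the same tokens for this text.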

3. Custom analyzers:

PUT http://host:9200/my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      # Character filters
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=>and"]
        }
      },
      # Token filters
      "filter": {
        # Name of the token filter
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      # Analyzers
      "analyzer": {
        # Name of the analyzer
        "my_analyzer": {
          # Analyzer of the custom type
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          # Use the standard tokenizer
          "tokenizer": "standard",
          # Use the lowercase token filter and the custom stop-word token filter
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

Test:

POST http://host:9200/my_index/_analyze
{
  "analyzer": "my_analyzer", 
  "text": ["The quick & brown fox"]
}

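Result:
The original screenshot of the response is not available. Given the configuration above, the returned tokens should be quick, and, brown, fox: the &_to_and character filter turns & into and, the lowercase filter turns The into the, and my_stopwords then removes the.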
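To make the order of the stages in my_analyzer easier to follow, the chain can be imitated locally in Python. This is a simplified sketch, not the real Elasticsearch behavior: the regexes standing in for `html_strip` and the `standard` tokenizer are rough approximations, and the function names are hypothetical.

```python
import re

STOPWORDS = {"the", "a"}  # same list as my_stopwords


def html_strip(text):
    # Approximation of the html_strip character filter.
    return re.sub(r"<[^>]+>", " ", text)


def ampersand_to_and(text):
    # Same mapping as the &_to_and character filter: "&" => "and".
    return text.replace("&", "and")


def standard_tokenizer(text):
    # Approximation of the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


def my_analyzer(text):
    # 1. Character filters, in the configured order.
    for char_filter in (html_strip, ampersand_to_and):
        text = char_filter(text)
    # 2. Tokenizer.
    tokens = standard_tokenizer(text)
    # 3. Token filters: lowercase first, then stop-word removal.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


print(my_analyzer("The quick & brown fox"))  # ['quick', 'and', 'brown', 'fox']
```

The order of the token filters matters here: because lowercase runs before the stop-word filter, the capitalized The is lowercased first and then matched against the stop-word list.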