Detailed explanation of Elasticsearch tokenizers

tokenizer

Introduction

Elasticsearch splits the text of a document into keywords with complete meanings and maps each keyword to the documents that contain it, so that documents can be queried by keyword.

To segment text correctly, you need to choose an appropriate tokenizer.

default tokenizer

Introduction

Splits English text on spaces and punctuation marks and lowercases the resulting tokens. This standard analyzer is the default; it is designed for English, so it segments Chinese text character by character.

basic use

GET /_analyze

{
  "text": "月木天上",
  "analyzer": "standard"
}          
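Because the standard analyzer has no knowledge of Chinese words, it emits one token per Chinese character. The response should look roughly like this (the exact shape may vary slightly by Elasticsearch version):

```json
{
  "tokens": [
    { "token": "月", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "木", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "天", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "上", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }
  ]
}
```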


IK tokenizer

Introduction

The IK tokenizer provides two segmentation algorithms:
ik_smart: the coarsest-grained segmentation (fewest tokens)
ik_max_word: the finest-grained segmentation (most tokens)

IK tokenizer dictionary

The IK tokenizer performs word segmentation according to its dictionaries, whose files are in the config directory of the IK plugin:
main.dic: the built-in dictionary, which records all the Chinese words that ship with IK
IKAnalyzer.cfg.xml: used to configure custom dictionaries (extension dictionaries and stopword lists)
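As a sketch, IKAnalyzer.cfg.xml typically looks like the following; the file names custom.dic and stopword.dic are placeholders for your own dictionary files (one word per line):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- custom extension dictionary -->
  <entry key="ext_dict">custom.dic</entry>
  <!-- custom stopword dictionary -->
  <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```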

basic use

GET /_analyze
{
  "text":"月木天上",
  "analyzer":"ik_smart"
}


GET /_analyze
{
  "text":"月木天上",
  "analyzer":"ik_max_word"
}  
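The difference between the two algorithms is easier to see on a longer phrase that IK's dictionary recognizes, for example:

```
GET /_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_max_word"
}
```

Assuming the phrase and its sub-words are in the dictionary, ik_smart keeps "中华人民共和国" as a single token, while ik_max_word additionally emits every dictionary word it can find inside it, such as 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, and 国 (the exact token set depends on the dictionary version).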

Pinyin tokenizer

Introduction

The pinyin tokenizer can convert Chinese into the corresponding full pinyin, the first letters of the pinyin, and so on.

basic use

GET /_analyze
{
  "text":"月木天上",
  "analyzer":"pinyin"
}
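With the plugin's default settings, the pinyin tokenizer emits the full pinyin of each character plus a token made of the first letters. For "月木天上" the response should contain roughly these tokens (other response fields omitted; the exact set depends on the plugin version and its options):

```json
{
  "tokens": [
    { "token": "yue" },
    { "token": "mu" },
    { "token": "tian" },
    { "token": "shang" },
    { "token": "ymts" }
  ]
}
```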

custom tokenizer

Introduction

In real development, we often need to apply both Chinese word segmentation and pinyin segmentation to the same piece of content. In that case, we need to define a custom ik + pinyin analyzer.

Customize the tokenizer when creating the index

PUT /index_name
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ik_pinyin" : { // name of the custom analyzer
          "tokenizer":"ik_max_word", // base tokenizer
          "filter":"pinyin_filter" // token filter applied after tokenization
        }
      },
      "filter" : { // the filter runs another analyzer's logic, effectively combining two analyzers
        "pinyin_filter" : {
          "type" : "pinyin", // the pinyin token filter
          // pinyin filter options
          "keep_separate_first_letter" : false, // whether to emit the first letter of each character as a separate token
          "keep_full_pinyin" : true, // whether to emit the full pinyin of each character
          "keep_original" : true, // whether to keep the original input
          "remove_duplicated_term" : true // whether to remove duplicate terms
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "field_name_1":{
        "type": field type,
        "store": whether to store the field separately,
        "index": whether to index the field,
        "analyzer": analyzer to use
      },
      "field_name_2":{
        ...
      }
    }
  }
}
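Once the index exists, the custom analyzer can be tested with the _analyze API scoped to that index (here index_name stands for whatever name you created the index with):

```
GET /index_name/_analyze
{
  "text": "月木天上",
  "analyzer": "ik_pinyin"
}
```

This should return the IK word tokens, their pinyin forms, and the original terms, since keep_original is set to true in the filter configuration.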

Origin blog.csdn.net/m0_63040701/article/details/131757228