Elasticsearch (10) --- Built-in Analyzers and Chinese Analyzers

This blog mainly covers: analyzer concepts, ES built-in analyzers, and ES Chinese analyzers.

First, Analyzer Concepts

1, Analysis and Analyzer

Analysis: text analysis is the process of converting full text into a series of words (terms/tokens), also known as tokenization. Analysis is carried out by an Analyzer.

When a document is indexed, each Field may have an inverted index created for it (indexing can be disabled per Field in the Mapping).

Building the inverted index means splitting a document into Terms through the Analyzer; each Term then points to the collection of documents that contain it.

At query time, Elasticsearch decides, based on the type of search, whether to analyze the query string, and then looks up the resulting terms in the inverted index to match the appropriate documents, as the example below shows.
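
A minimal sketch of that difference (assuming an index like the new_index defined later in this post, whose title field uses a lowercasing analyzer): a match query analyzes its input, while a term query looks up the exact term without analysis.

GET new_index/_search
{
  "query": {
    "match": { "title": "Quick Foxes" }  # analyzed into [quick, foxes] before matching
  }
}

GET new_index/_search
{
  "query": {
    "term": { "title": "Quick" }  # not analyzed; the capitalized "Quick" won't match the lowercased index
  }
}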

2, Analyzer composition

An Analyzer is made up of three kinds of building blocks: character filters, tokenizers, and token filters.

1) Character filters

Before a piece of text is tokenized, it is preprocessed. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing characters, e.g. & --> and (I&you --> I and you).

2) Tokenizers

English text can be tokenized by splitting on spaces; tokenizing Chinese is more complex and may rely on machine learning algorithms.

3) Token filters

Token filters post-process the tokens produced by the tokenizer: case conversion (e.g. lowercasing "Quick"), removing words (e.g. stop words such as "a", "and", "the"), or adding words (e.g. synonyms such as "jump" and "leap").

Order of the three: Character Filters ---> Tokenizer ---> Token Filters

Count of the three: Analyzer = Character Filters (0 or more) + Tokenizer (exactly one) + Token Filters (0 or more), as the sketch below shows.
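
The _analyze API can exercise all three stages in one request. A minimal sketch (the sample text and tokens are my own illustration; I'd expect roughly [i, you, are, quick]):

POST _analyze
{
  "char_filter": ["html_strip"],  # character filter: strip HTML tags first
  "tokenizer": "standard",        # exactly one tokenizer
  "filter": ["lowercase"],        # token filter: lowercase every token
  "text": "<b>I&you are Quick</b>"
}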

3, Elasticsearch built-in analyzers

  • Standard Analyzer - the default analyzer; splits on word boundaries and lowercases

  • Simple Analyzer - splits on non-letter characters (symbols are filtered out) and lowercases

  • Stop Analyzer - lowercases and filters stop words (the, a, is)

  • Whitespace Analyzer - splits on whitespace and does not lowercase

  • Keyword Analyzer - no tokenization; the input is output directly as a single term

  • Pattern Analyzer - regular-expression tokenization; defaults to \W+ (splits on non-word characters)

  • Language - provides analyzers for 30+ common languages

  • Custom Analyzer - a user-defined analyzer

4, Setting the analyzer when creating an index

PUT new_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "std_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "std_folded" #指定分词器
            },
            "content": {
                "type": "text",
                "analyzer": "whitespace" #指定分词器
            }
        }
    }
}
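
To verify the custom analyzer, call _analyze against the index (a quick sketch; the sample text is my own, and std_folded should lowercase and strip accents, e.g. "Déjà Vu" --> [deja, vu]):

POST new_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}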


Second, ES Built-in Analyzers

Here we look at a few common analyzers: Standard Analyzer, Simple Analyzer, and Whitespace Analyzer.

1, Standard Analyzer (default)

1) Example

standard is the default analyzer. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm) and works well for most languages.

POST _analyze
{
  "analyzer": "standard",
  "text":     "Like X 国庆放假的"
}

Result: like / x / 国 / 庆 / 放 / 假 / 的 - the standard analyzer lowercases tokens and splits Chinese into single characters.

2) Configuration

The standard analyzer accepts the following parameters:

  • max_token_length: the maximum token length; tokens longer than this are split. Defaults to 255.
  • stopwords: a predefined stop word list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: the path to a file containing stop words.
PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",       #设置分词器为standard
          "max_token_length": 5,    #设置分词最大为5
          "stopwords": "_english_"  #设置过滤词
        }
      }
    }
  }
}
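
A quick test of this configuration (a sketch with my own sample text; "The" should be dropped as a stop word, "Brown-Foxes" split and lowercased, and any token longer than 5 characters would be cut at 5-character intervals), expecting roughly [2, quick, brown, foxes]:

POST new_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes"
}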

2, Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases all terms.

POST _analyze
{
  "analyzer": "simple",
  "text":     "Like X 国庆放假 的"
}

Result: like / x / 国庆放假 / 的 - splitting happens at non-letter characters and everything is lowercased; the contiguous Chinese characters stay together as one term.

3, Whitespace Analyzer

The whitespace analyzer splits text wherever it finds whitespace and does not lowercase the terms.

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "Like X 国庆放假 的"
}

Result: Like / X / 国庆放假 / 的 - the text is split on spaces only and case is preserved.


Third, Chinese Analyzers

For Chinese tokenization, the most commonly recommended choice today is the IK analyzer; there are of course others, such as smartCN and HanLP.

Here I'll cover how to use IK as the Chinese analyzer.

1, Installing the IK analyzer

The open-source IK analyzer on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Note: the version of IK you install must match your ES version. My ES version is 7.1.0, so I find the corresponding release on GitHub and run the install command:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

Note: after installing the plugin, you must restart ES for it to take effect.
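
To confirm that the plugin is actually loaded, either of these checks works:

./bin/elasticsearch-plugin list

GET /_cat/plugins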

2, Using IK

IK offers two granularities of tokenization:

ik_smart: performs the coarsest-grained split.

ik_max_word: performs the finest-grained split of the text.

1) ik_smart Split

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_smart"
}

Result: the coarsest-grained split, roughly 中华人民共和国 / 国徽.

2) ik_max_word Split

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_max_word"
}

Result: ik_max_word enumerates every dictionary word it can find in the text, producing many overlapping tokens such as 中华人民共和国, 中华人民, 人民, 共和国, and 国徽.
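
A common pattern with IK (a sketch using a hypothetical index ik_index) is to index with ik_max_word for maximum recall and to search with ik_smart for cleaner matches:

PUT ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",      # index-time: fine-grained
        "search_analyzer": "ik_smart"   # query-time: coarse-grained
      }
    }
  }
}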






I believe that no matter how bumpy the road ahead may be, as long as I seize today, sooner or later I will taste the sweetness of life in the struggle. Seizing every minute and second of life beats idling away a month and a year! (15)

