ES-word segmentation

Reprint link: https://www.cnblogs.com/qdhxhz/p/11585639.html

1. The concept of tokenizers

1. Analysis and Analyzer

Analysis: Text analysis is the process of converting full text into a series of terms (tokens), which is also called word segmentation. Analysis is carried out by an Analyzer.

When a document is indexed, an inverted index may be created for each Field (the Mapping can be configured so that a Field is not indexed).

Building the inverted index means that the Analyzer splits the document into individual terms, and each term points to the set of documents that contain it.

At query time, Elasticsearch decides, based on the type of search, whether to analyze the query string, and then matches the resulting terms against the inverted index to find the corresponding documents.
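
For example, whether the query string is analyzed depends on the query type. A minimal sketch, assuming a hypothetical index my_index with a text field title: the match query runs its text through the field's analyzer, while the term query looks up the literal, unanalyzed value.

GET my_index/_search
{
  "query": {
    "match": { "title": "Quick Foxes" }
  }
}

GET my_index/_search
{
  "query": {
    "term": { "title": "Quick Foxes" }
  }
}

The match query is analyzed into the terms quick and foxes and can hit documents via the inverted index; the term query searches for the exact value "Quick Foxes", which an analyzed text field will generally not contain.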

2. Analyzer composition

Analyzers are composed of three building blocks: character filters, tokenizers, and token filters.

  1. Character filters

Before a piece of text is tokenized, it is preprocessed. The most common examples are stripping HTML tags (so that hello wrapped in markup becomes plain hello) and replacing characters, e.g. & --> and (I&you --> I and you).

  2. Tokenizer

English text can be split into words on whitespace, while Chinese word segmentation is more complicated and may rely on machine-learning algorithms.

  3. Token filters

Process the tokens produced by the tokenizer: change case (for example, lowercase "Quick"), remove tokens (for example, stop words such as "a", "and", "the"), or add tokens (for example, synonyms such as "jump" and "leap").

The order of the three: Character Filters --> Tokenizer --> Token Filters

The number of the three: analyzer = CharFilters (0 or more) + Tokenizer (exactly one) + TokenFilters (0 or more)
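
To see the three stages working together, the _analyze API can run an ad hoc pipeline. A minimal sketch (the input text here is just an illustration): the html_strip character filter removes the markup, the standard tokenizer splits the words, and the lowercase token filter normalizes case.

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Quick Brown FOXES</b>"
}

The output terms are quick, brown, foxes.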

3. Elasticsearch's built-in tokenizer

Analyzer              Features
Standard Analyzer     Default analyzer; splits on word boundaries, lowercases terms
Simple Analyzer       Splits on non-letter characters (symbols are discarded), lowercases terms
Stop Analyzer         Lowercases terms and removes stop words (the, a, is)
Whitespace Analyzer   Splits on whitespace, does not lowercase
Keyword Analyzer      No tokenization; the input is output as a single term
Pattern Analyzer      Splits with a regular expression, default \W+ (non-word characters)
Language              Analyzers for more than 30 common languages
Custom Analyzer       User-defined analyzer
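
The sections below demonstrate the standard, simple, and whitespace analyzers. As a quick sketch of two of the others: the keyword analyzer keeps the whole input as a single term, and the pattern analyzer splits on its default \W+ pattern (the sample texts are arbitrary).

POST _analyze
{
  "analyzer": "keyword",
  "text": "Like X 国庆放假的"
}

POST _analyze
{
  "analyzer": "pattern",
  "text": "Like-X 2021/10/01"
}

The first request returns the entire string as one term; the second splits on runs of non-word characters and lowercases, producing like, x, 2021, 10, 01.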

4. Set the analyzer when creating an index

PUT new_index
{
	"settings": {
		"analysis": {
			"analyzer": {
				"std_folded": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"asciifolding"
					]
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"title": {
				"type": "text",
				"analyzer": "std_folded" #指定分词器
			},
			"content": {
				"type": "text",
				"analyzer": "whitespace" #指定分词器
			}
		}
	}
}
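
Once the index exists, the index-scoped _analyze API can be used to check how a named analyzer from the settings, or a mapped field, will tokenize text (a sketch; the sample text is arbitrary):

GET new_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}

GET new_index/_analyze
{
  "field": "title",
  "text": "Déjà Vu"
}

Both requests return the terms deja and vu: the first names the analyzer directly, the second resolves it from the mapping of the title field (lowercase plus asciifolding).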

2. ES built-in tokenizer

Here are some common ones: Standard Analyzer, Simple Analyzer, and Whitespace Analyzer.

1. Standard Analyzer (default)
1) Example

standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

POST _analyze
{
  "analyzer": "standard",
  "text":     "Like X 国庆放假的"
}

Operation result: the English words are lowercased and each Chinese character becomes its own token (like / x / 国 / 庆 / 放 / 假 / 的).
2) Configuration

The standard analyzer accepts the following parameters:

max_token_length : maximum token length; defaults to 255
stopwords : a predefined stop-word list such as _english_, or an array containing a list of stop words; defaults to _none_
stopwords_path : path to a file containing stop words
PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",       #设置分词器为standard
          "max_token_length": 5,    #设置分词最大为5
          "stopwords": "_english_"  #设置过滤词
        }
      }
    }
  }
}
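
A quick way to check these settings is to run the analyzer through the index-scoped _analyze API (a sketch, assuming the index above was created; the sample sentence is arbitrary):

GET new_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The quick elasticsearch"
}

The stop word "the" is removed, "quick" is kept, and "elasticsearch", being longer than five characters, is split at five-character intervals, so the output should be quick, elast, icsea, rch.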

2. Simple Analyzer
The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases all terms.

POST _analyze
{
  "analyzer": "simple",
  "text":     "Like X 国庆放假 的"
}

Operation result: the text is split on non-letter characters and lowercased; the Chinese characters count as letters, so they stay together (like / x / 国庆放假 / 的).
3. Whitespace Analyzer

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "Like X 国庆放假 的"
}

The result shows that the text is split only on whitespace and case is preserved (Like / X / 国庆放假 / 的).

3. Chinese word segmentation

For Chinese word segmentation, the analyzer that most people recommend today is IK; there are also others such as smartCN and HanLP.

This section only covers how to use IK for Chinese word segmentation.

1. IK tokenizer installation
The IK tokenizer is open source on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Note that the IK tokenizer version must match the version of ES you have installed. Mine is 7.1.0, so find the corresponding release on GitHub and then run the install command:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

Note that you need to restart ES after installing the plugin for it to take effect.
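
After the restart, you can confirm that the plugin was loaded by listing the installed plugins:

GET _cat/plugins

The response should contain an analysis-ik entry with the version you installed.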

2. Using IK
IK provides two kinds of granularity:

  • ik_smart: performs the coarsest-grained split

  • ik_max_word: splits the text at the finest granularity

1) ik_smart split
GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_smart"
}

Operation result: ik_smart produces the coarse-grained split 中华人民共和国 / 国徽.
2) ik_max_word split

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_max_word"
}

Operation result: ik_max_word exhaustively splits the text into all the words it can find, producing many overlapping terms such as 中华人民共和国, 人民共和国, 共和国, and 国徽.
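
A common way to use the two granularities together (shown here as a sketch with a hypothetical index and field name) is to index documents with ik_max_word, so that as many terms as possible end up in the inverted index, and to analyze search input with ik_smart for coarser, more precise matching:

PUT ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}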

Origin: blog.csdn.net/qq_43288259/article/details/114934807