ES Word Segmentation (Analyzers)
Reprint link: https://www.cnblogs.com/qdhxhz/p/11585639.html
1. The concept of tokenizer
1. Analysis and Analyzer
Analysis: text analysis is the process of converting full text into a series of words (terms/tokens), also called word segmentation. Analysis is performed by an Analyzer.
When a document is indexed, an inverted index may be created for each Field (a Field can be configured in the Mapping not to be indexed).
Building the inverted index means splitting the document into individual terms via the Analyzer; each term then points to the set of documents that contain it.
When a query is executed, Elasticsearch decides, based on the search type, whether to analyze the query string first, and then matches the resulting terms against the inverted index to find the relevant documents.
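To make the inverted-index idea concrete, here is a minimal Python sketch (an illustration of the concept only, not Elasticsearch's actual implementation; a plain lowercase-and-split stands in for the Analyzer):

```python
from collections import defaultdict

# two toy documents, keyed by doc id
docs = {1: "Quick brown fox", 2: "quick red dog"}

# inverted index: term -> set of doc ids containing that term
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():  # stand-in for a real Analyzer
        index[term].add(doc_id)

print(sorted(index["quick"]))  # -> [1, 2]: both documents contain "quick"
print(sorted(index["fox"]))    # -> [1]
```

A query for the term "quick" then reduces to a lookup in this mapping, returning the set of matching documents.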
2. Analyzer composition
Analyzers are composed of three building blocks: character filters, tokenizers, and token filters.
- Character filters
Before the text is tokenized, it is preprocessed. The most common examples are stripping HTML tags (`<b>hello</b>` --> hello) and character replacement (& --> and, so I&you --> I and you).
- Tokenizer
Splits the text into tokens. English can be split on spaces, while Chinese word segmentation is more complex and may rely on dictionaries or machine learning algorithms.
- Token filters
Post-process the resulting tokens: case conversion (for example, lowercasing "Quick"), removing words (for example, stop words like "a", "and", "the"), or adding words (for example, synonyms like "jump" and "leap").
The order of the three: Character Filters --> Tokenizer --> Token Filters
The number of the three: analyzer = CharFilters (0 or more) + Tokenizer (exactly one) + TokenFilters (0 or more)
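The three-stage pipeline above can be mimicked with a small Python sketch (a toy approximation for illustration, not the Lucene implementation; the regex tag-stripping and the stop word list are simplifications chosen here):

```python
import re

STOP_WORDS = {"a", "and", "the"}

def analyze(text):
    # 1) character filters: strip HTML tags, expand '&' to ' and '
    text = re.sub(r"<[^>]+>", "", text).replace("&", " and ")
    # 2) tokenizer: split on whitespace
    tokens = text.split()
    # 3) token filters: lowercase, then drop stop words
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(analyze("<b>I&you</b> jumped over THE fence"))
# -> ['i', 'you', 'jumped', 'over', 'fence']
```

Note how each stage only sees the output of the previous one, which is exactly why the order is fixed.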
3. Elasticsearch's built-in tokenizer
Analyzer | Characteristics |
---|---|
Standard Analyzer | The default analyzer; splits on word boundaries, lowercases terms |
Simple Analyzer | Splits on non-letter characters (symbols are discarded); lowercases terms |
Stop Analyzer | Lowercases terms and filters stop words (the, a, is) |
Whitespace Analyzer | Splits on whitespace; does not lowercase |
Keyword Analyzer | No segmentation; the input is emitted unchanged as a single term |
Pattern Analyzer | Splits by regular expression, default \W+ (non-word characters) |
Language | Analyzers for more than 30 common languages |
Custom Analyzer | A user-defined analyzer |
4. Setting the analyzer when creating an index
PUT new_index
{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "std_folded" # specify the custom analyzer
},
"content": {
"type": "text",
"analyzer": "whitespace" # specify the built-in whitespace analyzer
}
}
}
}
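The `std_folded` chain above (standard tokenizer, then lowercase, then asciifolding) can be approximated in Python for illustration; this sketch uses a plain whitespace split in place of the standard tokenizer, and `unicodedata` to mimic the asciifolding filter:

```python
import unicodedata

def std_folded_like(text):
    out = []
    for token in text.split():  # stand-in for the standard tokenizer
        token = token.lower()   # lowercase token filter
        # asciifolding token filter: decompose accented characters
        # (NFD), then drop the combining marks
        token = "".join(c for c in unicodedata.normalize("NFD", token)
                        if not unicodedata.combining(c))
        out.append(token)
    return out

print(std_folded_like("Café RÉSUMÉ"))  # -> ['cafe', 'resume']
```

This is why a search for "cafe" can match a document containing "Café" when the field uses such an analyzer.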
2. ES Built-in Analyzers
Here we demonstrate three common analyzers: Standard Analyzer, Simple Analyzer, and Whitespace Analyzer.
1. Standard Analyzer (default)
1) Example
standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.
POST _analyze
{
"analyzer": "standard",
"text": "Like X 国庆放假的"
}
Result (the response screenshot from the original post is omitted).
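The key behavior of the standard analyzer on this input, lowercasing plus splitting CJK text character by character, can be approximated with the following illustrative regex sketch (an approximation for this example, not the real Unicode segmentation algorithm):

```python
import re

def standard_like(text):
    # crude approximation: runs of ASCII word characters form one token,
    # while each CJK character becomes its own token
    return [m.group().lower()
            for m in re.finditer(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", text)]

print(standard_like("Like X 国庆放假的"))
# -> ['like', 'x', '国', '庆', '放', '假', '的']
```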
2) Configuration
The standard analyzer accepts the following parameters:
max_token_length: the maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.
PUT new_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard", # base the analyzer on standard
"max_token_length": 5, # split tokens longer than 5 characters
"stopwords": "_english_" # filter English stop words
}
}
}
}
}
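The effect of `max_token_length` is that an over-long token is split at `max_token_length` intervals; a minimal sketch of that rule (the function name is ours, for illustration only):

```python
def split_long(tokens, max_len=5):
    # a token longer than max_len is chopped every max_len characters
    out = []
    for t in tokens:
        out.extend(t[i:i + max_len] for i in range(0, len(t), max_len))
    return out

print(split_long(["jumped", "over"]))  # -> ['jumpe', 'd', 'over']
```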
2. Simple Analyzer
The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases all terms.
POST _analyze
{
"analyzer": "simple",
"text": "Like X 国庆放假 的"
}
Result (the response screenshot from the original post is omitted).
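The simple analyzer's rule (split on non-letters, lowercase everything; CJK characters count as letters, so a run such as 国庆放假 stays one term) can be approximated like this (an illustrative sketch, where `[^\W\d_]` is Python's idiom for "any Unicode letter"):

```python
import re

def simple_like(text):
    # keep maximal runs of Unicode letters, lowercase them;
    # everything else (spaces, digits, symbols) acts as a separator
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

print(simple_like("Like X 国庆放假 的"))  # -> ['like', 'x', '国庆放假', '的']
```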
3. Whitespace Analyzer
The whitespace analyzer splits text on whitespace only and does not lowercase terms.
POST _analyze
{
"analyzer": "whitespace",
"text": "Like X 国庆放假 的"
}
Result (the response screenshot from the original post is omitted).
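Since the whitespace analyzer splits only on whitespace and leaves case untouched, a sketch of its behavior is essentially just Python's `str.split`:

```python
def whitespace_like(text):
    # split on whitespace only; no lowercasing, no symbol filtering
    return text.split()

print(whitespace_like("Like X 国庆放假 的"))  # -> ['Like', 'X', '国庆放假', '的']
```

Compare with the simple analyzer above: here "Like" and "X" keep their original case.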
3. Chinese Word Segmentation
For Chinese, the IK analyzer is currently the most widely recommended choice; alternatives include smartCN and HanLP.
This section covers only how to use IK for Chinese word segmentation.
1. IK tokenizer installation
The IK analyzer is open source on GitHub: download link
Note that the IK analyzer version must match the version of ES you have installed. Mine is 7.1.0, so find the corresponding release on GitHub and run:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
Result (the installation output screenshot from the original post is omitted).
Note: Elasticsearch must be restarted after installing the plugin for it to take effect.
2. Using IK
IK provides two granularities:
- ik_smart: performs the coarsest-grained split
- ik_max_word: splits the text at the finest granularity
1) ik_smart split
GET /_analyze
{
"text":"中华人民共和国国徽",
"analyzer":"ik_smart"
}
Result (the response screenshot from the original post is omitted).
2) ik_max_word split
GET /_analyze
{
"text":"中华人民共和国国徽",
"analyzer":"ik_max_word"
}
Result (the response screenshot from the original post is omitted).
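IK is dictionary driven, and the contrast between its two granularities can be illustrated very loosely with a toy sketch (this is our own hand-picked mini dictionary and greedy matching, not IK's actual algorithm): coarse segmentation keeps only one non-overlapping cover of the text, while fine segmentation emits every dictionary word it can find, including overlapping ones.

```python
# toy dictionary for illustration only
DICT = {"中华人民共和国", "中华", "人民", "共和国", "国徽"}

def smart(text):
    # greedy longest match: one non-overlapping, coarsest-grained cover,
    # in the spirit of ik_smart
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in DICT:
                out.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip a character with no dictionary match
    return out

def max_word(text):
    # emit every dictionary word found anywhere, overlapping included,
    # in the spirit of ik_max_word
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in DICT]

print(smart("中华人民共和国国徽"))
# -> ['中华人民共和国', '国徽']
print(max_word("中华人民共和国国徽"))
# -> ['中华', '中华人民共和国', '人民', '共和国', '国徽']
```

The trade-off this illustrates: ik_max_word indexes more terms (better recall when searching), while ik_smart produces fewer, longer terms (smaller index, more precise matches).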