Elasticsearch (10) --- Built-in Analyzers and Chinese Analyzers
This post mainly covers: analyzer concepts, ES built-in analyzers, and ES Chinese analyzers.
1. Analyzer Concepts
1.1 Analysis and Analyzer
Analysis: text analysis is the process of converting full text into a series of words (terms/tokens), also known as tokenization. Analysis is performed by an Analyzer.
When a document is indexed, an inverted index may be created for each Field (unless the field's mapping disables indexing). Building the inverted index means running the document through an Analyzer to produce Terms; each Term then points to the set of documents that contain it.
At query time, Elasticsearch decides whether to analyze the query string depending on the query type, then looks up the relevant terms in the inverted index to match the appropriate documents.
1.2 Analyzer Composition
An analyzer is made up of three kinds of building blocks: character filters, tokenizers, and token filters.
1) Character filters
Character filters preprocess the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and mapping characters (& --> and, so I&you --> I and you).
2) Tokenizers
A tokenizer splits the text into tokens. English words can be separated by spaces, but Chinese tokenization is more complex and may rely on machine-learning algorithms.
3) Token filters
Token filters post-process the tokens produced by the tokenizer: changing case (e.g. lowercasing "Quick"), removing tokens (e.g. stop words such as "a", "and", "the"), or adding tokens (e.g. synonyms such as "jump" and "leap").
Order of application: Character Filters ---> Tokenizer ---> Token Filters
Cardinality: Analyzer = Character Filters (0 or more) + Tokenizer (exactly 1) + Token Filters (0 or more)
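The three building blocks can be tried out directly with the `_analyze` API by specifying each component ad hoc, with no index required. The sketch below combines the built-in `html_strip` character filter, `standard` tokenizer, and `lowercase` token filter:

```
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<span>Hello World</span>"
}
```

The response contains the tokens hello and world: the character filter strips the tags, the tokenizer splits on the space, and the token filter lowercases the result.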
1.3 Elasticsearch Built-in Analyzers
Standard Analyzer - the default; splits on word boundaries and lowercases
Simple Analyzer - splits on non-letter characters (symbols are discarded) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer - splits on whitespace; does not lowercase
Keyword Analyzer - no tokenization; the input is emitted unchanged as a single token
Pattern Analyzer - splits with a regular expression; defaults to \W+ (non-word characters)
Language Analyzers - provided for more than 30 common languages
Custom Analyzer - a user-defined combination of the building blocks above
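To see the difference between these analyzers, run the same text through more than one of them. For example, the keyword analyzer emits the whole input as a single token:

```
POST _analyze
{
  "analyzer": "keyword",
  "text": "Quick Brown Fox"
}
```

The response contains one token, "Quick Brown Fox", completely unchanged, whereas the standard analyzer would produce quick, brown, fox.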
1.4 Setting the Analyzer When Creating an Index
PUT new_index
{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "std_folded" #指定分词器
},
"content": {
"type": "text",
"analyzer": "whitespace" #指定分词器
}
}
}
}
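Once the index exists, the custom analyzer can be verified with the index-scoped `_analyze` API. Because std_folded includes the asciifolding filter, accented characters are reduced to their ASCII equivalents:

```
GET new_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}
```

The response contains the tokens deja and vu: lowercased by the lowercase filter and stripped of diacritics by asciifolding.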
2. ES Built-in Analyzers
This section walks through several common built-in analyzers: the Standard Analyzer, the Simple Analyzer, and the Whitespace Analyzer.
2.1 Standard Analyzer (default)
1) Example
standard is the default analyzer. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm) and works well for most languages.
POST _analyze
{
"analyzer": "standard",
"text": "Like X 国庆放假的"
}
The response contains the tokens like, x, 国, 庆, 放, 假, 的: the standard analyzer lowercases the English words and splits the Chinese text into single characters.
2) Configuration
The standard analyzer accepts the following parameters:
- max_token_length: the maximum token length; tokens longer than this are split at max_token_length intervals. Defaults to 255.
- stopwords: a predefined stop-word list such as _english_, or an array of stop words. Defaults to _none_.
- stopwords_path: the path to a file containing stop words.
PUT new_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard", #设置分词器为standard
"max_token_length": 5, #设置分词最大为5
"stopwords": "_english_" #设置过滤词
}
}
}
}
}
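To check the effect of max_token_length and the stop-word list, analyze a sentence against the configured analyzer (this assumes the index above has been created):

```
GET new_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The quickbrown fox"
}
```

The response contains the tokens quick, brown, fox: "quickbrown" (10 characters) is split at 5-character intervals, and "The" is removed as an English stop word.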
2.2 Simple Analyzer
The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and lowercases every term.
POST _analyze
{
"analyzer": "simple",
"text": "Like X 国庆放假 的"
}
The response contains the tokens like, x, 国庆放假, 的: non-letter characters trigger splits and everything is lowercased, while consecutive Chinese characters stay together in one token.
2.3 Whitespace Analyzer
The whitespace analyzer simply splits the text on whitespace and does not lowercase.
POST _analyze
{
"analyzer": "whitespace",
"text": "Like X 国庆放假 的"
}
The response contains the tokens Like, X, 国庆放假, 的: the text is split only on spaces and the original case is preserved.
3. Chinese Analyzers
For Chinese, the analyzer most often recommended today is the IK analyzer; there are also alternatives such as smartCN and HanLP. This section covers how to use IK for Chinese tokenization.
3.1 Installing the IK Analyzer
The open-source IK analyzer lives on GitHub: https://github.com/medcl/elasticsearch-analysis-ik
Note: the IK plugin version must match your ES version. My ES version is 7.1.0, so I find the matching release on GitHub and install it with:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
Note: after installing the plugin, ES must be restarted for it to take effect.
3.2 Using IK
IK offers two granularities of segmentation:
ik_smart: the coarsest-grained split
ik_max_word: the finest-grained split
1) ik_smart
GET /_analyze
{
"text":"中华人民共和国国徽",
"analyzer":"ik_smart"
}
The result contains just two coarse-grained tokens: 中华人民共和国 and 国徽.
2) ik_max_word
GET /_analyze
{
"text":"中华人民共和国国徽",
"analyzer":"ik_max_word"
}
The result contains many overlapping fine-grained tokens: 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国徽.
Reference: installing and using the elasticsearch pinyin analyzer and the IK analyzer
I believe that no matter how rough the road ahead may be, as long as you seize today, sooner or later you will taste the sweetness of life through your struggle. Seizing every minute and second of life beats idling away months and years! (15)