1. Tokenizer plugin
ElasticSearch provides a plug-in system for segmenting text content into words. Segmentation rules generally differ from language to language, and this plug-in mechanism lets ElasticSearch integrate tokenizers for many different languages.
ElasticSearch itself does not support Chinese word segmentation, but fortunately it allows additional tokenizer plug-ins to be written and installed. The open-source Chinese tokenizer ik is very powerful: it ships with a dictionary of more than 200,000 commonly used words, which covers most everyday segmentation needs.
1.1 The role of the tokenizer plug-in
The main job of a tokenizer is to split text into units of the smallest useful granularity, which ElasticSearch then stores as terms in its index. Splitting rules differ between languages; the most common cases are Chinese word segmentation and English word segmentation.
The same text split by different tokenizers gives different results. For example, ik_max_word splits "中华人民共和国" (People's Republic of China) into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, while the standard tokenizer splits it into the individual characters 中, 华, 人, 民, 共, 和, 国. ElasticSearch uses the resulting terms to build its index, so documents can be found by matching only part of their content.
2. Common tokenizers
2.1 Introduction to tokenizers
- standard: The standard tokenizer. It handles English well: tokens are lowercased, punctuation is removed, and stop words can be filtered out; non-English text (such as Chinese) is split into single characters.
- whitespace: The whitespace tokenizer. For English it only splits on spaces and does no other processing; non-English text is not supported.
- simple: For English, the text is split on non-letter characters, tokens are lowercased, and numeric characters are dropped.
- stop: Builds on simple by also removing common English stop words (such as "the", "a"); the stop-word list can be customized to your own needs. Chinese is not supported.
- keyword: Treats the entire input as a single token without splitting the text at all; typically used for fields that require exact matching, such as zip codes and phone numbers.
- pattern: Splits the text into a set of terms using a regular expression, and those terms are then used by Elasticsearch for querying.
- snowball: The snowball analyzer, which adds a snowball (stemming) filter on top of standard; Lucene does not officially recommend it.
- language: A collection of analyzers for parsing text in specific languages; Chinese is not included.
- ik: The IK tokenizer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It adopts a unique "forward iterative fine-grained segmentation algorithm" and supports two segmentation modes: fine-grained and maximum word length. It handles English letters, numbers, and Chinese vocabulary, is compatible with Korean and Japanese characters, and also supports user-defined dictionaries (see the mapping sketch after this list). It comes with two tokenizers:
  - ik_max_word: splits the text at the finest granularity, producing as many terms as possible.
  - ik_smart: performs the coarsest-grained split; characters already consumed by one term are not reused by another.
- pinyin: Converts Chinese text into pinyin terms (full pinyin and first-letter abbreviations), so that Chinese content in Elasticsearch can be matched by pinyin input.
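To show where these tokenizers plug into an index, here is a minimal mapping sketch, assuming the ik plug-in is installed; the index and field names are made up for illustration. A common pattern is to segment with ik_max_word at index time and with ik_smart at query time.
PUT /device_index
{
  "mappings": {
    "properties": {
      "device_name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}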
2.2 Tokenizer examples
The following shows the results of different tokenizers for the same input.
Input: 栖霞站长江线14w6号断路器 (circuit breaker No. 14w6 on the Yangtze River line at Qixia station)
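Each of the results below can be reproduced with ElasticSearch's _analyze API; for example, the standard output in 2.2.1 comes from a request like the following (swap the analyzer value for ik_smart, ik_max_word or pinyin to get the other results):
POST /_analyze
{
  "analyzer": "standard",
  "text": "栖霞站长江线14w6号断路器"
}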
2.2.1 standard
{
"tokens": [
{
"token": "栖",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "霞",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "站",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "长",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "江",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "线",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "14w6",
"start_offset": 6,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "号",
"start_offset": 10,
"end_offset": 11,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "断",
"start_offset": 11,
"end_offset": 12,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "路",
"start_offset": 12,
"end_offset": 13,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "器",
"start_offset": 13,
"end_offset": 14,
"type": "<IDEOGRAPHIC>",
"position": 10
}
]
}
2.2.2 ik
ik_smart
{
"tokens": [
{
"token": "栖霞",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "站",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 1
},
{
"token": "长江",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "线",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 3
},
{
"token": "14w6",
"start_offset": 6,
"end_offset": 10,
"type": "LETTER",
"position": 4
},
{
"token": "号",
"start_offset": 10,
"end_offset": 11,
"type": "COUNT",
"position": 5
},
{
"token": "断路器",
"start_offset": 11,
"end_offset": 14,
"type": "CN_WORD",
"position": 6
}
]
}
ik_max_word
{
"tokens": [
{
"token": "栖霞",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "站长",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "长江",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "线",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 3
},
{
"token": "14w6",
"start_offset": 6,
"end_offset": 10,
"type": "LETTER",
"position": 4
},
{
"token": "14",
"start_offset": 6,
"end_offset": 8,
"type": "ARABIC",
"position": 5
},
{
"token": "w",
"start_offset": 8,
"end_offset": 9,
"type": "ENGLISH",
"position": 6
},
{
"token": "6",
"start_offset": 9,
"end_offset": 10,
"type": "ARABIC",
"position": 7
},
{
"token": "号",
"start_offset": 10,
"end_offset": 11,
"type": "COUNT",
"position": 8
},
{
"token": "断路器",
"start_offset": 11,
"end_offset": 14,
"type": "CN_WORD",
"position": 9
},
{
"token": "断路",
"start_offset": 11,
"end_offset": 13,
"type": "CN_WORD",
"position": 10
},
{
"token": "器",
"start_offset": 13,
"end_offset": 14,
"type": "CN_CHAR",
"position": 11
}
]
}
2.2.3 pinyin
{
"tokens": [
{
"token": "q",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "qi",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "栖霞站长江线14w6号断路器",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "qixiazhanzhangjiangxian14w6haoduanluqi",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "qxzzjx14w6hdlq",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "x",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "xia",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "z",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "zhan",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "zhang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 3
},
{
"token": "j",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 4
},
{
"token": "jiang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 4
},
{
"token": "xian",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 5
},
{
"token": "14",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 6
},
{
"token": "w",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 7
},
{
"token": "6",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 8
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 9
},
{
"token": "hao",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 9
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 10
},
{
"token": "duan",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 10
},
{
"token": "l",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 11
},
{
"token": "lu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 11
}
]
}
3. Custom tokenizer
Each of the three tokenizers shown above can be good enough in some scenarios, so let's look at why a custom tokenizer is still needed.
As we all know, pinyin search is very useful in a recommendation system. For example, when a user types "ls", we want to return index entries related to "ls", such as "零食 (ls, snacks)", "雷蛇 (ls, Razer)", "林书豪 (lsh, Jeremy Lin)" and so on. All of these are correct matches. But if only the pinyin tokenizer is used, entries that merely match the single letter "l" may also be hit, such as "李宁 (Li Ning)" or "兰蔻 (Lancôme)", and recommendations like that are simply unreasonable.
With the pinyin tokenizer, the input above, "栖霞站长江线14w6号断路器", also generates many single-letter index terms such as "h", "d", "l" and so on. If a user then searches for "ql", they do not want to see the "栖霞站长江线14w6号断路器" record at all, yet because the single-letter term "l" is hit, that record is returned as well.
So how do we avoid indexing these single-letter terms? The answer is to write and customize an ElasticSearch tokenizer ourselves.
3.1 Tokenizer principles
3.1.1 Tokenizer plug-in workflow
- When ElasticSearch starts, it reads each plug-in's plugins/<tokenizer>/plugin-descriptor.properties file. From this configuration file it obtains the plug-in's startup class (the class pointed to by the classname property) and initializes it.
- The startup class of a tokenizer plug-in must implement AnalysisPlugin, which guarantees that ElasticSearch can call our custom class to obtain the tokenizer objects.
- When ElasticSearch needs to segment text, it instantiates an AnalyzerProvider object. Its get() method returns our custom Analyzer object, whose tokenStream() method internally calls createComponents() to instantiate our custom Tokenizer object.
- Tokenizer is the core component of a custom tokenizer. It has 4 core methods:
  - incrementToken(): checks whether unread terms remain in the segmentation result list and sets the basic term attributes, such as the term text, length, start offset and end offset.
  - reset(): resets the internal state, loads the custom model, segments the string entered by the user, and adds the results to the segmentation result list.
  - end(): sets the final offset information when segmentation ends.
  - close(): releases the input stream and custom data.
Every time the Tokenizer object finishes segmenting a piece of user input, it goes through the four method calls above.
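To make the four-step lifecycle concrete, here is a minimal Java sketch of such a Tokenizer. It is a hypothetical class, not the plug-in's actual source: the segmentation call itself is omitted and only the cooperation of the four methods is shown.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical sketch of a custom Tokenizer; names and offsets are illustrative only.
public class MyPinyinTokenizer extends Tokenizer {

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttr = addAttribute(OffsetAttribute.class);

    // Terms produced by the (omitted) segmentation step, plus a cursor into them.
    private final List<String> terms = new ArrayList<>();
    private int index = 0;
    private int finalOffset = 0;

    @Override
    public void reset() throws IOException {
        super.reset();
        terms.clear();
        index = 0;
        // Read the whole user input and run the custom segmentation logic,
        // filling `terms` with the resulting tokens (details omitted).
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = input.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        finalOffset = sb.length();
        // terms.addAll(segment(sb.toString()));   // hypothetical segmentation call
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // Return false once every term has been emitted.
        if (index >= terms.size()) {
            return false;
        }
        clearAttributes();
        String term = terms.get(index++);
        termAttr.append(term);                      // term text
        // Offsets are illustrative; a real tokenizer tracks them during segmentation.
        offsetAttr.setOffset(0, term.length());
        return true;
    }

    @Override
    public void end() throws IOException {
        super.end();
        // Report the final offset once segmentation is finished.
        offsetAttr.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void close() throws IOException {
        super.close();
        terms.clear();
    }
}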
3.2 Tokenizer Verification
- Install the tokenizer: copy the packaged tokenizer zip file into the plugins directory of the ElasticSearch installation, then check that it is listed with the elasticsearch-plugin list command.
- Start ElasticSearch.
- Verify the tokenizer.
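A verification request might look like the following, where my_custom_analyzer is a placeholder for whatever analyzer name the custom plug-in registers; the response it returns is shown after it.
POST /_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "栖霞站长江线14w6号断路器"
}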
{
"tokens": [
{
"token": "栖霞",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "qixia",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "qx",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 2
},
{
"token": "栖霞站长江线14w6号断路器",
"start_offset": 1,
"end_offset": 14,
"type": "word",
"position": 3
},
{
"token": "qixiazhanzhangjiangxian14w6haoduanluqi",
"start_offset": 1,
"end_offset": 14,
"type": "word",
"position": 4
},
{
"token": "qxzzjx14w6hdlq",
"start_offset": 1,
"end_offset": 14,
"type": "word",
"position": 5
},
{
"token": "站",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 6
},
{
"token": "长江",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 7
},
{
"token": "zhangjiang",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 8
},
{
"token": "zj",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 9
},
{
"token": "线",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 10
},
{
"token": "14w6",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 11
},
{
"token": "号",
"start_offset": 10,
"end_offset": 11,
"type": "word",
"position": 12
},
{
"token": "断路",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 13
},
{
"token": "duanlu",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 14
},
{
"token": "dl",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 15
},
{
"token": "断路器",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 16
},
{
"token": "duanluqi",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 17
},
{
"token": "dlq",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 18
}
]
}
3.3 Source code compilation
- JDK 17, and an IDE (IntelliJ IDEA) that supports JDK 17
- The Lucene version must match the version required by the target ElasticSearch release
- The packaged tokenizer must be built against the same ElasticSearch version as the cluster it will be installed into
Source address: https://gitee.com/frank_zxd/elasticsearch-search-analyzer