Continued from section 20
4. Tokenization (word segmentation)
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it encounters a whitespace character. It would split the text "Quick brown fox!" into the terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term (used for phrase queries and word-proximity queries), and the start and end character offsets of the original word each term represents (used for highlighting matched text).
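The bookkeeping described above can be sketched in Python. This is a minimal illustration of the idea (positions and character offsets per token), not Elasticsearch's actual implementation:

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace, recording each token's position and its
    start/end character offsets, as an Elasticsearch tokenizer does."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "position": position,           # used for phrase / proximity queries
            "start_offset": match.start(),  # used for highlighting
            "end_offset": match.end(),
        })
    return tokens

print(whitespace_tokenize("Quick brown fox!")[0])
# → {'token': 'Quick', 'position': 0, 'start_offset': 0, 'end_offset': 5}
```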
Elasticsearch provides many built-in tokenizers, which can be used to build custom analyzers.
Test the default standard analyzer of ES
English:
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Chinese:
POST _analyze
{
"analyzer": "standard",
"text": "pafcmall电商项目"
}
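The same _analyze requests can also be issued from code. Below is a sketch using only Python's standard library, assuming the cluster is reachable at http://localhost:9200 (the default):

```python
import json
from urllib import request

def build_analyze_request(text, analyzer="standard", host="http://localhost:9200"):
    """Build the URL and JSON body for an _analyze call."""
    body = json.dumps({"analyzer": analyzer, "text": text})
    return f"{host}/_analyze", body

def analyze(text, analyzer="standard"):
    """POST to the _analyze API and return the token strings.
    Requires a running Elasticsearch cluster."""
    url, body = build_analyze_request(text, analyzer)
    req = request.Request(url, data=body.encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]

# With a running cluster:
# print(analyze("pafcmall电商项目"))
```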
1) Install the ik tokenizer
Note: the ik plugin cannot be installed automatically with the default elasticsearch-plugin install xxx.zip. Go to https://github.com/medcl/elasticsearch-analysis-ik/releases and find the release matching your ES version.
1. Enter the plugins directory inside the es container
docker exec -it <container id> /bin/bash
2. Install wget:
yum install wget
3. Download the ik tokenizer version matching ES:
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
4. Unzip the downloaded file
1) Use unzip to decompress elasticsearch-analysis-ik-7.4.2.zip. If the unzip command is not installed, install it first:
yum install unzip
2) Extract the files into an ik directory under the plugins directory:
unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
3) Delete the archive and grant permissions on the ik directory and its files:
rm -rf *.zip
chmod -R 777 ik/
5. Confirm the tokenizer is installed
1) Enter the es container in docker
2) List the system's tokenizer plugins:
cd ../bin
elasticsearch-plugin list
6. Restart ES to make the ik tokenizer take effect
docker restart elasticsearch
2) Test the tokenizer
Using the default analyzer:
POST _analyze
{
"analyzer": "standard",
"text": "pafcmall电商项目"
}
Result:
ik_smart tokenization:
POST _analyze
{
"analyzer": "ik_smart",
"text": "pafcmall电商项目"
}
Result:
POST _analyze
{
"analyzer": "ik_smart",
"text": "我是中国人"
}
Result:
ik_max_word tokenization:
POST _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}
Result: As you can see, different tokenizers segment the same text quite differently. Therefore, you should no longer rely on the default mapping when defining an index; the mapping must be created manually, because the appropriate tokenizer has to be selected for each text field.
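For example, an index for Chinese text can be created with an explicit mapping that selects the ik analyzer. A sketch (the index and field names here are made up for illustration, and it assumes the ik plugin is installed):

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Using ik_max_word at index time and ik_smart at search time is the combination the ik plugin's documentation suggests: fine-grained terms are indexed, while queries are segmented more coarsely.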
Reference:
Getting Started with the Full-Text Search Engine Elasticsearch