Elasticsearch has built-in tokenizers, such as the standard tokenizer, the simple tokenizer, and the whitespace tokenizer. These, however, are not friendly to Chinese, our most commonly used language, and cannot segment text according to Chinese language habits.
The ik tokenizer is a mainstream Chinese tokenizer. It segments text according to a defined dictionary and also supports user-configured dictionaries, so in addition to segmenting words by common usage, we can customize the segmentation ourselves.
The ik tokenizer is distributed as a plugin package, and we connect it to ES as a plugin.
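To see the problem, try the built-in standard analyzer on a Chinese sentence in the Kibana Dev Tools console (the sentence here is the same example used throughout this article):

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "曾舒琪董事长早上好"
}
```

The standard analyzer breaks CJK text into single characters, one token per character, which makes meaningful Chinese search nearly impossible.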
1. Installation
1.1 Download
Download address: ik tokenizer address
Be careful to choose the version that matches your own ES version.
1.2 Decompression
Create a new ik folder under the plugins directory of the ES installation, and unzip the downloaded package into it.
1.3 Start
Start ES. After startup succeeds, you can see in the log that the ik plugin has been loaded.
You can also check whether the plugin is installed by listing the installed plugins.
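For example, a quick check via the _cat API in the console (or run `elasticsearch-plugin list` on the command line):

```
GET /_cat/plugins?v
```

The response should include an entry for analysis-ik.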
It works out of the box, and the installation of the ik tokenizer is now complete.
2. Use the IK tokenizer
The IK tokenizer has two segmentation modes: ik_max_word and ik_smart.
1、ik_max_word
It splits the text at the finest granularity. For example, "曾舒琪董事长早上好" ("Good morning, Chairman Zeng Shuqi") is split into 曾, 舒琪, 董事长, 董事, 长, 早上好, 早上, and 上好.
GET /_analyze
{
  "analyzer": "ik_max_word", // finest-grained segmentation
  "text": "曾舒琪董事长早上好"
}
The execution results are as follows:
{
  "tokens" : [
    {
      "token" : "曾",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "董事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "早上",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "上好",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
2、ik_smart
It performs the coarsest-grained split. For example, "曾舒琪董事长早上好" is split into 曾, 舒琪, 董事长, and 早上好.
GET /_analyze
{
  "analyzer": "ik_smart", // coarsest-grained segmentation
  "text": "曾舒琪董事长早上好"
}
The execution results are as follows:
{
  "tokens" : [
    {
      "token" : "曾",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
These are the two segmentation modes of the ik tokenizer.
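In practice, a common pattern is to combine the two modes in an index mapping: index with ik_max_word for maximum recall, and search with ik_smart for precision. A minimal sketch (the index name my_index and field content are hypothetical):

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```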
Problem
Using these two modes, we want the ik tokenizer to segment proper nouns correctly, but there is a problem: 曾舒琪 (Zeng Shuqi) is obviously a person's name, yet neither mode keeps that word together as a single token.
Solution
In fact, the ik tokenizer provides us with a series of dictionaries; we only need to add a dictionary of our own.
1. Find the xml configuration file, IKAnalyzer.cfg.xml, in the config directory.
2. Here we need to add our own dictionary. A so-called dictionary is simply a text file whose name ends with the dict suffix.
3. Here I added a dictionary named shipley_zeng.dict.
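The dictionary is registered in IKAnalyzer.cfg.xml. A minimal sketch of that file with the new dictionary added to the ext_dict entry (the stopword entry is left empty):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom word dictionaries, separated by semicolons -->
    <entry key="ext_dict">shipley_zeng.dict</entry>
    <!-- custom stopword dictionaries -->
    <entry key="ext_stopwords"></entry>
</properties>
```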
4. Where does this dictionary come from? Does it appear out of thin air? Going back to the previous directory, you can see there are many bundled dictionaries; let's open one and have a look.
Looking at main.dict, you can see that it contains a great many words, one per line. In real application development these built-in words are definitely not enough, so we need to create a dictionary of our own.
5. Create your own dictionary in the config directory, with the same name as configured above, shipley_zeng.dict. Its content is as follows; note that the file must be encoded as UTF-8.
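The file lists the custom words, one per line; here it contains the single name we want kept together:

```
曾舒琪
```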
6. After adding this dictionary, restart ES. You can see in the startup log that the dictionary we created has been loaded successfully.
7. First we use ik_max_word to query the effect at the finest granularity.
GET /_analyze
{
  "analyzer": "ik_max_word", // finest-grained segmentation
  "text": "曾舒琪董事长早上好"
}
The execution results are as follows:
{
  "tokens" : [
    {
      "token" : "曾舒琪",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "董事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "早上",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "上好",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
8. Then use ik_smart's coarsest-grained mode to see the effect.
GET /_analyze
{
  "analyzer": "ik_smart", // coarsest-grained segmentation
  "text": "曾舒琪董事长早上好"
}
The execution results are as follows:
{
  "tokens" : [
    {
      "token" : "曾舒琪",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
9. We can see that whether we use ik_max_word or ik_smart, the word 曾舒琪 is now kept together as a single token, which meets our needs.
Summary
The above covers installing the ik Chinese tokenizer for a local Elasticsearch and how to use it. I hope it is helpful to those who have just come into contact with ES. Thank you, and if you have any questions, please feel free to contact me.