IK Chinese word segmentation: IK configuration files explained and custom dictionaries

1. IK configuration files

Location of the IK configuration files: the es/plugins/ik/config directory

IKAnalyzer.cfg.xml: used to configure custom dictionaries
main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries in total; any text that matches one of these entries is segmented as a single word
quantifier.dic: units and measure words
suffix.dic: suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words

The two most important built-in IK configuration files

main.dic: contains the built-in Chinese words; text is segmented according to the words in this file
stopword.dic: contains English stop words
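To illustrate what "segmented according to the words in this file" means, here is a minimal sketch of dictionary-driven segmentation using greedy forward longest-match. This is not IK's actual algorithm (its matching and ambiguity handling are more involved); the function name segment and the tiny word sets are hypothetical stand-ins for main.dic.

```python
def segment(text, dictionary):
    """Greedy forward longest-match segmentation against a set of words.

    Illustrative only: a toy stand-in for dictionary-based segmentation,
    not the algorithm the IK plugin really uses.
    """
    # Longest entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical miniature dictionaries standing in for main.dic.
small_dict = {"中华", "人民", "共和国"}
large_dict = small_dict | {"中华人民共和国"}

print(segment("中华人民共和国", small_dict))  # ['中华', '人民', '共和国']
print(segment("中华人民共和国", large_dict))  # ['中华人民共和国']
```

The same input is split differently depending on which entries the dictionary contains, which is why adding words to the dictionary changes the tokens that end up in the index.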

Example stop words:

a the and at but

Stop words like these are generally discarded during word segmentation and are not added to the inverted index.
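The effect of stop words on the inverted index can be sketched as follows. This is a toy model under simple assumptions (whitespace tokenization, an in-memory dict), not how Elasticsearch builds its index; build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict

# The example stop words listed above.
STOP_WORDS = {"a", "the", "and", "at", "but"}


def build_inverted_index(docs, stop_words=STOP_WORDS):
    """Map each non-stop-word token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in stop_words:
                # Stop words are dropped and never reach the index.
                continue
            index[token].add(doc_id)
    return dict(index)


docs = {1: "the cat and the dog", 2: "a dog at home"}
index = build_inverted_index(docs)
print(sorted(index))  # ['cat', 'dog', 'home']
```

Because "the", "and", "a" and "at" never enter the index, a search for any of them alone matches nothing in this toy model.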

2. Custom dictionaries

(1) Build your own word dictionary: new buzzwords emerge every year, such as 网红 (internet celebrity), 蓝瘦香菇, 喊麦 and 鬼畜, and these are generally not in IK's built-in dictionary

You need to add these new words to IK's dictionary yourself

In IKAnalyzer.cfg.xml, set ext_dict to custom/mydict.dic

After adding your own words, you must restart es for them to take effect

(2) Build your own stop word dictionary: for example, particles and filler words such as 了, 的, 啥 and 么, which we may not want to index or make searchable

custom/ext_stopword.dic already contains commonly used Chinese stop words; you can add your own stop words to it, and then restart es
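The two customization steps can be sketched as a small script that appends words to the two dictionary files and writes an IKAnalyzer.cfg.xml pointing at them. This is a sketch under assumptions: the entry keys ext_dict and ext_stopwords follow the plugin's stock configuration file and should be checked against your installed version, the helper write_ik_custom_config is hypothetical, and a temporary directory stands in for es/plugins/ik/config.

```python
import tempfile
from pathlib import Path

# Assumed layout of IKAnalyzer.cfg.xml (Java properties XML format).
CFG_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">{ext_dict}</entry>
    <entry key="ext_stopwords">{ext_stopwords}</entry>
</properties>
"""


def append_lines(path, lines):
    """Append one entry per line, keeping whatever the file already holds."""
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def write_ik_custom_config(config_dir, words, stop_words):
    """Write custom dictionaries and the config file under config_dir."""
    config_dir = Path(config_dir)
    custom_dir = config_dir / "custom"
    custom_dir.mkdir(parents=True, exist_ok=True)

    append_lines(custom_dir / "mydict.dic", words)
    append_lines(custom_dir / "ext_stopword.dic", stop_words)

    cfg_text = CFG_TEMPLATE.format(
        ext_dict="custom/mydict.dic",
        ext_stopwords="custom/ext_stopword.dic",
    )
    (config_dir / "IKAnalyzer.cfg.xml").write_text(cfg_text, encoding="utf-8")
    return cfg_text


# A temporary directory stands in for the real IK config directory.
with tempfile.TemporaryDirectory() as tmp:
    cfg_text = write_ik_custom_config(
        tmp,
        words=["网红", "蓝瘦香菇", "喊麦", "鬼畜"],
        stop_words=["了", "的", "啥", "么"],
    )
    print(cfg_text)
```

The script only prepares files; as noted above, es still has to be restarted before the new words and stop words take effect.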
