Installation and use of the ik Chinese tokenizer for a local Elasticsearch

Elasticsearch has several built-in tokenizers, such as the standard, simple, and whitespace tokenizers. But these tokenizers are not friendly to Chinese, the language we use most, and cannot segment text according to Chinese language habits.
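For example (a quick illustration, not part of the original walkthrough), asking the built-in standard analyzer to analyze a Chinese sentence simply breaks it into single characters:

GET /_analyze
{
  "analyzer": "standard",
  "text": "曾舒琪董事长早上好"
}

Every character (曾, 舒, 琪, …) comes back as its own token, which is rarely useful for Chinese search.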

The ik tokenizer is a standard Chinese tokenizer. It segments text according to its built-in dictionaries and also lets users configure dictionaries of their own, so besides segmenting words according to common usage, we can also customize the segmentation.

The ik tokenizer is distributed as a plug-in package, and we install it into ES like any other plug-in.

1. Installation

1.1 Download

Download address: the ik tokenizer's releases page (the elasticsearch-analysis-ik project on GitHub).
Be careful to choose the version that matches your own ES version.

1.2 Decompression

Create a new ik folder under the plugins directory of the ES installation, and decompress the downloaded package into it.
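As an alternative to unzipping by hand (not the method used in this article, and the zip path below is only a placeholder), the plugin can also be installed with the elasticsearch-plugin tool that ships with ES:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-ik-<version>.zip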

1.3 Start

After ES starts successfully, you can see that the ik plugin has been loaded and is running.
You can also check whether the plugin is installed with a command.
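For example, one way to do this from Kibana Dev Tools is to list the installed plugins (the original screenshot is omitted here; your command may differ):

GET /_cat/plugins?v

The ik plugin should appear in the output if the installation succeeded.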
It is ready to use out of the box, and the installation of the ik tokenizer is now complete.

2. Use the IK tokenizer

The IK tokenizer has two segmentation modes: ik_max_word and ik_smart.

1、ik_max_word

This mode splits the text at the finest granularity. For example, "曾舒琪董事长早上好" ("Good morning, Chairman Zeng Shuqi") will be split into "曾 / 舒琪 / 董事长 / 董事 / 长 / 早上好 / 早上 / 上好".

GET /_analyze
{
  "analyzer": "ik_max_word", // finest-grained segmentation
  "text": "曾舒琪董事长早上好"
}

The execution results are as follows:

{
  "tokens" : [
    {
      "token" : "曾",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "董事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "早上",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "上好",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

2、ik_smart

This mode splits the text at the coarsest granularity. For example, "曾舒琪董事长早上好" will be split into "曾 / 舒琪 / 董事长 / 早上好".

GET /_analyze
{
  "analyzer": "ik_smart", // coarsest-grained segmentation
  "text": "曾舒琪董事长早上好"
}

The execution results are as follows:

{
  "tokens" : [
    {
      "token" : "曾",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

These are the two basic usage modes of the ik tokenizer.

Problem

With both of these modes we would like the ik tokenizer to keep proper nouns together, but there is a problem: 曾舒琪 (Zeng Shuqi) is obviously a person's name, yet neither mode keeps it together as a single word.

Solution

In fact, the ik tokenizer ships with a series of dictionaries; we only need to add a dictionary of our own.

1. Find the xml configuration file (IKAnalyzer.cfg.xml) in the plugin's config directory.
2. Here we need to register our own dictionary in that file. A so-called dictionary is simply a text file whose name ends with dict.
3. Here I added a dictionary called shipley_zeng.dict.
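For reference, here is a minimal sketch of what the relevant entries in config/IKAnalyzer.cfg.xml can look like after registering the custom dictionary (the exact contents depend on your plugin version):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom dictionary files, relative to the config directory; separate multiple files with semicolons -->
    <entry key="ext_dict">shipley_zeng.dict</entry>
    <!-- custom stop-word files, left empty here -->
    <entry key="ext_stopwords"></entry>
</properties>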
4. Where does this dictionary come from? Does it appear out of thin air? If we go back to the config directory, we can see that it already contains many dictionaries; let's just open one and have a look. Looking at main.dict, you can see that it holds a large number of words. In real application development these built-in words are not always enough, so we need to create a dictionary of our own.

5. Create your own dictionary in the config directory, with the same name mentioned above, shipley_zeng.dict. Pay attention that the file is saved with UTF-8 encoding.
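For example, assuming the only word we need to add is the name used in this article, shipley_zeng.dict can contain a single entry, one word per line:

曾舒琪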
6. After adding this dictionary, restart ES. You can see that the dictionary we created has been loaded successfully.
7. Query with ik_max_word again to see the effect at the finest granularity.

GET /_analyze
{
  "analyzer": "ik_max_word", // finest-grained segmentation
  "text": "曾舒琪董事长早上好"
}

The execution results are as follows:

{
  "tokens" : [
    {
      "token" : "曾舒琪",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "舒琪",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "董事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "早上",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "上好",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

8. Query with ik_smart again to see the effect at the coarsest granularity.

GET /_analyze
{
  "analyzer": "ik_smart", // coarsest-grained segmentation
  "text": "曾舒琪董事长早上好"
}

The execution results are as follows:

{
  "tokens" : [
    {
      "token" : "曾舒琪",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "董事长",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "早上好",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

9. We can see that whether we use ik_max_word or ik_smart, 曾舒琪 is now kept together as a single word, which meets our needs.
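With the custom dictionary in place, the ik analyzers can be used in an index mapping as usual. Below is a minimal sketch (the index and field names are made up for illustration); a common pattern is to index with ik_max_word and search with ik_smart:

PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",      // used when indexing documents
        "search_analyzer": "ik_smart"   // used when analyzing search queries
      }
    }
  }
}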

Summary

The above covers the installation and use of the ik Chinese tokenizer for a local Elasticsearch. I hope it is helpful to readers who have just started working with ES. Thank you, and if you have any questions, please feel free to contact me.
