Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation

    Elasticsearch ships with many built-in analyzers, but none of the defaults handle Chinese well, so a separate plugin has to be installed. The commonly used smartcn (related to ICTCLAS from the Chinese Academy of Sciences) and IKAnalyzer are both good choices. However, IKAnalyzer does not currently support the latest Elasticsearch 2.2.0 out of the box, while smartcn is officially supported: it provides an analyzer for Chinese or mixed Chinese and English text and already works with 2.2.0. The drawback is that smartcn does not support custom dictionaries, so it is best used for initial testing. The sections below describe how to get both working with the latest version.

smartcn

Install the plugin: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn
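
On Elasticsearch 2.x the plugin script lives under the bin directory of the installation. A minimal sketch of installing the plugin and checking that it is loaded (the installation path /opt/elasticsearch is an assumption, and the _cat/plugins call requires a running node):

cd /opt/elasticsearch

# install the smartcn analysis plugin
bin/plugin install analysis-smartcn

# restart Elasticsearch, then confirm the plugin shows up
curl 'http://127.0.0.1:9200/_cat/plugins?v'

# remove the plugin again if it is no longer needed
bin/plugin remove analysis-smartcn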

Test:

Request: POST http://127.0.0.1:9200/_analyze/

{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}
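
If you prefer to test from the command line, the same request can be sent with curl (a sketch assuming a default local node on port 9200):

curl -XPOST 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}'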

Return result:

{
    "tokens": [
        {
            "token": "联想", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "word", 
            "position": 0
        }, 
        {
            "token": "是", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "word", 
            "position": 1
        }, 
        {
            "token": "全球", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "word", 
            "position": 2
        }, 
        {
            "token": "最", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "word", 
            "position": 3
        }, 
        {
            "token": "大", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "word", 
            "position": 4
        }, 
        {
            "token": "的", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "word", 
            "position": 5
        }, 
        {
            "token": "笔记本", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "word", 
            "position": 6
        }, 
        {
            "token": "厂商", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "word", 
            "position": 7
        }
    ]
}
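
To index documents with this analyzer, it can be assigned to a field when the index is created. A minimal sketch for Elasticsearch 2.x (the index name test, type log, and field content are made up for illustration):

curl -XPUT 'http://127.0.0.1:9200/test' -d '
{
  "mappings": {
    "log": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "smartcn"
        }
      }
    }
  }
}'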

For comparison, let's look at the output of the standard analyzer; in the request, smartcn is simply replaced by standard.
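
In other words, the request body becomes:

{
  "analyzer": "standard",
  "text": "联想是全球最大的笔记本厂商"
}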

The returned result is:

{
    "tokens": [
        {
            "token": "联", 
            "start_offset": 0, 
            "end_offset": 1, 
            "type": "<IDEOGRAPHIC>", 
            "position": 0
        }, 
        {
            "token": "想", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "<IDEOGRAPHIC>", 
            "position": 1
        }, 
        {
            "token": "是", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "<IDEOGRAPHIC>", 
            "position": 2
        }, 
        {
            "token": "全", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "<IDEOGRAPHIC>", 
            "position": 3
        }, 
        {
            "token": "球", 
            "start_offset": 4, 
            "end_offset": 5, 
            "type": "<IDEOGRAPHIC>", 
            "position": 4
        }, 
        {
            "token": "最", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "<IDEOGRAPHIC>", 
            "position": 5
        }, 
        {
            "token": "大", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "<IDEOGRAPHIC>", 
            "position": 6
        }, 
        {
            "token": "的", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "<IDEOGRAPHIC>", 
            "position": 7
        }, 
        {
            "token": "笔", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "<IDEOGRAPHIC>", 
            "position": 8
        }, 
        {
            "token": "记", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "<IDEOGRAPHIC>", 
            "position": 9
        }, 
        {
            "token": "本", 
            "start_offset": 10, 
            "end_offset": 11, 
            "type": "<IDEOGRAPHIC>", 
            "position": 10
        }, 
        {
            "token": "厂", 
            "start_offset": 11, 
            "end_offset": 12, 
            "type": "<IDEOGRAPHIC>", 
            "position": 11
        }, 
        {
            "token": "商", 
            "start_offset": 12, 
            "end_offset": 13, 
            "type": "<IDEOGRAPHIC>", 
            "position": 12
        }
    ]
}

    As you can see, this output is essentially unusable: every single Chinese character becomes its own token.

This article was originally written by secisland; please credit the author and source when reprinting.

Making IKAnalyzer support version 2.2.0

    At present, the latest release on GitHub (https://github.com/medcl/elasticsearch-analysis-ik) only supports Elasticsearch 2.1.1, while Elasticsearch itself has reached 2.2.0, so the plugin needs a small adjustment before it can be used. The steps are as follows, with a combined command-line sketch after the list.

1. Download the source code and unzip it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> element and change its value to 2.2.0.

2. Build the plugin with mvn package.

3. When the build finishes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.

4. Unzip that file into the Elasticsearch plugins directory.

5. Add the following line to the Elasticsearch configuration file (elasticsearch.yml): index.analysis.analyzer.ik.type: "ik"

6. Restart Elasticsearch.
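
Put together, steps 2 through 6 might look roughly like this on the command line (the Elasticsearch installation path /opt/elasticsearch is an assumption, and the archive name depends on the plugin version declared in the pom):

cd elasticsearch-analysis-ik-master

# build the plugin after changing elasticsearch.version to 2.2.0 in pom.xml
mvn package

# unpack the generated archive into the Elasticsearch plugins directory
mkdir -p /opt/elasticsearch/plugins/ik
unzip target/releases/elasticsearch-analysis-ik-1.7.0.zip -d /opt/elasticsearch/plugins/ik

# register the ik analyzer in the configuration file
echo 'index.analysis.analyzer.ik.type: "ik"' >> /opt/elasticsearch/config/elasticsearch.yml

# restart Elasticsearch so the plugin is picked up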

Test: same as the request above, just replace the analyzer with ik.
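
That is, the request body becomes:

{
  "analyzer": "ik",
  "text": "联想是全球最大的笔记本厂商"
}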

Returned result:

{
    "tokens": [
        {
            "token": "联想", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": "全球", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": "最大", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": "笔记本", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 3
        }, 
        {
            "token": "笔记", 
            "start_offset": 8, 
            "end_offset": 10, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": "笔", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 5
        }, 
        {
            "token": "记", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "CN_CHAR", 
            "position": 6
        }, 
        {
            "token": "本厂", 
            "start_offset": 10, 
            "end_offset": 12, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": "厂商", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 8
        }
    ]
}

As you can see, the two tokenizers produce noticeably different results.

To extend the dictionary, add the required phrases to mydict.dic under config\ik\custom and then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
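
A minimal sketch of adding a phrase to the custom dictionary (the path assumes the config\ik\custom directory sits under the Elasticsearch config directory, which may differ in your installation; the word added here is the one used in the query below):

# append the new phrase; the file must stay UTF-8 without a BOM
echo "赛克蓝德" >> /opt/elasticsearch/config/ik/custom/mydict.dic

# check the encoding of the dictionary file
file /opt/elasticsearch/config/ik/custom/mydict.dic

# restart Elasticsearch so the new dictionary entry is loaded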

For example, after adding the word 赛克蓝德 (secisland), query again:

Request: POST http://127.0.0.1:9200/_analyze/

Request body:

{
  "analyzer": "ik",
  "text": "赛克蓝德是一家数据安全公司"
}

Return result:

{
    "tokens": [
        {
            "token": "赛克蓝德", 
            "start_offset": 0, 
            "end_offset": 4, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": "克", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": "蓝", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": "德", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "CN_CHAR", 
            "position": 3
        }, 
        {
            "token": "一家", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": "一", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "TYPE_CNUM", 
            "position": 5
        }, 
        {
            "token": "家", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "COUNT", 
            "position": 6
        }, 
        {
            "token": "数据", 
            "start_offset": 7, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": "安全", 
            "start_offset": 9, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 8
        }, 
        {
            "token": "公司", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 9
        }
    ]
}

From the results above, you can see that 赛克蓝德 is now recognized as a single token.

    Secisland will continue to cover the features of the latest version of Elasticsearch in future articles, so stay tuned. You are also welcome to follow the secisland public account.
