ElasticSearch - Writing an ElasticSearch Tokenizer by Hand (with Source Code)

1. Tokenizer plugin

ElasticSearch provides a plugin system for tokenizing text content. Since tokenization rules generally differ from language to language, this plugin mechanism lets ElasticSearch integrate tokenizers for all kinds of languages cleanly.

Elasticsearch does not support Chinese word segmentation out of the box, but fortunately it allows additional tokenizer plugins to be written and installed. The open-source Chinese tokenizer ik is very capable, shipping with a dictionary of more than 200,000 common words, which covers most everyday segmentation needs.

1.1 The role of the tokenizer plug-in

The main job of the tokenizer is to split text into words at the smallest useful granularity, which ElasticSearch then stores as terms in its index. The splitting rules differ from language to language; the most common cases are Chinese word segmentation and English word segmentation.

For the same text, different tokenizers produce different splits. For example, ik_max_word splits 中华人民共和国 ("the People's Republic of China") into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和 and 国, while the standard tokenizer splits it into the single characters 中, 华, 人, 民, 共, 和 and 国. ElasticSearch can use the split words as index terms to build its index, so that partial content can be searched.
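
To make the difference concrete, the snippet below is a minimal sketch (assuming only the Lucene core/analysis jars on the classpath) that prints the tokens produced by Lucene's StandardAnalyzer, essentially the tokenizer behind ElasticSearch's standard analyzer; running it on 中华人民共和国 prints the single characters listed above.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardSplitDemo {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer splits CJK text into single characters, which is why
        // "中华人民共和国" comes out as 中 / 华 / 人 / 民 / 共 / 和 / 国.
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", "中华人民共和国")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term);
            }
            stream.end();
        }
    }
}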

2. Common tokenizers

2.1 Introduction to tokenizers

  • standard

    The standard tokenizer. It handles English well: tokens are lowercased and punctuation is removed (English stop words can also be removed when configured), while non-English text such as Chinese is split into single characters.

  • whitespace

    The whitespace tokenizer. For English it simply splits on spaces without any further processing; it has no special support for non-English text.

  • simple

    The simple analyzer. It splits text on non-letter characters and lowercases the resulting tokens; numeric characters are dropped in the process.

  • stop

    The stop analyzer. It goes one step beyond simple: on top of simple it also removes common English words (such as the, a, and so on), and the stop-word list can be customized as needed. It does not support Chinese.

  • keyword

    The keyword tokenizer. It treats the entire input as a single token without splitting the text at all; it is typically used for fields that require exact matching, such as zip codes and phone numbers.

  • pattern

    The pattern analyzer. It splits the text into terms according to a configurable regular expression, and ElasticSearch then indexes or queries those terms.

  • snowball

    The snowball analyzer. It adds a snowball (stemming) filter on top of standard; Lucene officially no longer recommends it.

  • language

    The language analyzers. A family of analyzers for parsing text in specific languages; Chinese is not among them.

  • ik

    The IK tokenizer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It uses a unique "forward iterative fine-grained segmentation algorithm" and supports two segmentation modes: fine-grained and maximum word length. It can segment English letters, digits and Chinese vocabulary, is compatible with Korean and Japanese characters, and also supports user-defined dictionaries. It ships with two tokenizers:

    • ik_max_word: splits the text at the finest granularity, producing as many words as possible
    • ik_smart: performs the coarsest-grained split; characters already consumed by one word are not reused by another word
  • pinyin

    Used for matching Chinese text in Elasticsearch by its pinyin (romanized) form.

2.2 Tokenizer examples

Below are the results produced by different tokenizers for the same input; each listing is the response of ElasticSearch's _analyze API for the corresponding analyzer.

Input: 栖霞站长江线14w6号断路器 ("circuit breaker No. 14w6 on the Changjiang (Yangtze) line at Qixia station")
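
As an illustration of how the listings below were obtained, here is a minimal sketch that calls the _analyze API with the low-level Java REST client, assuming a local node at localhost:9200 and the elasticsearch-rest-client dependency; only the analyzer name changes between the standard, ik and pinyin examples.

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class AnalyzeApiDemo {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/_analyze");
            // Change "standard" to "ik_smart", "ik_max_word" or "pinyin"
            // to reproduce the other result listings in this section.
            request.setJsonEntity("{\"analyzer\": \"standard\", \"text\": \"栖霞站长江线14w6号断路器\"}");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}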

2.2.1 standard


{
    "tokens": [
        {
            "token": "栖",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "霞",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "站",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "长",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "江",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "线",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "14w6",
            "start_offset": 6,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "号",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "断",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "路",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "器",
            "start_offset": 13,
            "end_offset": 14,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        }
    ]
}

2.2.2 ik

  • ik_smart


{
    "tokens": [
        {
            "token": "栖霞",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "站",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "长江",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "线",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "14w6",
            "start_offset": 6,
            "end_offset": 10,
            "type": "LETTER",
            "position": 4
        },
        {
            "token": "号",
            "start_offset": 10,
            "end_offset": 11,
            "type": "COUNT",
            "position": 5
        },
        {
            "token": "断路器",
            "start_offset": 11,
            "end_offset": 14,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}
  • ik_max_word


{
    "tokens": [
        {
            "token": "栖霞",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "站长",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "长江",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "线",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "14w6",
            "start_offset": 6,
            "end_offset": 10,
            "type": "LETTER",
            "position": 4
        },
        {
            "token": "14",
            "start_offset": 6,
            "end_offset": 8,
            "type": "ARABIC",
            "position": 5
        },
        {
            "token": "w",
            "start_offset": 8,
            "end_offset": 9,
            "type": "ENGLISH",
            "position": 6
        },
        {
            "token": "6",
            "start_offset": 9,
            "end_offset": 10,
            "type": "ARABIC",
            "position": 7
        },
        {
            "token": "号",
            "start_offset": 10,
            "end_offset": 11,
            "type": "COUNT",
            "position": 8
        },
        {
            "token": "断路器",
            "start_offset": 11,
            "end_offset": 14,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "断路",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "器",
            "start_offset": 13,
            "end_offset": 14,
            "type": "CN_CHAR",
            "position": 11
        }
    ]
}

2.2.3 pinyin


{
    "tokens": [
        {
            "token": "q",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "qi",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "栖霞站长江线14w6号断路器",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "qixiazhanzhangjiangxian14w6haoduanluqi",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "qxzzjx14w6hdlq",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "x",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "xia",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "z",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "zhan",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "zhang",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        },
        {
            "token": "j",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        },
        {
            "token": "jiang",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        },
        {
            "token": "xian",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 5
        },
        {
            "token": "14",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 6
        },
        {
            "token": "w",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 7
        },
        {
            "token": "6",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 8
        },
        {
            "token": "h",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 9
        },
        {
            "token": "hao",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 9
        },
        {
            "token": "d",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 10
        },
        {
            "token": "duan",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 10
        },
        {
            "token": "l",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 11
        },
        {
            "token": "lu",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 11
        }
    ]
}

3. Custom tokenizer

Each of the three tokenizers above can meet the requirements in certain scenarios. Let's look at why a custom tokenizer is still needed.

As is well known, pinyin search is essential in recommendation systems. For an input such as "ls", we want to return the index entries related to "ls", for example 零食 ("snacks", ls), 雷蛇 ("Razer", ls), 林书豪 ("Jeremy Lin", lsh), and so on. All of those are reasonable. But if only the pinyin tokenizer is used, entries that merely share the single letter "l" may also be hit, such as 李宁 ("Li Ning") and 兰蔻 ("Lancôme"), and recommendations like that simply do not make sense.

With the pinyin tokenizer, the input above, 栖霞站长江线14w6号断路器, also generates many single-letter index terms such as "h", "d" and "l". If a user then enters the query "ql", they do not want to see the 栖霞站长江线14w6号断路器 record at all, yet because the single-letter term "l" is hit, that record is returned as well.

How can we avoid indexing these single-letter terms? By writing a custom ElasticSearch tokenizer ourselves.

3.1 How the tokenizer works

3.1.1 Tokenizer plug-in workflow


  • During startup, ElasticSearch reads the plugins/<tokenizer plugin>/plugin-descriptor.properties file
  • From that configuration file it obtains the plugin's entry class, the one pointed to by the classname property, and initializes it
  • The entry class must implement AnalysisPlugin, which is what allows ElasticSearch to call our custom class and obtain the analyzer objects
  • When ElasticSearch analyzes text, it instantiates an AnalyzerProvider object; its get() method returns our custom Analyzer, whose tokenStream() method internally calls createComponents() to instantiate our custom Tokenizer
  • Tokenizer is the core component of the custom tokenizer and has 4 core methods (see the sketch after this list):
    • incrementToken(): checks whether unread terms remain in the segmentation list and sets each term's basic attributes: length, start offset, end offset, the term text, and so on
    • reset(): resets the default state, runs the custom model over the user's input string to segment it, and adds the results to the segmentation list
    • end(): sets the final offset information when segmentation ends
    • close(): releases the input stream and any custom data
  • Each time the Tokenizer finishes segmenting the user's input text, it goes through these four methods
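
The skeleton below is a minimal, illustrative sketch of such a Tokenizer, built directly on Lucene's Tokenizer base class (assuming the Lucene core jar and JDK 17, which the article already requires). The class name MyPinyinTokenizer and the segment() helper are hypothetical placeholders for the real segmentation model; it is not the code of the linked repository, just the four-method lifecycle described above.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Illustrative skeleton of a custom tokenizer; not the plugin from the linked repository.
public class MyPinyinTokenizer extends Tokenizer {

    // A segmented term plus its offsets in the original text.
    private record Term(String text, int start, int end) {}

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    private final List<Term> terms = new ArrayList<>();
    private int cursor = 0;
    private int finalOffset = 0;

    @Override
    public void reset() throws IOException {
        // Reset default state, read the user's input text and segment it.
        super.reset();
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int read;
        while ((read = input.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        terms.clear();
        cursor = 0;
        finalOffset = text.length();
        terms.addAll(segment(text.toString()));
    }

    @Override
    public boolean incrementToken() {
        // Emit the next unread term and set its basic attributes.
        if (cursor >= terms.size()) {
            return false; // no entries left in the segmentation list
        }
        clearAttributes();
        Term term = terms.get(cursor++);
        termAtt.append(term.text());
        offsetAtt.setOffset(correctOffset(term.start()), correctOffset(term.end()));
        return true;
    }

    @Override
    public void end() throws IOException {
        // Record the final offset once segmentation is finished.
        super.end();
        offsetAtt.setOffset(correctOffset(finalOffset), correctOffset(finalOffset));
    }

    @Override
    public void close() throws IOException {
        // Release the input stream and any custom data.
        super.close();
        terms.clear();
    }

    // Placeholder segmentation: whitespace splitting stands in for the real model.
    private List<Term> segment(String text) {
        List<Term> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) { i++; continue; }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) { i++; }
            result.add(new Term(text.substring(start, i), start, i));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        MyPinyinTokenizer tokenizer = new MyPinyinTokenizer();
        tokenizer.setReader(new StringReader("栖霞站 长江线 14w6号 断路器"));
        tokenizer.reset();
        CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            System.out.println(term);
        }
        tokenizer.end();
        tokenizer.close();
    }
}

Registration then happens through the plugin's entry class, the one named by classname in plugin-descriptor.properties: it extends Plugin, implements AnalysisPlugin, and overrides getTokenizers() or getAnalyzers() to map an analyzer name to a provider; the provider's Analyzer creates this Tokenizer in its createComponents() method. The exact provider constructor signatures differ between ElasticSearch versions, so the linked source repository remains the authoritative reference.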

3.2 Tokenizer Verification

  • Install the tokenizer plugin

    Copy the packaged tokenizer zip file into the plugins directory under the ElasticSearch installation directory, then run the elasticsearch-plugin list command to confirm it is listed.

  • Start ElasticSearch

  • Verify tokenizer


{
    "tokens": [
        {
            "token": "栖霞",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "qixia",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "qx",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 2
        },
        {
            "token": "栖霞站长江线14w6号断路器",
            "start_offset": 1,
            "end_offset": 14,
            "type": "word",
            "position": 3
        },
        {
            "token": "qixiazhanzhangjiangxian14w6haoduanluqi",
            "start_offset": 1,
            "end_offset": 14,
            "type": "word",
            "position": 4
        },
        {
            "token": "qxzzjx14w6hdlq",
            "start_offset": 1,
            "end_offset": 14,
            "type": "word",
            "position": 5
        },
        {
            "token": "站",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 6
        },
        {
            "token": "长江",
            "start_offset": 3,
            "end_offset": 5,
            "type": "word",
            "position": 7
        },
        {
            "token": "zhangjiang",
            "start_offset": 3,
            "end_offset": 5,
            "type": "word",
            "position": 8
        },
        {
            "token": "zj",
            "start_offset": 3,
            "end_offset": 5,
            "type": "word",
            "position": 9
        },
        {
            "token": "线",
            "start_offset": 5,
            "end_offset": 6,
            "type": "word",
            "position": 10
        },
        {
            "token": "14w6",
            "start_offset": 6,
            "end_offset": 10,
            "type": "word",
            "position": 11
        },
        {
            "token": "号",
            "start_offset": 10,
            "end_offset": 11,
            "type": "word",
            "position": 12
        },
        {
            "token": "断路",
            "start_offset": 11,
            "end_offset": 13,
            "type": "word",
            "position": 13
        },
        {
            "token": "duanlu",
            "start_offset": 11,
            "end_offset": 13,
            "type": "word",
            "position": 14
        },
        {
            "token": "dl",
            "start_offset": 11,
            "end_offset": 13,
            "type": "word",
            "position": 15
        },
        {
            "token": "断路器",
            "start_offset": 11,
            "end_offset": 14,
            "type": "word",
            "position": 16
        },
        {
            "token": "duanluqi",
            "start_offset": 11,
            "end_offset": 14,
            "type": "word",
            "position": 17
        },
        {
            "token": "dlq",
            "start_offset": 11,
            "end_offset": 14,
            "type": "word",
            "position": 18
        }
    ]
}

3.3 Source code compilation

  • JDK 17 is required, and the IDEA project must be configured to use JDK 17


  • The Lucene version must match what the target ElasticSearch version requires


  • The ElasticSearch version the tokenizer is packaged against must match the ElasticSearch version it runs on

Source address: https://gitee.com/frank_zxd/elasticsearch-search-analyzer


Original article: blog.csdn.net/zxd1435513775/article/details/127870981