[Elasticsearch Tutorial 20] The Pinyin Tokenizer and Fixing Polyphonic Characters

1. Introduction

When developing enterprise projects, searching by pinyin is a very common requirement. For example:

  • Personnel address book: you know how a name is pronounced but are not sure which Chinese characters it uses, so you search by Chinese characters + full pinyin, Chinese characters + pinyin initials, pinyin initials only, and so on.
  • Stock names: there are too many stocks to remember every stock code, so traders often look up a stock by the pinyin initials of its name.

Medcl provides a pinyin tokenizer plugin, which makes it very convenient to search documents by pinyin.

2. Fixing the polyphonic-character error in the pinyin tokenizer

There are already plenty of blog posts about installing and using the pinyin tokenizer, so I won't repeat them here. There is, however, one important problem worth calling out. At the time of writing it still exists in the latest 8.x version. Someone has raised the issue on GitHub, but it has not been fixed yet, so we have to patch it manually.

The problem is the polyphonic character "行" in the word "bank" (银行). The pinyin tokenizer sometimes converts it to "yin xing" instead of "yin hang". If you test "Bank of China" (中国银行), the result is correct, but if you test "China Construction Bank" (中国建设银行), it is wrong. Try it yourself if you don't believe me.

To fix it, unpack the nlp-lang-1.7.jar that ships with the plugin and edit the polyphone.txt file inside it. Be careful not to replace every "yin xing" with "yin hang" in one pass.

This is because words such as “隐形” and “银杏” really are pronounced "yin xing", so the entries have to be checked and changed by hand, one at a time. I am not sure whether the problem comes from nlp-lang itself or from the author of the pinyin tokenizer: in the nlp-lang 1.7 source code, "bank" is correctly given as "yin hang".

After making the change, repackage the files as nlp-lang-1.7.jar, replace the original nlp-lang-1.7.jar with the new one, and restart ES.
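
The unpack, edit, and repackage steps can also be scripted. The following is only a minimal sketch, not the plugin author's method. It assumes that polyphone.txt is UTF-8 encoded, that each entry sits on its own line, and that a wrong entry contains both the Chinese word and the literal text "yin xing". The post does not show the file's layout, so check the real file before relying on this. The function name patch_polyphone and its parameters are hypothetical.

```python
import zipfile


def patch_polyphone(src_jar, dst_jar, word="银行", wrong="yin xing", right="yin hang"):
    """Copy src_jar to dst_jar, fixing only polyphone.txt lines that contain `word`.

    Returns the number of lines changed. Lines for other words such as
    银杏 or 隐形 are left untouched, so this is not a global replace.
    """
    changed = 0
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            # Assumption: the file is identified by its name, wherever it sits in the jar.
            if info.filename.endswith("polyphone.txt"):
                lines = data.decode("utf-8").splitlines(keepends=True)
                for i, line in enumerate(lines):
                    # Assumption: the word and its pinyin appear on the same line.
                    if word in line and wrong in line:
                        lines[i] = line.replace(wrong, right)
                        changed += 1
                data = "".join(lines).encode("utf-8")
            # Every other entry is copied unchanged.
            dst.writestr(info, data)
    return changed
```

A call such as patch_polyphone("nlp-lang-1.7.jar", "nlp-lang-1.7.patched.jar") writes a new jar and leaves the original in place as a backup. If it returns 0, the file layout differs from the assumptions above and the edit has to be done by hand.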

3. Example

The pinyin tokenizer has many configuration parameters; see the description on its GitHub page. Configure them carefully, because different settings produce different search results.

3.1 Create Mapping

  • I apply the pinyin tokenizer to the name.pinyin subfield, because the main field name keeps the "standard" analyzer, which is still useful for matching on the Chinese text itself;
  • My design suits short-text scenarios such as person-name search and stock-name search;
  • For long text such as article bodies, messages, and comments, the IK tokenizer is generally used. If you must also support pinyin search there, you can combine IK and pinyin: set the tokenizer to ik_smart and the filter to pinyin.

PUT pigg_test_pinyin
{
    "settings": {
        "analysis": {
            "analyzer": {
                "pinyin_analyzer": {
                    "tokenizer": "my_pinyin"
                }
            },
            "tokenizer": {
                "my_pinyin": {
                    "type": "pinyin",
                    "keep_first_letter": true,
                    "keep_separate_first_letter": true,
                    "keep_full_pinyin": true,
                    "remove_duplicated_term": true
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "pinyin_analyzer"
                    }
                }
            }
        }
    }
}
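
For the long-text case mentioned above (tokenizer ik_smart plus a pinyin filter), the index settings might look roughly like the following. This is a sketch under assumptions: it presumes the IK plugin is installed alongside the pinyin plugin, the names ik_pinyin_analyzer and my_pinyin_filter are made up for illustration, and the filter options are simply copied from the tokenizer settings above rather than tuned for long text. The body is built as a Python dict so that it can be printed as JSON and pasted into a PUT request for a new index.

```python
import json


def ik_pinyin_settings():
    """Index settings: IK splits the text into words, then a pinyin filter converts them."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name.
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": ["my_pinyin_filter"],
                    }
                },
                "filter": {
                    # Hypothetical filter name; "pinyin" is the filter type
                    # registered by the pinyin plugin.
                    "my_pinyin_filter": {
                        "type": "pinyin",
                        "keep_first_letter": True,
                        "keep_full_pinyin": True,
                        "remove_duplicated_term": True,
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body for use in the console.
    print(json.dumps(ik_pinyin_settings(), indent=2))
```

As with the tokenizer, different filter options give different search results, so it is worth checking the tokens with _analyze before indexing real data.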

3.2 Insert test document

PUT pigg_test_pinyin/_doc/1
{
  "name": "亚瑟王"
}

PUT pigg_test_pinyin/_doc/2
{
  "name": "鼓励王"
}

3.3 Testing Pinyin Search

Search by Chinese characters:

GET pigg_test_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": {
        "query": "瑟王",
        "operator": "and"
      }
    }
  }
}

Search by Chinese characters + full pinyin:

GET pigg_test_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": {
        "query": "亚sewang",
        "operator": "and"
      }
    }
  }
}

Search by Chinese characters + pinyin initials:

GET pigg_test_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": {
        "query": "亚sw",
        "operator": "and"
      }
    }
  }
}

Search by pinyin initials only:

GET pigg_test_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": {
        "query": "ysw",
        "operator": "and"
      }
    }
  }
}

3.4 Inspect the tokens produced by the pinyin tokenizer

Method 1:

GET pigg_test_pinyin/_termvectors/1?fields=name.pinyin

Method 2:

GET pigg_test_pinyin/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "亚瑟王"
}

Both methods show the following tokens:

y, s, w, ysw, ya, se, wang

Because I set "keep_separate_first_letter": true, the first-letter string ysw is also split into the separate tokens y, s, w.
This is why searching for "亚sw" matches the document.

4. Conclusion

As an older programmer in my 30s, I no longer need to dabble in too many technologies. It is better to focus on one or two, let go of impatience, and sharpen my skills through real project work.
Reading documents and blogs alone is not enough. Real growth comes from solving business problems hands-on.


Origin blog.csdn.net/winterking3/article/details/126900562