Spring Boot with Elasticsearch, advanced: Chinese and pinyin word segmentation

I read a lot of articles about the ES pinyin tokenizer, but few of them were useful, so I decided to write one myself.

1. Definition

Word segmentation (analysis) happens at two points: at search time and at index time.
Search-time analysis happens when the user queries: ES segments the keywords entered by the user on the fly, and the resulting terms are kept only in memory and discarded as soon as the query ends. Index-time analysis happens when a document is written: ES segments the document and stores the resulting terms in the inverted index. The inverted index is eventually persisted to disk as files, so it is not lost when a query ends or when ES restarts.
The index-time analyzer must be specified in the mapping, and once specified it cannot be changed for that field. To change it, you must create a new index.
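
The difference between the two points in time can be sketched with a toy inverted index. This is only an illustration, not how ES is implemented: the class name, the lowercase-and-split analyzer, and the sample documents are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of index-time vs. search-time analysis (hypothetical, not ES code).
class ToyInvertedIndex {
    // term -> ids of the documents containing it; stands in for the persisted inverted index
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Hypothetical analyzer: lowercase the text, then split on whitespace.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index time: the terms are stored in the index and outlive this call.
    void index(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search time: the query terms exist only while this method runs.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.index(1, "Quick brown fox");
        idx.index(2, "Lazy brown dog");
        System.out.println(idx.search("BROWN"));
        System.out.println(idx.search("quick"));
    }
}
```

Because the stored terms were produced by the index-time analyzer, changing that analyzer later would leave old documents indexed under the old rules, which is why a new index is needed.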

In ES, word segmentation is performed by an analyzer (Analyzer), which determines the segmentation rules. ES ships with many built-in analyzers, such as
standard, english, keyword, and whitespace. The default analyzer is standard. By combining their respective functions, you can build the
segmentation rules you want. For details on each analyzer, see the analyzer section of the official Elasticsearch documentation.
In addition, there are commonly used plug-ins for Chinese word segmentation, pinyin word segmentation, and conversion between traditional and simplified Chinese. The ones widely used in China are:
elasticsearch-analysis-ik
elasticsearch-analysis-pinyin
elasticsearch-analysis-stconvert
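
A quick way to compare analyzers is the `_analyze` REST endpoint, which returns the terms an analyzer produces for a given text. The sketch below only builds the requests with the JDK 11+ HTTP client and does not send them. The base URL `http://localhost:9200`, the class name, and the sample text are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) _analyze requests for several built-in analyzers.
class AnalyzeRequestSketch {
    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Request body of the form {"analyzer":"...","text":"..."}.
    static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":" + quote(analyzer) + ",\"text\":" + quote(text) + "}";
    }

    // baseUrl is an assumption, e.g. a local single-node ES.
    static HttpRequest analyzeRequest(String baseUrl, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(analyzeBody(analyzer, text)))
                .build();
    }

    public static void main(String[] args) {
        String text = "Hello Elastic-Search";
        for (String analyzer : new String[] {"standard", "whitespace", "keyword"}) {
            HttpRequest request = analyzeRequest("http://localhost:9200", analyzer, text);
            System.out.println(request.method() + " " + request.uri() + " " + analyzeBody(analyzer, text));
        }
    }
}
```

Sending each request with `HttpClient.newHttpClient().send(...)` against a running node would show how the same text is split differently by each analyzer.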

2. Plug-in installation

(The plug-in version must match the ES version. I use ES 6.6.1; for stability I did not use the 7.x version.)

Build the downloaded project with Maven; elasticsearch-analysis-pinyin-6.6.1.zip will be generated in the project's target folder.

Copy the zip file into the ES plugins directory, unzip it, rename the extracted folder to analysis-pinyin, and restart ES.
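
ES refuses to load a plug-in built for a different version, so it can be worth checking the `elasticsearch.version` entry in the plug-in's plugin-descriptor.properties before restarting. The following is a minimal sketch of that check. The class name and the sample descriptor text are assumptions; a real descriptor contains more entries.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Compares the ES version a plug-in was built for with the version you run.
class PluginVersionCheck {
    // descriptorText is the content of plugin-descriptor.properties.
    static boolean matches(String descriptorText, String esVersion) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(descriptorText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return esVersion.equals(props.getProperty("elasticsearch.version"));
    }

    public static void main(String[] args) {
        // Illustrative descriptor content, not copied from the real plug-in.
        String descriptor = "name=analysis-pinyin\nversion=6.6.1\nelasticsearch.version=6.6.1\n";
        System.out.println(matches(descriptor, "6.6.1"));
        System.out.println(matches(descriptor, "7.0.0"));
    }
}
```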

3. Configuration 

Custom setting

Create an elasticsearch_setting.json file in the resources directory

{
  "index": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
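          // true: enable first-letter matching
          "keep_first_letter": true,
          // false: a first-letter search hits only when both first letters match; full pinyin still hits
          // true: full pinyin and first letters hit in every case
          "keep_separate_first_letter": false,
          // true: enable full pinyin, e.g. 刘德华 -> [liu,de,hua]
          "keep_full_pinyin": true,
          "keep_original": true,
          // maximum length of the first-letter result
          "limit_first_letter_length": 16,
          "lowercase": true,
          // duplicated terms are removed, e.g. 德的 -> de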
          "remove_duplicated_term": true
        }
      }
    }
  }
}
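
To get a feel for what these options do, here is a toy approximation of the terms produced for a short name. It is not the plug-in's implementation: it uses a hard-coded four-character dictionary (刘 = liu, 德 = de, 的 = de, 华 = hua, written as Unicode escapes in the code), models only four of the options, and the real plug-in may emit terms in a different order and with position information.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy approximation of pinyin term generation (hypothetical, not the plug-in's code).
class ToyPinyinTerms {
    // Hypothetical mini dictionary; the real plug-in ships a complete one.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('\u5218', "liu"); // liu
        PINYIN.put('\u5fb7', "de");  // de
        PINYIN.put('\u7684', "de");  // de (same pinyin as the previous entry)
        PINYIN.put('\u534e', "hua"); // hua
    }

    static Set<String> terms(String text, boolean keepFirstLetter, boolean keepFullPinyin,
                             boolean keepOriginal, int limitFirstLetterLength) {
        // A set drops duplicated terms, similar in spirit to remove_duplicated_term.
        Set<String> out = new LinkedHashSet<>();
        StringBuilder firstLetters = new StringBuilder();
        for (char c : text.toCharArray()) {
            String py = PINYIN.get(c);
            if (py == null) {
                continue; // characters outside the toy dictionary are skipped
            }
            if (keepFullPinyin) {
                out.add(py);
            }
            firstLetters.append(py.charAt(0));
        }
        if (keepOriginal) {
            out.add(text);
        }
        if (keepFirstLetter && firstLetters.length() > 0) {
            int end = Math.min(firstLetters.length(), limitFirstLetterLength);
            out.add(firstLetters.substring(0, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Same flags as the settings file: first letters, full pinyin, original, limit 16.
        System.out.println(terms("\u5218\u5fb7\u534e", true, true, true, 16));
        // Two characters with the same pinyin collapse into one term.
        System.out.println(terms("\u5fb7\u7684", false, true, false, 16));
    }
}
```

To see the real terms, run the `pinyin_analyzer` through the `_analyze` endpoint of the index once it has been created.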

Create an elasticsearch_mapping.json file in the resources directory

{
  "block": {
    "properties": {
      "preClosePx": {
        "type": "keyword",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "blockTypeName": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "blockId": {
        "type": "keyword",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "blockName": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Entity settings
 

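import lombok.Getter;
import lombok.Setter;
import lombok.ToString;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.Mapping;
import org.springframework.data.elasticsearch.annotations.Setting;

@ToString
@Getter
@Setter
@Mapping(mappingPath = "elasticsearch_mapping.json") // load the mapping file
@Setting(settingPath = "elasticsearch_setting.json") // load the settings file
@Document(indexName = "info", type = "block", shards = 5, replicas = 1)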
public class BlockInfoItem {
    /**
     * id
     */
    @Id
    private String id;
 
    @Field(type = FieldType.Keyword)
    private String preClosePx;
   
    @Field(type = FieldType.Text)
    private String blockTypeName;
    
    @Field(type = FieldType.Text, analyzer = "pinyin_analyzer", searchAnalyzer = "pinyin_analyzer")
    private String blockName;
 
    @Field(type = FieldType.Keyword)
    private String blockId;
}
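
Once documents are indexed, the `blockName` field can be searched with a plain match query, and the keyword is analyzed by the field's `search_analyzer`. The sketch below only builds the request path and body in plain Java so that it has no dependencies. The class name and the sample keywords are assumptions, and which keywords actually hit depends on the tokenizer options in the settings file.

```java
// Builds the path and JSON body of a match query against the "info" index.
class BlockNameSearchSketch {
    // Index name taken from the @Document annotation above.
    static final String SEARCH_PATH = "/info/_search";

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Body of the form {"query":{"match":{"<field>":"<keyword>"}}}.
    static String matchQuery(String field, String keyword) {
        return "{\"query\":{\"match\":{" + quote(field) + ":" + quote(keyword) + "}}}";
    }

    public static void main(String[] args) {
        // Hypothetical keywords: first letters and full pinyin.
        for (String keyword : new String[] {"ldh", "liudehua"}) {
            System.out.println("POST " + SEARCH_PATH + " " + matchQuery("blockName", keyword));
        }
    }
}
```

In a Spring Boot project the same query would normally be issued through a Spring Data Elasticsearch repository or template rather than by hand.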

 

Origin blog.csdn.net/CharlesYooSky/article/details/93471585