[ElasticSearch] (8) - Auto-completion

Table of contents

1. Pinyin tokenizer

2. Custom analyzer

3. Auto-completion query

4. Implementing auto-completion for the hotel search box

  1. Modify the hotel mapping structure

  2. HotelDoc entity

  3. Java API for auto-completion queries


When the user types characters into the search box, we should suggest search terms related to that input, as shown in the figure:

Suggesting complete terms based on the letters the user has typed is called auto-completion.

Because the completions have to be inferred from pinyin letters, this relies on pinyin word segmentation.

1. Pinyin tokenizer

To complete terms from letters, documents must be segmented by pinyin. Conveniently, there is a pinyin analysis plugin for elasticsearch on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

Alternatively, use the GitCode mirror: mirrors/medcl/elasticsearch-analysis-pinyin

 

The installation is the same as for the IK tokenizer, in four steps:

① Unzip the plugin

② Upload it to the elasticsearch plugin directory on the virtual machine

③ Restart elasticsearch

④ Test it

For detailed installation steps, refer to the IK tokenizer installation process:

[ElasticSearch] (2) - Install elasticsearch

The test syntax is as follows:

POST /_analyze
{
  "text": "我在学习拼音分词器",  // "I am learning the pinyin tokenizer"
  "analyzer": "pinyin"
}

The result:
{
  "tokens" : [
    {
      "token" : "wo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wzxxpyfcq",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "xue",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "xi",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "pin",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "yin",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "fen",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "ci",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "qi",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 8
    }
  ]
}

2. Custom analyzer

The default pinyin tokenizer converts each Chinese character into pinyin, but what we want is a set of pinyin terms for each entry. To get that, we need to combine the pinyin tokenizer into a custom analyzer.

An analyzer in elasticsearch is composed of three parts:

  • character filters: process the text before the tokenizer, e.g. deleting or replacing characters

  • tokenizer: cuts the text into terms according to certain rules; for example, keyword emits the input unsegmented, while ik_smart performs Chinese word segmentation

  • tokenizer filters: further process the terms the tokenizer outputs, e.g. case conversion, synonym handling, pinyin conversion

When a document is analyzed, it passes through these three parts in turn.
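The three parts can also be passed inline to the _analyze API, which makes the pipeline easy to see. The sketch below uses only built-in components (the mapping rule and the test text are made up for illustration); the char filter rewrites ":)" before tokenization, and the lowercase filter runs afterwards:

```
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ ":) => happy" ]
    }
  ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "I Feel :) Today"
}
```

This should yield the terms i, feel, happy, today.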

 The syntax for declaring a custom tokenizer is as follows:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzer
        "my_analyzer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom token filter
        "py": { // filter name
          "type": "pinyin", // filter type, here pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Test:

GET /test/_analyze
{
  "text": ["The weather is really nice today"],
  "analyzer": "my_analyzer"
}


The result contains each term in its original form, as joined full pinyin, and as a first-letter abbreviation.

Summary

How to use the pinyin tokenizer?

  • ① Download the pinyin tokenizer

  • ② Unzip it into the elasticsearch plugin directory

  • ③ Restart elasticsearch

How to customize an analyzer?

  • ① When creating the index, configure it under settings; it can contain three parts:

  • ② character filter

  • ③ tokenizer

  • ④ filter

Notes on the pinyin tokenizer?

  • To avoid matching homophones, do not apply the pinyin tokenizer at search time; configure a separate search_analyzer such as ik_smart.

3. Auto-completion query

Elasticsearch provides the Completion Suggester query to implement auto-completion. This query matches terms that begin with the user's input and returns them. To keep completion queries efficient, there are constraints on the document fields involved:

  • The fields participating in a completion query must be of type completion.

  • The field content is generally an array of the terms to be completed.
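Conceptually, the suggester's behavior over such an array field can be sketched in plain Java. This is an illustration only (CompletionDemo is a hypothetical class, not how elasticsearch implements it): prefix match, skip duplicates, cap the result count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CompletionDemo {
    // Plain-Java illustration of a Completion Suggester over the entries:
    // keep entries starting with the prefix (case-insensitive),
    // skip duplicates, and cap the result at `size`.
    public static List<String> suggest(List<String> entries, String prefix, int size) {
        String p = prefix.toLowerCase();
        return entries.stream()
                .filter(e -> e.toLowerCase().startsWith(p)) // prefix match
                .distinct()                                 // skip_duplicates
                .limit(size)                                // size
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // With the example titles below and prefix "s":
        System.out.println(suggest(
                Arrays.asList("Sony", "WH-1000XM3", "SK-II", "PITERA", "Nintendo", "switch", "Sony"),
                "s", 10)); // [Sony, SK-II, switch]
    }
}
```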

For example, an index library like this:

// Create index library
PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}
Then insert the following data:

// Example data
POST test/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}

The query DSL statement is as follows:

// Auto-completion query
GET /test/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword typed by the user
      "completion": {
        "field": "title", // the completion field
        "skip_duplicates": true, // skip duplicates
        "size": 10 // return the first 10 results
      }
    }
  }
}

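For the example data above, the response nests the matches under the suggestion name. Roughly this shape (abridged; scores and metadata fields omitted, values illustrative):

```
{
  "suggest": {
    "title_suggest": [
      {
        "text": "s",
        "offset": 0,
        "length": 1,
        "options": [
          { "text": "SK-II", "_source": { "title": ["SK-II", "PITERA"] } },
          { "text": "Sony", "_source": { "title": ["Sony", "WH-1000XM3"] } },
          { "text": "switch", "_source": { "title": ["Nintendo", "switch"] } }
        ]
      }
    ]
  }
}
```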
4. Implementing auto-completion for the hotel search box

Our hotel index was created without the pinyin analyzer, so its configuration needs to change. But an index's mapping and analysis settings cannot be modified in place; the index can only be deleted and recreated.
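Recreating therefore starts by dropping the old index (this removes all hotel documents, hence the re-import step later):

```
DELETE /hotel
```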

In addition, we need a dedicated auto-completion field that gathers values such as the brand and the business district to serve as completion prompts.

So, to summarize, the things we need to do include:

  1. Create a new hotel index structure with a custom pinyin analyzer configured

  2. Change the name and all fields of the index to use the custom analyzer

  3. Add a new suggestion field of type completion to the index, also using a custom analyzer

  4. Add a suggestion field to the HotelDoc class, containing the brand and business values

  5. Re-import the data into the hotel index

1. Modify the hotel mapping structure

// Hotel index
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
          "type": "completion",
          "analyzer": "completion_analyzer"
      }
    }
  }
}
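After recreating the index, the completion analyzer can be spot-checked with _analyze (the test text here is made up):

```
GET /hotel/_analyze
{
  "text": "如家",
  "analyzer": "completion_analyzer"
}
```

The output should contain the original term plus its joined pinyin and first-letter forms.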

2. HotelDoc entity

package com.elasticsearch.hotel.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble the suggestion field
        if(this.business.contains("/")){
            // business has multiple values, need to cut 
            String[] arr = this.business.split("/");
            // add elements 
            this.suggestion = new ArrayList<>(); 
            this.suggestion.add(this.brand); 
            Collections.addAll(this.suggestion, arr); 
        }else { 
            this.suggestion = Arrays.asList(this.brand, this.business); 
        } 
    } 
}
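The suggestion-assembly branch in the constructor can be exercised in isolation. SuggestionAssembler is a hypothetical helper (and the brand/business values are made up); it mirrors the constructor's logic so the split behavior is easy to verify:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SuggestionAssembler {
    // Mirrors the suggestion-assembly logic of the HotelDoc constructor:
    // the brand is always a prompt; a business value containing "/" is
    // split so each district becomes its own prompt.
    public static List<String> buildSuggestion(String brand, String business) {
        if (business.contains("/")) {
            List<String> suggestion = new ArrayList<>();
            suggestion.add(brand);
            Collections.addAll(suggestion, business.split("/"));
            return suggestion;
        }
        return Arrays.asList(brand, business);
    }

    public static void main(String[] args) {
        System.out.println(buildSuggestion("7Days", "Xintiandi/Yu Garden")); // [7Days, Xintiandi, Yu Garden]
        System.out.println(buildSuggestion("Hilton", "The Bund"));          // [Hilton, The Bund]
    }
}
```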

3. Java API for automatic query completion

Core implementation method:

// Required imports (High Level REST Client):
// import org.elasticsearch.action.search.SearchRequest;
// import org.elasticsearch.action.search.SearchResponse;
// import org.elasticsearch.client.RequestOptions;
// import org.elasticsearch.search.suggest.Suggest;
// import org.elasticsearch.search.suggest.SuggestBuilder;
// import org.elasticsearch.search.suggest.SuggestBuilders;
// import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
// import java.io.IOException;
// import java.util.ArrayList;
// import java.util.List;

@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
            "suggestions",
            SuggestBuilders.completionSuggestion("suggestion")
                .prefix(prefix)
                .skipDuplicates(true)
                .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the result
        Suggest suggest = response.getSuggest();
        // 4.1 Get the completion result by the suggestion name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2 Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3 Collect the option texts
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            list.add(option.getText().toString());
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

 

 


Origin blog.csdn.net/a6470831/article/details/125667184