ElasticSearch - Simulating the automatic completion of the "Baidu" search box with the Pinyin and IK word segmenters

Table of contents

1. Automatic completion

1.1. Effect description

1.2. Install Pinyin word segmenter

1.3. Custom word segmenter

1.3.1. Why customize the word segmenter?

1.3.2. Structure of word segmenter

1.3.3. Custom word segmenter

1.3.4. Problems faced and solutions

Question

Solution

1.4. Completion suggester query

1.4.1. Basic concepts and syntax

1.4.2. Example

1.4.3. Example (Dark Horse Tourism)

a) Modify the hotel index library structure and set up a custom Pinyin word segmenter.

b) Add the suggestion field to the HotelDoc class

c) Re-import the data into the hotel index database

d) Write DSL based on JavaRestClient

1.5. Dark horse tourism case

1.5.1. Requirements

1.5.2. Front-end docking

1.5.3. Implement controller

1.5.4. Create interface and implement it.

1.5.5. Effect display


1. Automatic completion


1.1. Effect description

When the user types a character into the search box, we should suggest search terms related to that character.

For example, in Baidu, typing the keyword "byby" brings up a dropdown of related suggestions.

1.2. Install Pinyin word segmenter

To implement completion based on letters, documents need to be segmented according to Pinyin. There is a Pinyin word segmentation plugin for es on GitHub.

Address: GitHub - medcl/elasticsearch-analysis-pinyin: This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.

The installation method is the same as for the IK word segmenter and takes four steps:

1. Download and unzip.

2. Upload it to the plugin directory of es on the server.

3. Restart es.

4. Test (see the example request below).
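Since the original screenshot is not reproduced here, a minimal test sketch follows; it assumes the plugin installed correctly, in which case it registers an analyzer named pinyin:

POST /_analyze
{
  "analyzer": "pinyin",
  "text": "如家酒店"
}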

As the output shows, the Pinyin word segmenter not only converts each character into Pinyin, but also produces the first-letter abbreviation of the characters.

1.3. Custom word segmenter

1.3.1. Why customize the word segmenter?

Two problems are apparent from the test above:

1. The Pinyin word segmenter converts each individual character of a sentence into Pinyin, which is of little practical use.

2. The output contains no Chinese characters, only Pinyin. In practice, users mostly search with Chinese characters; Pinyin is just the icing on the cake, so we cannot rely on the Pinyin word segmenter alone and lose the Chinese characters.

Therefore we need some custom configuration for the Pinyin word segmenter.

1.3.2. Structure of word segmenter

If you want to customize the tokenizer, you must first understand how a tokenizer is composed in es.

A word segmenter mainly consists of the following three parts:

  1. character filter: processes special characters in the text before the tokenizer runs. For example, it can convert emoticons appearing in the text into words, such as :) => happy.
  2. tokenizer: cuts the text into terms according to certain rules. For example, "I am very happy" is cut into "I", "very", and "happy".
  3. tokenizer filter: further processes the terms produced by the tokenizer. For example, converting Chinese characters into Pinyin.
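These three parts map directly onto the slots of a custom analyzer definition in the index settings. A generic sketch using only built-in components (the html_strip character filter, the standard tokenizer, and the lowercase token filter):

PUT /example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}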

1.3.3. Custom word segmenter

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { //自定义分词器
        "my_analyzer": { //自定义分词器名称
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": { 
          "type": "pinyin",
          "keep_full_pinyin": false, 
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  }
}

  • "type": "pinyin": Specifies to use Pinyin filter for Pinyin conversion.
  • "keep_full_pinyin": false: Indicates that the complete pinyin is not retained. If set to true, the complete pinyin will be preserved.
  • "keep_joined_full_pinyin": true: indicates that the complete pinyin of the connection is retained. When set to true, if the pinyin of a word has multiple syllables, they will be concatenated together as a complete pinyin.
  • "keep_original": true: indicates that the original vocabulary is retained. When set to true, the original Chinese words will also be retained in the word segmentation results.
  • "limit_first_letter_length": 16: Limit the length of the first letter of Pinyin. The default is 16, which means only the first 16 characters of the first letter of Pinyin are retained.
  • "remove_duplicated_term": true: Indicates removing duplicate Pinyin words. If set to true, duplicate words in Pinyin results will be removed.
  • "none_chinese_pinyin_tokenize": false: Indicates whether to perform pinyin word segmentation processing for non-Chinese text. When set to false, non-Chinese text will remain as is and no Pinyin segmentation will be performed.

For example, create a test index library to test a custom tokenizer.

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { 
        "my_analyzer": { 
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": { 
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Use this index library's tokenizer for a test.
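A minimal sketch of such a test via the _analyze API (the sample text is chosen arbitrarily for illustration):

GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "如家酒店还不错"
}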

The result shows that:

1. The output contains not only Pinyin but also the Chinese word segments.

2. It also contains the joined full pinyin of each Chinese segment, as well as its first-letter abbreviation.

1.3.4. Problems faced and solutions

Question

The Pinyin word segmenter configured above cannot yet be used in a real production environment.

Imagine the following scenario:

Suppose the vocabulary contains the two words 狮子 (lion) and 虱子 (lice). They share the pinyin "shizi" and the first-letter abbreviation "sz", so when the inverted index is built with the custom Pinyin word segmenter above, the documents for both words are filed under the same entries. When a user later types "lion" into the search box and clicks search, both "lion" and "lice" are returned, which is not what we want.

Solution

Therefore, the field should use the my_analyzer tokenizer when building the inverted index, but the ik_smart tokenizer when searching.

That is, when the user enters Chinese, the search runs against Chinese terms; when the user enters Pinyin, the search runs against Pinyin terms. Even if the situation above occurs and both words are returned, the user was searching by Pinyin, and the two words genuinely share that Pinyin, so the behavior is consistent and unambiguous.

as follows:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { 
        "my_analyzer": { 
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": { 
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer" //创建倒排索引使用 my_analyzer 分词器.
        "search_analyzer": "ik_smart"  //搜索时使用 ik_smart 分词器.
      }
    }
  }
}
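With this mapping, a quick sanity check is to search with a Chinese query: because search_analyzer is ik_smart, the query is matched against Chinese terms only, so the Pinyin homophone no longer pollutes the results. A sketch (document contents assumed):

GET /test/_search
{
  "query": {
    "match": {
      "name": "狮子"
    }
  }
}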

1.4. Completion suggester query

1.4.1. Basic concepts and syntax

es provides the completion suggester query to implement auto-completion. This query matches terms that begin with the user's input and returns them.

To make completion queries efficient, es places some constraints on the field used in the document:

  1. The field participating in the completion query must be of the completion type.
  2. The content of the field is generally an array formed of multiple entries.

The query syntax is as follows:

POST /test2/_search
{
  "suggest": {
    "title_suggest": { //自定义补全名
      "text": "s",  //用户在搜索框中输入的关键字
      "completion": { // completion 是自动补全中的一种类型(最常用的)
        "field": "补全时需要查询的字段名", //这里的字段名指向的是一个数组(字段必须是 completion 类型),就是要根据数组中的字段进行查询,然后自动补全
        "skip_duplicates": true,  //如果查询时有重复的词条,是否自动跳过(true 为跳过)
        "size": 10 // 获取前 10 条结果.
      }
    }
  }
}

1.4.2. Example

Here I use an example to demonstrate the usage of completion suggester.

First create an index library (the field type participating in automatic completion must be completion).

PUT /test2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

Insert sample data (the field content is generally an array formed by multiple entries used for completion.)

POST test2/_doc
{
 "title": ["Sony", "WH-1000XM3"]
}
POST test2/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test2/_doc
{
  "title": ["Nintendo", "switch"]
}

Here we set the keyword to "s" and run the auto-completion query as follows:

POST /test2/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", 
      "completion": {
        "field": "title", 
        "skip_duplicates": true, 
        "size": 10
      }
    }
  }
}
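An abbreviated sketch of the response shape (scores and metadata omitted); all three documents contain an entry starting with "s", so all three appear as options:

{
  "suggest": {
    "title_suggest": [
      {
        "text": "s",
        "offset": 0,
        "length": 1,
        "options": [
          { "text": "SK-II" },
          { "text": "Sony" },
          { "text": "switch" }
        ]
      }
    ]
  }
}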

1.4.3. Example (Dark Horse Tourism)

Here we work through an example based on the previously implemented Dark Horse Tourism case. The implementation steps are as follows:

a) Modify the hotel index library structure and set up a custom Pinyin word segmenter.

1. Set up a custom word segmenter.

2. Modify the name and all fields of the index library (use the Pinyin word segmenter when building the inverted index, and the ik word segmenter when searching).

3. Add a new suggestion field of the completion type to the index library, using the custom word segmenter.

PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
          "type": "completion",
          "analyzer": "completion_analyzer"
      }
    }
  }
}
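A quick way to check the completion analyzer is the _analyze API: because completion_analyzer uses the keyword tokenizer, the whole input stays a single term before the Pinyin filter is applied. A sketch (sample text assumed):

GET /hotel/_analyze
{
  "analyzer": "completion_analyzer",
  "text": "如家"
}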

b) Add the suggestion field to the HotelDoc class

The suggestion field (an array holding multiple entries, represented here as a List) contains brand and business.

Ps: name and all are full-text fields and may be segmented into words, but the auto-completion entries brand and business must not be segmented, so a different word segmenter (completion_analyzer, built on the keyword tokenizer) is used for them.

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        this.suggestion = new ArrayList<>();
        suggestion.add(brand);
        suggestion.add(business);
    }
}

c) Re-import the data into the hotel index database

Delete the hotel index library and rebuild it (with the DSL from step a). Then synchronize all records from the database to es through a unit test.

    @Test
    public void testBulkDocument() throws IOException {
        //1. Fetch all hotel records from the database
        List<Hotel> hotelList = hotelService.list();
        //2. Build the bulk request
        BulkRequest request = new BulkRequest();
        //3. Prepare the parameters
        for(Hotel hotel : hotelList) {
            //convert to a document (mainly for the geo location)
            HotelDoc hotelDoc = new HotelDoc(hotel);
            String json = objectMapper.writeValueAsString(hotelDoc);
            request.add(new IndexRequest("hotel").id(hotel.getId().toString()).source(json, XContentType.JSON));
        }
        //4. Send the request
        client.bulk(request, RequestOptions.DEFAULT);
    }
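To verify that the import succeeded, a simple match_all query against the index is enough (a quick check, not part of the original test):

GET /hotel/_search
{
  "query": {
    "match_all": {}
  }
}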

d) Write DSL based on JavaRestClient

For example, content with the key "h" is automatically completed.

    @Test
    public void testSuggestion() throws IOException {
        //1. Create the request
        SearchRequest request = new SearchRequest("hotel");
        //2. Prepare the parameters
        request.source().suggest(new SuggestBuilder().addSuggestion(
            "testSuggestion",
                SuggestBuilders
                        .completionSuggestion("suggestion")
                        .prefix("h")
                        .skipDuplicates(true)
                        .size(10)
        ));
        //3. Send the request and receive the response
        SearchResponse search = client.search(request, RequestOptions.DEFAULT);
        //4. Parse the response
        handlerResponse(search);
    }

This Java code maps one-to-one onto the corresponding DSL statement.
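For reference, a sketch of the equivalent DSL, mirroring the parameters used in the Java code above:

POST /hotel/_search
{
  "suggest": {
    "testSuggestion": {
      "prefix": "h",
      "completion": {
        "field": "suggestion",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}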

The query results are processed as follows:

        //4. Process the auto-completion results
        Suggest suggest = response.getSuggest();
        if(suggest != null) {
            CompletionSuggestion suggestion = suggest.getSuggestion("testSuggestion");
            for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
                String text = option.getText().toString();
                System.out.println(text);
            }
        }

Running the test prints each completion suggestion that matches the prefix.

1.5. Dark horse tourism case

1.5.1. Requirements

First, implement the auto-completion function of the search box.

The final effect is similar to Baidu's search box: for example, when we type "byby", suggestions related to that keyword are completed automatically.

1.5.2. Front-end docking

Typing in the search box triggers a request in which the front end passes a single key parameter.

It is agreed that the response is a List whose content is all the auto-completed suggestions.

1.5.3. Implement controller

Here, @RequestParam receives the parameter passed in by the front end, and then the IHotelService interface is called for processing.

    @RequestMapping("/suggestion")
    public List<String> suggestion(@RequestParam("key") String prefix) {
        return hotelService.suggestion(prefix);
    }

1.5.4. Create interface and implement it.

Create a suggestion method in the IHotelService interface.

public interface IHotelService extends IService<Hotel> {

    PageResult search(RequestParams params);

    Map<String, List<String>> filters(RequestParams params);

    List<String> suggestion(String prefix);
}

Then implement this method in HotelService, the implementation class of IHotelService.

The implementation is basically the same as the test case written earlier. The point to note is that the completion keyword is not hard-coded but is the prefix passed in by the front end.

    @Override
    public List<String> suggestion(String prefix) {
        try {
            //1. Create the request
            SearchRequest request = new SearchRequest("hotel");
            //2. Prepare the parameters
            request.source().suggest(new SuggestBuilder().addSuggestion(
                    "mySuggestion",
                    SuggestBuilders
                            .completionSuggestion("suggestion")
                            .prefix(prefix)
                            .skipDuplicates(true)
                            .size(10)
            ));
            //3. Send the request and receive the response
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            //4. Parse the response (process the auto-completion results)
            Suggest suggest = response.getSuggest();
            List<String> suggestionList = new ArrayList<>();
            if(suggest != null) {
                CompletionSuggestion suggestion = suggest.getSuggestion("mySuggestion");
                for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
                    String text = option.getText().toString();
                    suggestionList.add(text);
                }
            }
            return suggestionList;
        } catch (IOException e) {
            System.out.println("[HotelService] auto-completion failed! prefix=" + prefix);
            e.printStackTrace();
            return null;
        }
    }

1.5.5. Effect display

Enter keywords and automatic completion will appear.


Origin blog.csdn.net/CYK_byte/article/details/133356413