Table of contents
1. Pinyin tokenizer
2. Custom tokenizer
3. Autocomplete query
4. Implementing autocompletion for the hotel search box
When the user types characters into the search box, we should suggest search terms related to that input, as shown in the figure:
This feature of suggesting complete entries based on the letters the user has typed is called autocompletion.
Because the completions must be inferred from pinyin letters, we use the pinyin analysis plugin.
1. Pinyin tokenizer
To complete based on letters, documents must be tokenized by pinyin. There happens to be a pinyin analysis plugin for Elasticsearch on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin
(a mirror is available on GitCode at mirrors/medcl/elasticsearch-analysis-pinyin)
The installation method is the same as for the IK tokenizer, in four steps:
① Unzip the plugin
② Upload it to the plugin directory of Elasticsearch on the virtual machine
③ Restart Elasticsearch
④ Test
For detailed installation steps, refer to the IK tokenizer installation process:
[ElasticSearch] (2) - Install elasticsearch
The test syntax is as follows (the sample text "我在学习拼音分词器" means "I am learning the pinyin tokenizer"; it is kept in Chinese so the tokens below match):
POST /_analyze
{
  "text": "我在学习拼音分词器",
  "analyzer": "pinyin"
}
The result is as follows:
{
"tokens" : [
{
"token" : "wo",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "wzxxpyfcq",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "zai",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "xue",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
},
{
"token" : "xi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 3
},
{
"token" : "pin",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 4
},
{
"token" : "yin",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 5
},
{
"token" : "fen",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 6
},
{
"token" : "ci",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 7
},
{
"token" : "qi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 8
}
]
}
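Notice the wzxxpyfcq token above: besides the full pinyin of each character, the pinyin filter also emits a first-letter abbreviation of the whole phrase. Conceptually, that abbreviation is just the first letter of each syllable joined together, capped by the limit_first_letter_length option. A minimal sketch of the idea (an illustration only, not the plugin's actual implementation):

```java
import java.util.List;

public class FirstLetterSketch {
    // Join the first letter of each pinyin syllable, capped at maxLen
    // (mirrors the plugin's limit_first_letter_length option).
    static String firstLetters(List<String> syllables, int maxLen) {
        StringBuilder sb = new StringBuilder();
        for (String s : syllables) {
            if (sb.length() >= maxLen) break;
            if (!s.isEmpty()) sb.append(s.charAt(0));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> syllables =
                List.of("wo", "zai", "xue", "xi", "pin", "yin", "fen", "ci", "qi");
        System.out.println(firstLetters(syllables, 16)); // wzxxpyfcq
    }
}
```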
2. Custom tokenizer
The default pinyin tokenizer converts each Chinese character into pinyin separately, but what we want is a pinyin group formed for each term. We need to customize the pinyin tokenizer's behavior by defining a custom analyzer.
An analyzer in Elasticsearch consists of three parts:
character filters: process the text before the tokenizer, e.g. deleting or replacing characters
tokenizer: cuts the text into terms according to certain rules; for example, keyword treats the whole input as a single term, and ik_smart is another option
tokenizer filters: further process the terms output by the tokenizer, e.g. case conversion, synonym handling, pinyin conversion
When a document is analyzed, it is processed by these three parts in turn:
The syntax for declaring a custom analyzer is as follows:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzer
        "my_analyzer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom tokenizer filter
        "py": { // filter name
          "type": "pinyin", // filter type, here pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
Test:
GET /test/_analyze
{
  "text": ["The weather is really nice today"],
  "analyzer": "my_analyzer"
}
Result: (screenshot omitted)
Summary
How to use the pinyin tokenizer?
① Download the pinyin tokenizer plugin
② Unzip it into the plugin directory of Elasticsearch
③ Restart Elasticsearch
How to customize an analyzer?
When creating an index library, configure it under settings; it can contain three parts:
① character filters
② tokenizer
③ tokenizer filters
Precautions for the pinyin tokenizer?
To avoid matching homophones, do not apply the pinyin tokenizer at search time; set search_analyzer to a non-pinyin analyzer such as ik_smart.
3. Autocomplete query
Elasticsearch provides the Completion Suggester query to implement autocompletion. This query matches terms that begin with the user's input and returns them. To improve the efficiency of completion queries, there are some constraints on the field types in documents:
- Fields participating in a completion query must be of the completion type.
- The field content is generally an array of the terms to be completed.
For example, an index library like this:
// create the index library
PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}
Then insert the following data:
// sample data
POST test/_doc
{
"title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
"title": ["SK-II", "PITERA"]
}
POST test/_doc
{
"title": ["Nintendo", "switch"]
}
The query DSL statement is as follows:
// autocompletion query
GET /test/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword typed so far
      "completion": {
        "field": "title", // the completion field to query
        "skip_duplicates": true, // skip duplicates
        "size": 10 // get the first 10 results
      }
    }
  }
}
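For reference, the matches come back under the suggest key of the response, keyed by the name we gave (title_suggest); each option carries the matched text. A sketch of the response shape against the sample data above (abbreviated; scores and metadata are illustrative):

```json
{
  "suggest": {
    "title_suggest": [
      {
        "text": "s",
        "offset": 0,
        "length": 1,
        "options": [
          { "text": "SK-II", "_index": "test", "_score": 1.0 },
          { "text": "Sony", "_index": "test", "_score": 1.0 },
          { "text": "switch", "_index": "test", "_score": 1.0 }
        ]
      }
    ]
  }
}
```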
4. Implementing autocompletion for the hotel search box
Our hotel index library does not yet have a pinyin tokenizer configured, so we need to modify its configuration. But the mapping of an existing index library cannot be modified; the library can only be deleted and recreated.
In addition, we need to add a field for autocompletion and put the brand, business district, city, etc. into it as autocompletion prompts.
To summarize, the things we need to do include:
① Create a new hotel index library structure with a custom pinyin analyzer
② Have the name and all fields of the index library use the custom analyzer
③ Add a new suggestion field of type completion to the index library, using the custom analyzer
④ Add a suggestion field to the HotelDoc class, containing brand and business
⑤ Re-import the data into the hotel index library
1. Modify the hotel mapping structure
// hotel data index library
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}
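Before re-importing data, you can sanity-check the new analyzer with the _analyze API. A quick sketch (the sample text is an arbitrary Chinese term chosen only for illustration; given keep_original, keep_joined_full_pinyin and the first-letter limit, the response should contain the original term plus its joined pinyin and first-letter form):

```json
GET /hotel/_analyze
{
  "text": "如家",
  "analyzer": "completion_analyzer"
}
```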
2. HotelDoc entity
package com.elasticsearch.hotel.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // assemble the suggestion field
        if (this.business.contains("/")) {
            // business holds multiple values and needs to be split
            String[] arr = this.business.split("/");
            // add the elements
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
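The suggestion-assembly logic in the constructor is worth checking on its own: a brand plus a business value such as "A/B/C" should yield [brand, A, B, C]. A standalone sketch of just that logic (the method name and sample values are mine, not from the project):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SuggestionAssembly {
    // Build the suggestion list from brand and business,
    // splitting business on "/" when it holds multiple values.
    static List<String> buildSuggestion(String brand, String business) {
        if (business.contains("/")) {
            List<String> suggestion = new ArrayList<>();
            suggestion.add(brand);
            Collections.addAll(suggestion, business.split("/"));
            return suggestion;
        }
        return Arrays.asList(brand, business);
    }

    public static void main(String[] args) {
        System.out.println(buildSuggestion("Home Inn", "Hongqiao/Xujiahui"));
        // [Home Inn, Hongqiao, Xujiahui]
    }
}
```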
3. Java API for automatic query completion
Core implementation method:
@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the Request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(prefix)
                        .skipDuplicates(true)
                        .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the result
        Suggest suggest = response.getSuggest();
        // 4.1. Get the completion result by the suggestion name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2. Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3. Iterate over the options
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            list.add(text);
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
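Conceptually, the completion suggester behaves like a prefix match over the suggestion entries, with duplicates skipped and the result list capped at size. A plain-Java illustration of that behavior (a mental model only; Elasticsearch actually serves these queries from an in-memory FST, not a linear scan):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CompletionSketch {
    // Return up to `size` distinct entries starting with `prefix`
    // (mirrors prefix matching, skip_duplicates and size).
    static List<String> suggest(List<String> entries, String prefix, int size) {
        Set<String> out = new LinkedHashSet<>();
        for (String e : entries) {
            if (out.size() >= size) break;
            if (e.toLowerCase().startsWith(prefix.toLowerCase())) {
                out.add(e);
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<String> entries = List.of("Sony", "SK-II", "switch", "Sony");
        System.out.println(suggest(entries, "s", 10)); // [Sony, SK-II, switch]
    }
}
```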