[ES] Data aggregation & auto-completion


1. Data aggregation

Aggregations allow us to perform statistics, analysis, and other operations on data extremely conveniently. For example:

  • What brand of mobile phone is the most popular?
  • What are the average, highest, and lowest prices of these phones?
  • How do these phones sell month by month?

Implementing these statistics in Elasticsearch is much more convenient than with SQL in a relational database, and queries are fast enough to deliver a near-real-time search experience.


1.1. Types of Aggregation

There are three common types of aggregation:

  • **Bucket** aggregation: used to group documents

    • terms aggregation: group by a document field value, e.g. group by brand or by country
    • date histogram aggregation: group by date intervals, e.g. one bucket per week or per month
  • **Metric** aggregation: used to calculate some values, such as: maximum value, minimum value, average value, etc.

    • Avg: Average
    • Max: find the maximum value
    • Min: Find the minimum value
    • Stats: Simultaneously seek max, min, avg, sum, etc.
  • **Pipeline** Aggregation: Aggregation based on the results of other aggregations

Note: The fields participating in an aggregation must be of keyword, date, numeric, or boolean type.


1.2. Implementing aggregation with DSL

Now we want to count the hotel brands present in the data, which essentially means grouping the documents by brand. We can aggregate on the hotel brand field, i.e. perform a Bucket aggregation.


1.2.1. Bucket aggregation syntax

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,  // set size to 0 so the result contains no documents, only aggregation results
  "aggs": { // define the aggregations
    "brandAgg": { // give the aggregation a name
      "terms": { // the aggregation type; we aggregate by brand value, so terms is used
        "field": "brand", // the field to aggregate on
        "size": 20 // the number of aggregation results to return
      }
    }
  }
}

The result is shown in the figure:


1.2.2. Aggregation result sorting

By default, a Bucket aggregation counts the number of documents in each bucket, records it as _count, and sorts the buckets in descending order of _count.

We can specify the order attribute to customize the sorting method of the aggregation:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "order": {
          "_count": "asc" // sort by _count in ascending order
        },
        "size": 20
      }
    }
  }
}

1.2.3. Limit the scope of aggregation

By default, a Bucket aggregation aggregates all documents in the index library. In real scenarios, however, users enter search conditions, so the aggregation should apply only to the search results; in other words, its scope must be limited.

We can limit the range of documents to be aggregated by adding query conditions:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "lte": 200 // only aggregate documents priced at 200 or below
      }
    }
  },
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      }
    }
  }
}

This time, significantly fewer brands appear in the aggregation:


1.2.4. Metric aggregation syntax

In the previous lesson, we grouped hotels by brand to form buckets. Now we need to run calculations on the hotels within each bucket to obtain the min, max, and avg of the user ratings for each brand.

This requires a Metric aggregation, for example the stats aggregation, which returns min, max, avg, and other results at once.

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": { // a sub-aggregation of brandAgg, computed separately for each bucket
        "score_stats": { // aggregation name
          "stats": { // aggregation type; stats computes min, max, avg, etc.
            "field": "score" // the field to aggregate, here score
          }
        }
      }
    }
  }
}

Here, score_stats is a sub-aggregation nested inside brandAgg, because the statistics must be calculated separately within each bucket.

In addition, we can sort the buckets by the sub-aggregation results, for example by the average hotel score of each bucket, by setting "order": { "score_stats.avg": "desc" } inside the terms aggregation.


1.2.5. Summary

aggs stands for aggregations and sits at the same level as query. What, then, is the role of query here?

  • It limits the scope of the documents being aggregated

The three elements necessary for aggregation:

  • aggregation name
  • aggregation type
  • aggregation field

Configurable aggregation properties include:

  • size: the number of aggregation results to return
  • order: how the aggregation results are sorted
  • field: the field to aggregate on

1.3. Implementing aggregation with the RestAPI


1.3.1. API Syntax

Aggregation conditions are at the same level as query conditions, so request.source() needs to be used to specify aggregation conditions.

The syntax for building aggregation conditions is shown below.

The aggregation result also differs from an ordinary query result, and the parsing API is special as well, but the response JSON is still parsed layer by layer.
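As a minimal sketch (using the same client calls as the full HotelService code in section 1.3.3), building a terms aggregation and parsing its buckets with the Java High Level REST Client looks roughly like this; the RestHighLevelClient passed in is assumed to be configured elsewhere in the project:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BrandAggregationDemo {

    // Builds the same brandAgg terms aggregation as the DSL above and parses its buckets.
    public List<String> aggregateBrands(RestHighLevelClient client) throws IOException {
        // 1. Prepare the request: aggregations go into request.source(), at the same level as the query
        SearchRequest request = new SearchRequest("hotel");
        request.source().size(0); // we only want aggregation results, no documents
        request.source().aggregation(AggregationBuilders
                .terms("brandAgg")   // aggregation name
                .field("brand")      // aggregation field
                .size(20));          // number of buckets to return
        // 2. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 3. Parse layer by layer: aggregations -> brandAgg -> buckets -> key
        Terms brandTerms = response.getAggregations().get("brandAgg");
        List<String> brands = new ArrayList<>();
        for (Terms.Bucket bucket : brandTerms.getBuckets()) {
            brands.add(bucket.getKeyAsString()); // the bucket key, i.e. the brand name
        }
        return brands;
    }
}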


1.3.2. Business requirements

Requirement: the brand, city, and other filter options on the search page should not be hard-coded in the page, but obtained by aggregating the hotel data in the index library:

Analysis:

At present, the city list, star list, and brand list on the page are all hard-coded, and will not change as the search results change. But when the user's search conditions change, the search results will change accordingly.

For example: if a user searches for "Oriental Pearl", the searched hotel must be near the Shanghai Oriental Pearl Tower. Therefore, the city can only be Shanghai, and the information of Beijing, Shenzhen, and Hangzhou should not be displayed in the city list at this time.

In other words, only the cities that appear in the search results should be listed on the page, and only the brands that appear in the search results should be listed on the page.

How do I know which brands are included in my search results? How do I know which cities are included in my search results?

Use the aggregation function and Bucket aggregation to group the documents in the search results based on brands and cities, and you can know which brands and cities are included.

Because it is an aggregation over search results, it is a limited-scope aggregation; that is, the conditions limiting the aggregation are the same as the conditions of the document search.

Looking at the browser, we can find that the front end has actually sent such a request:

The request parameters are exactly the same as those for searching documents.

The return value type is the final result to be displayed on the page:

The result is a Map structure:

  • the key is a string: city, star rating, brand, or price
  • value is a collection, such as the names of multiple cities

1.3.3. Business implementation

Add a method to HotelController in the cn.itcast.hotel.web package, following these requirements:

  • Request method: POST
  • Request path: /hotel/filters
  • Request parameters: RequestParams, the same as the parameters for the document search
  • Return value type: Map<String, List<String>>

Code:

    @PostMapping("filters")
    public Map<String, List<String>> getFilters(@RequestBody RequestParams params){
        return hotelService.getFilters(params);
    }

The getFilters method in IHotelService is called here, which has not been implemented yet.

Define the new method in cn.itcast.hotel.service.IHotelService:

Map<String, List<String>> getFilters(RequestParams params);

Implement the method in cn.itcast.hotel.service.impl.HotelService:

@Override
public Map<String, List<String>> getFilters(RequestParams params) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        // 2.1. query
        buildBasicQuery(params, request);
        // 2.2. set size
        request.source().size(0);
        // 2.3. aggregations
        buildAggregation(request);
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the results
        Map<String, List<String>> result = new HashMap<>();
        Aggregations aggregations = response.getAggregations();
        // 4.1. Get the brand results by aggregation name
        List<String> brandList = getAggByName(aggregations, "brandAgg");
        result.put("品牌", brandList);
        // 4.2. Get the city results by aggregation name
        List<String> cityList = getAggByName(aggregations, "cityAgg");
        result.put("城市", cityList);
        // 4.3. Get the star-rating results by aggregation name
        List<String> starList = getAggByName(aggregations, "starAgg");
        result.put("星级", starList);

        return result;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private void buildAggregation(SearchRequest request) {
    request.source().aggregation(AggregationBuilders
                                 .terms("brandAgg")
                                 .field("brand")
                                 .size(100)
                                );
    request.source().aggregation(AggregationBuilders
                                 .terms("cityAgg")
                                 .field("city")
                                 .size(100)
                                );
    request.source().aggregation(AggregationBuilders
                                 .terms("starAgg")
                                 .field("starName")
                                 .size(100)
                                );
}

private List<String> getAggByName(Aggregations aggregations, String aggName) {
    // 4.1. Get the aggregation result by its name
    Terms brandTerms = aggregations.get(aggName);
    // 4.2. Get the buckets
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    // 4.3. Iterate over the buckets
    List<String> brandList = new ArrayList<>();
    for (Terms.Bucket bucket : buckets) {
        // 4.4. Get the key
        String key = bucket.getKeyAsString();
        brandList.add(key);
    }
    return brandList;
}

2. Auto-completion

When the user types characters in the search box, we should suggest search terms related to those characters, as shown in the figure:

This feature of suggesting complete terms based on the letters the user has typed is called auto-completion.

Because the suggestions must be inferred from pinyin letters, pinyin analysis (word segmentation) is needed.


2.1. Pinyin tokenizer

To achieve completion based on letters, it is necessary to segment the document according to pinyin. There happens to be a pinyin word segmentation plugin for elasticsearch on GitHub. Address: https://github.com/medcl/elasticsearch-analysis-pinyin

The installation package of the pinyin tokenizer is also provided in the pre-class materials:

The installation is the same as for the IK tokenizer:

① Unzip it

② Upload it to the plugin directory of elasticsearch on the virtual machine

③ Restart elasticsearch

④ Test

For detailed installation steps, please refer to the installation process of IK tokenizer.

The test usage is as follows:

POST /_analyze
{
  "text": "如家酒店还不错",
  "analyzer": "pinyin"
}

result:


2.2. Custom tokenizer

The default pinyin tokenizer converts every single Chinese character into pinyin, but what we want is a pinyin representation for each term. So we need to customize the pinyin configuration and build a custom analyzer.

An analyzer in elasticsearch consists of three parts:

  • character filters: process the text before the tokenizer, e.g. deleting or replacing characters
  • tokenizer: cuts the text into terms according to certain rules; for example, keyword does no splitting, and ik_smart is another option
  • token filters: further process the terms output by the tokenizer, e.g. case conversion, synonyms, pinyin conversion

When a document is analyzed, it is processed by these three parts in turn:

The syntax for declaring a custom tokenizer is as follows:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzer
        "my_analyzer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom token filter
        "py": { // filter name
          "type": "pinyin", // filter type, here pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Test:

Summary:

How to use Pinyin tokenizer?

  • ①Download the pinyin tokenizer

  • ② Unzip and put it in the plugin directory of elasticsearch

  • ③Restart

How to customize the tokenizer?

  • ① When creating an index library, configure it in settings, which can contain three parts

  • ②character filter

  • ③tokenizer

  • ④filter

Precautions for the pinyin tokenizer?

  • To avoid matching homophones, do not use the pinyin analyzer at search time

2.3. Autocomplete query

Elasticsearch provides Completion Suggester query to achieve automatic completion. This query will match terms beginning with the user input and return them. In order to improve the efficiency of the completion query, there are some constraints on the types of fields in the document:

  • The fields participating in the completion query must be of completion type.

  • The content of the field is generally an array of the terms to be used for completion.

For example, an index library like this:

// create the index library
PUT test
{
  "mappings": {
    "properties": {
      "title":{
        "type": "completion"
      }
    }
  }
}

Then insert the following data:

// sample data
POST test/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}

The query DSL statement is as follows:

// auto-completion query
GET /test/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword (prefix)
      "completion": {
        "field": "title", // the field for the completion query
        "skip_duplicates": true, // skip duplicates
        "size": 10 // return the top 10 results
      }
    }
  }
}

2.4. Implementing auto-completion for the hotel search box

Our hotel index library has not yet been set up with a pinyin analyzer, so its configuration needs to change. However, an index library cannot be modified in place; it can only be deleted and recreated.

In addition, we need to add a field for auto-completion and put the brand, city, business district, and so on into it as the completion suggestions.

So, to summarize, the things we need to do include:

  1. Modify the structure of the hotel index library and set up a custom pinyin analyzer

  2. Modify the name and all fields of the index library to use the custom analyzer

  3. Add a new suggestion field of type completion to the index library, using the custom analyzer

  4. Add a suggestion field to the HotelDoc class, which contains brand and business

  5. Re-import data to the hotel library


2.4.1. Modify the hotel mapping structure

The code is as follows:

// hotel data index library
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}

2.4.2. Modify the HotelDoc entity

A field needs to be added to HotelDoc for auto-completion, and its content can be information such as the hotel brand, city, and business district. As required for completion fields, it should preferably be an array of these values.

So we add a suggestion field in HotelDoc, the type is List<String>, and then put information such as brand, city, business, etc. into it.

The code is as follows:

package cn.itcast.hotel.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // assemble the suggestion field
        if(this.business.contains("/")){
            // business has multiple values and must be split
            String[] arr = this.business.split("/");
            // add the elements
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        }else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}

2.4.3. Reimport

Re-run the data import function written earlier; the newly imported hotel data will now contain the suggestion field.
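If that import test is not at hand, a rough sketch of the bulk re-import looks like the following; the hotelService.list() call (MyBatis-Plus) and the injected RestHighLevelClient are assumptions carried over from earlier chapters of this project:

import cn.itcast.hotel.pojo.Hotel;
import cn.itcast.hotel.pojo.HotelDoc;
import cn.itcast.hotel.service.IHotelService;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.IOException;
import java.util.List;

@SpringBootTest
class HotelDocumentTest {
    @Autowired
    private RestHighLevelClient client;  // assumed to be registered as a Spring bean
    @Autowired
    private IHotelService hotelService;  // assumed to expose the hotel data from the database

    @Test
    void testBulkImport() throws IOException {
        // 1. Query all hotels from the database (hotelService.list() is an assumption
        //    based on the MyBatis-Plus service used earlier in this project)
        List<Hotel> hotels = hotelService.list();
        // 2. Build a bulk request, one IndexRequest per hotel
        BulkRequest request = new BulkRequest();
        for (Hotel hotel : hotels) {
            // the HotelDoc constructor now also fills the suggestion field
            HotelDoc hotelDoc = new HotelDoc(hotel);
            request.add(new IndexRequest("hotel")
                    .id(hotelDoc.getId().toString())
                    .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
        }
        // 3. Send the bulk request
        client.bulk(request, RequestOptions.DEFAULT);
    }
}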


2.4.4. Java API for auto-completion query

We learned the DSL of the auto-completion query earlier, but not the corresponding Java API; an example follows.

The auto-completion result is also quite special, and the parsing code follows the same layer-by-layer pattern.
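As a minimal sketch (using the same calls as the full service implementation in 2.4.5 below), the request is built via request.source().suggest() and the result parsed from Suggest down to the options; the RestHighLevelClient is again assumed to be configured elsewhere:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CompletionSuggestDemo {

    // Runs a completion suggester on the suggestion field and returns the suggested texts.
    public List<String> suggest(RestHighLevelClient client, String prefix) throws IOException {
        // 1. Prepare the request: the suggest block also goes into request.source()
        SearchRequest request = new SearchRequest("hotel");
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",                                      // name of this suggestion
                SuggestBuilders.completionSuggestion("suggestion")  // the completion field
                        .prefix(prefix)                             // the user's input
                        .skipDuplicates(true)
                        .size(10)));
        // 2. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 3. Parse layer by layer: suggest -> suggestions -> options -> text
        CompletionSuggestion suggestions = response.getSuggest().getSuggestion("suggestions");
        List<String> result = new ArrayList<>();
        for (CompletionSuggestion.Entry.Option option : suggestions.getOptions()) {
            result.add(option.getText().toString());
        }
        return result;
    }
}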


2.4.5. Implementing search-box auto-completion

Looking at the front-end page, we can find that when we type in the input box, the front-end will initiate an ajax request:

The return value is a collection of completion suggestions; its type is List<String>.

1) Add a new endpoint in HotelController under the cn.itcast.hotel.web package to receive the new request:

@GetMapping("suggestion")
public List<String> getSuggestions(@RequestParam("key") String prefix) {
    return hotelService.getSuggestions(prefix);
}

2) Add the method in IHotelService under the cn.itcast.hotel.service package:

List<String> getSuggestions(String prefix);

3) Implement the method in cn.itcast.hotel.service.impl.HotelService:

@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
            "suggestions",
            SuggestBuilders.completionSuggestion("suggestion")
            .prefix(prefix)
            .skipDuplicates(true)
            .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the result
        Suggest suggest = response.getSuggest();
        // 4.1. Get the completion result by the suggestion name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2. Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3. Iterate over the options
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            list.add(text);
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Study Notes from Dark Horse Programmer

By – Suki 2023/4/9
