[Elasticsearch] Search result processing and RestClient query documents

Table of contents

2. Search result processing

2.1. Sorting

2.1.1. Ordinary field sorting

2.1.2. Geographical coordinate sorting

2.2. Pagination

2.2.1. Basic pagination

2.2.2. Deep pagination problem

2.2.3. Summary

2.3. Highlight

2.3.1. Highlighting principle

2.3.2. Implementing highlighting

2.4. Summary

3. RestClient query document

3.1. Quick Start

3.1.1. Initiate query request

3.1.2. Parsing the response

3.1.3. Complete code

3.1.4. Summary

3.2. match query

3.3. Precise query

3.4. Boolean queries

3.5. Sorting, paging

3.6. Highlight

3.6.1. Highlight request build

3.6.2. Analysis of highlighted results

2. Search result processing

The search results can be processed or displayed in a way specified by the user.

2.1. Sorting

Elasticsearch sorts by relevance score (_score) by default, but it also supports custom sorting of search results. Sortable field types include keyword, numeric, geo-point, and date types, among others.

2.1.1. Ordinary field sorting

The syntax for sorting by keyword, numeric, and date fields is basically the same.

Syntax:

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FIELD": "desc" // sort field; order can be "asc" or "desc"
    }
  ]
}

The sort condition is an array, so multiple sort conditions can be specified. They are applied in declaration order: when documents tie on the first condition, they are sorted by the second, and so on.

Example:

Requirement description: sort the hotel data in descending order of user rating (score); for equal ratings, sort in ascending order of price (price).
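
A possible DSL for this requirement (a sketch; it assumes the hotel index has score and price fields as implied above):

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "score": "desc" },
    { "price": "asc" }
  ]
}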

2.1.2. Geographical coordinate sorting

Geographic coordinate ordering is slightly different.

Syntax:

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
          "FIELD" : "latitude, longitude", // key: the geo_point field in the document; value: the target coordinate point
          "order" : "asc", // sort order
          "unit" : "km" // distance unit used for sorting
      }
    }
  ]
}

The meaning of this query is:

  • Specify a coordinate as the target point

  • For each document, calculate the distance from the coordinates in the specified field (which must be of geo_point type) to the target point

  • Sort by distance

Example:

Requirement description: sort the hotel data in ascending order of distance from your current location coordinates

Tip: you can get your location's latitude and longitude from the Gaode Map API JS API 2.0 Example Center, e.g. the "get the latitude and longitude of a mouse click" demo.

Suppose my location is 31.034661, 121.612282, and I want to find the hotels nearest to me.
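
A possible DSL sketch for this requirement. The name of the geo_point field (here location) is an assumption, since the mapping is not shown in this article:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "location": "31.034661, 121.612282", // assumed geo_point field name; value is the target point
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}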

2.2. Pagination

Elasticsearch returns only the top 10 results by default. To query more data, you need to modify the paging parameters. In Elasticsearch, the from and size parameters control which slice of results is returned:

  • from: the offset of the first document to return

  • size: the number of documents to return

This is similar to MySQL's limit ?, ?.

2.2.1. Basic pagination

The basic syntax of pagination is as follows:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0, // starting offset for pagination, default 0
  "size": 10, // expected number of documents per page
  "sort": [
    {"price": "asc"}
  ]
}

2.2.2. Deep pagination problem

Now, suppose I want to query documents 990~1000. The query would be written as follows:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 990, // starting offset for pagination, default 0
  "size": 10, // expected number of documents per page
  "sort": [
    {"price": "asc"}
  ]
}

This queries data starting from document 990, that is, documents 990 to 1000.

However, when paging internally, Elasticsearch must first query documents 0~1000 and then cut out the 10 entries from 990~1000:

Querying the top 1000 has little impact if Elasticsearch runs as a single node.

But Elasticsearch is usually deployed as a cluster. For example, if my cluster has 5 nodes and I want the top 1000 documents, it is not enough to query 200 per node.

The top 200 of node A may rank beyond 10,000 on another node.

Therefore, to obtain the top 1000 of the whole cluster, each node must first return its own top 1000; the results are then merged, re-ranked, and truncated to the top 1000.

So what if I want documents 9900~10000? Does each node then have to query its top 10,000 entries and aggregate them all in memory?

When the paging depth is large, the amount of data to merge becomes too large, putting heavy pressure on memory and CPU. Therefore, Elasticsearch rejects requests where from + size exceeds 10,000.

For deep paging, ES provides two solutions (see the official documentation):

  • search after: paging requires a sort; the idea is to query the next page starting from the sort values of the last document of the previous page. This is the officially recommended approach (see the sketch after this list).

  • scroll: the idea is to take a snapshot of the sorted document ids and keep it in memory. It is no longer recommended officially.
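
A minimal search after sketch. The id tie-breaker field and the values inside search_after are illustrative assumptions, not part of the original article:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "sort": [
    { "price": "asc" },
    { "id": "asc" } // hypothetical unique field used as a tie-breaker
  ],
  "search_after": [250, "2056126831"] // sort values of the last hit on the previous page (illustrative)
}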

2.2.3. Summary

Common pagination implementations and their advantages and disadvantages:

  • from + size

    • Advantages: Support random page turning

    • Disadvantages: deep paging problem, the default query upper limit (from + size) is 10000

    • Scenario: Random page-turning searches such as Baidu, JD.com, Google, and Taobao

  • search after

    • Advantages: no upper limit on paging depth (the size of a single request still cannot exceed 10,000)

    • Disadvantages: can only page forward one page at a time; random page jumps are not supported

    • Scenario: searches that do not need random page jumps, such as scrolling down to load more on a mobile phone

  • scroll

    • Advantages: no upper limit on paging depth (the size of a single request still cannot exceed 10,000)

    • Disadvantages: extra memory consumption, and the search results are not real-time

    • Scenario: bulk export or migration of massive amounts of data. Not recommended since ES 7.1; the search after solution is recommended instead

2.3. Highlight

2.3.1. Highlighting principle

What is highlighting?

When we search on Baidu or JD.com, the search keywords are shown in red to make them stand out. This is called highlighting.

The implementation of highlighting is divided into two steps:

  • 1) Wrap every keyword in the document results with a tag, such as <em>

  • 2) The page defines a CSS style for the <em> tag

2.3.2. Implementing highlighting

Highlighting syntax:

GET /hotel/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT" // query condition, highlight must use full-text search query
    }
  },
  "highlight": {
    "fields": { // Specify the fields to highlight
      "FIELD": {
        "pre_tags": "<em>", // pre-tags used to mark highlighted fields
        "post_tags": "</em>" // Post tags used to mark highlighted fields
      }
    }
  }
}

Note:

  • Highlighting applies to keywords, so the search condition must contain keywords; a range query cannot be highlighted.

  • By default, the highlighted field must be the same field used in the search condition, otherwise nothing is highlighted.

  • To highlight a field other than the search field, add the attribute: require_field_match=false

Example:
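
A possible example, consistent with the Java code later in this article. It assumes the hotel index has an all field for full-text search and a name field, as in those examples:

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "Home Inn"
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "require_field_match": false // "all" is searched, but "name" is highlighted
      }
    }
  }
}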

2.4. Summary

The query DSL is a large JSON object with the following properties:

  • query: query condition

  • from and size: paging conditions

  • sort: sorting conditions

  • highlight: highlight condition

Example:
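
A sketch putting the four parts together. Field names follow the earlier examples; the keyword and page values are illustrative:

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "Home Inn"
    }
  },
  "from": 0,
  "size": 10,
  "sort": [
    { "price": "asc" }
  ],
  "highlight": {
    "fields": {
      "name": {
        "require_field_match": false
      }
    }
  }
}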

3. RestClient query document

Document queries also use the RestHighLevelClient object introduced earlier. The basic steps include:

  • 1) Prepare the Request object

  • 2) Prepare request parameters

  • 3) Initiate a request

  • 4) Parse the response

3.1. Quick Start

Let's take the match_all query as an example.

3.1.1. Initiate query request

Code interpretation:

  • The first step is to create a SearchRequest object and specify the index name

  • The second step is to use request.source() to build the DSL, which can include query, paging, sorting, highlighting, etc.

    • query(): represents the query condition; here QueryBuilders.matchAllQuery() builds the DSL for a match_all query

  • The third step is to use client.search() to send the request and get the response

There are two key APIs here. One is request.source(), which carries all the DSL features such as query, sorting, paging, and highlighting.

The other is QueryBuilders, which provides builders for the various queries such as match, term, function_score, bool, etc.
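
A rough sketch of how the two APIs fit together (imports and the client field are assumed to be the same as in the test class shown in 3.1.3; the sort and paging calls merely preview later sections):

// request.source() carries the whole DSL; QueryBuilders builds the query part
SearchRequest request = new SearchRequest("hotel");
request.source()
    .query(QueryBuilders.matchAllQuery()) // could also be matchQuery, termQuery, boolQuery, ...
    .from(0).size(10)                     // paging
    .sort("price", SortOrder.ASC);        // sorting
SearchResponse response = client.search(request, RequestOptions.DEFAULT);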

3.1.2. Parsing the response

Analysis of the response result:

The result returned by elasticsearch is a JSON string, the structure contains:

  • hits: the result of the hit

    • total: the total number of hits; its value field holds the actual count

    • max_score: the relevance score of the highest scoring document across all results

    • hits: An array of documents for search results, each of which is a json object

      • _source: the original data in the document, also a json object

Therefore, we parse the response result, which is to parse the JSON string layer by layer. The process is as follows:

  • SearchHits: Obtained through response.getHits(), which is the outermost hits in JSON, representing the result of the hit

    • SearchHits#getTotalHits().value: gets the total number of hits

    • SearchHits#getHits(): Get the SearchHit array, which is the document array

      • SearchHit#getSourceAsString(): Get the _source in the document result, which is the original json document data

3.1.3. Complete code

The complete code is as follows:

@Test
void testMatchAll() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    request.source()
        .query(QueryBuilders.matchAllQuery());
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}
private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1. Get the total number of items
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " documents in total");
    // 4.2. Document array
    SearchHit[] hits = searchHits.getHits();
    // 4.3. Traverse
    for (SearchHit hit : hits) {
        // get document source
        String json = hit.getSourceAsString();
        // deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        System.out.println("hotelDoc = " + hotelDoc);
    }
}

3.1.4. Summary

The basic steps of a query are:

  1. Create a SearchRequest object

  2. Prepare request.source(), which builds the DSL.

    ① Use QueryBuilders to build the query condition

    ② Pass it into the query() method of request.source()

  3. send request, get result

  4. Parse the results (follow the JSON structure from the outside in, layer by layer)

3.2. match query

The full-text match and multi_match queries use basically the same API as match_all; the difference lies in the query condition, i.e. the query part.

Therefore, the difference in the Java code is mainly the parameter passed to request.source().query(), again using the methods provided by QueryBuilders:
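
For example (a sketch; "Home Inn" is the keyword used elsewhere in this article, and the field names passed to multiMatchQuery are illustrative):

// single-field full-text query
request.source().query(QueryBuilders.matchQuery("all", "Home Inn"));
// multi-field full-text query (field names are illustrative)
request.source().query(QueryBuilders.multiMatchQuery("Home Inn", "name", "brand"));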

The result parsing code is completely consistent and can be extracted and shared.

The complete code is as follows:

@Test
void testMatch() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    request.source()
        .query(QueryBuilders.matchQuery("all", "Home Inn"));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

3.3. Precise query

There are two main kinds of exact query:

  • term: term exact match

  • range: range query

Compared with the previous queries, the difference is again only in the query condition; everything else is the same.

The APIs for building these query conditions are as follows:
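
A short sketch of both builders. The city and price values follow the boolean query example in the next section; the lower bound of 100 is illustrative:

// term query: exact match on a keyword field
request.source().query(QueryBuilders.termQuery("city", "Hangzhou"));
// range query: price between 100 and 250
request.source().query(QueryBuilders.rangeQuery("price").gte(100).lte(250));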

3.4. Boolean queries

A boolean query combines other queries with must, must_not, filter, and so on.

As you can see, the only difference from the other queries is how the query condition is built; the use of QueryBuilders, the result parsing, and the rest of the code are completely unchanged.

The complete code is as follows:

@Test
void testBool() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1. Prepare BooleanQuery
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.2. Add term
    boolQuery.must(QueryBuilders.termQuery("city", "Hangzhou"));
    // 2.3. Add range
    boolQuery.filter(QueryBuilders.rangeQuery("price").lte(250));
    request.source().query(boolQuery);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

3.5. Sorting, paging

The sorting and paging of search results are parameters at the same level as query, so they are also set using request.source().

The corresponding APIs are the sort(), from() and size() methods of request.source().

Full code example:

@Test
void testPageAndSort() throws IOException {
    // page number, size per page
    int page = 1, size = 5;
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1.query
    request.source().query(QueryBuilders.matchAllQuery());
    // 2.2. sort sort
    request.source().sort("price", SortOrder.ASC);
    // 2.3. Paging from, size
    request.source().from((page - 1) * size).size(size);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

3.6. Highlight

The highlighting code differs from the previous code in two respects:

  • Query DSL: in addition to the query condition, you also need to add the highlight condition, which sits at the same level as query.

  • Result parsing: in addition to the _source document data, the highlight results also have to be parsed from the response

3.6.1. Highlight request build

The highlight request is built with a HighlightBuilder and passed to request.source().highlighter(), as in the complete code below.

Don't forget: the highlight query must be a full-text search with search keywords, otherwise there is nothing to highlight.

The complete code is as follows:

@Test
void testHighlight() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1.query
    request.source().query(QueryBuilders.matchQuery("all", "Home Inn"));
    // 2.2. Highlight
    request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

3.6.2. Analysis of highlighted results

By default, the highlighted results are returned separately from the document (_source) results; they are not merged together.

Parsing the highlights therefore requires some extra processing:

Code interpretation:

  • Step 1: get the source with hit.getSourceAsString(); this is the non-highlighted result as a JSON string, and it still needs to be deserialized into a HotelDoc object

  • Step 2: get the highlight results with hit.getHighlightFields(); the return value is a Map whose key is the highlighted field name and whose value is a HighlightField object holding the highlighted value

  • Step 3: Obtain the highlighted field value object HighlightField from the map according to the highlighted field name

  • Step 4: get the fragments from the HighlightField and convert them to a string; this is the actual highlighted string

  • Step 5: Replace non-highlighted results in HotelDoc with highlighted results

The complete code is as follows:

private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1. Get the total number of items
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " documents in total");
    // 4.2. Document array
    SearchHit[] hits = searchHits.getHits();
    // 4.3. Traverse
    for (SearchHit hit : hits) {
        // get document source
        String json = hit.getSourceAsString();
        // deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        // get highlighted result
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        if (!CollectionUtils.isEmpty(highlightFields)) {
            // Get the highlighted result according to the field name
            HighlightField highlightField = highlightFields.get("name");
            if (highlightField != null) {
                // Get the highlight value
                String name = highlightField.getFragments()[0].string();
                // overwrite non-highlighted results
                hotelDoc.setName(name);
            }
        }
        System.out.println("hotelDoc = " + hotelDoc);
    }
}

Origin blog.csdn.net/weixin_45481821/article/details/131739939