Spring Cloud Series (16) [Distributed Search Engine] - Learning DSL Queries and Relevance Scoring (Part)

In Spring Cloud Series (15) [Distributed Search Engine], we learned the RestClient client API through realistic application scenarios, got a first look at RestClient, and stored some data. Storing data is not why we learn ElasticSearch, though. ElasticSearch is at its best in data search and analysis, so this post demonstrates its search features.

① DSL query for documents

Common query types:

  • Query all: returns all data, for example match_all;
  • Full-text search query (full text): uses an analyzer to split the user's input into terms, then matches those terms against the inverted index, for example match / multi_match;
  • Precise query: looks up data by exact term values, generally on fields such as dates and numeric values, for example ids / range / term;
  • Geographic query (geo): usually queries by latitude and longitude, for example geo_distance / geo_bounding_box;
  • Compound query (compound): combines the query types above into more complex conditions, for example bool / function_score.

The queries below all share essentially the same basic syntax:

GET /index_name/_search
{
  "query": {
    "query_type": {
      "query_condition": "condition_value"
    }
  }
}

1.1 Query all

GET /hotel/_search
{
  "query": {
    "match_all": {}
  }
}

Because this query returns all data, it takes no query condition; the query type is simply match_all. It is rarely used in practice, and its use is largely limited to testing.

1.2 Full text search query


Full-text search queries have many use cases and show up constantly in daily life. When you shop for shoes or clothes on Taobao, for example, you search for a brand name or an item name, and that input has to be matched against the terms in the index. The fields taking part in the search must therefore be text fields that can be analyzed. From this example, the basic flow of a full-text search query is:

  • Analyze the search input to split it into terms;
  • Look the terms up in the inverted index to get the matching document ids;
  • Fetch the documents by id and return them to the page.
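
The three steps above can be sketched in a few lines of Python. This is only a toy model, not how ElasticSearch is implemented: the sample documents, the whitespace analyzer, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents (id -> hotel name); not taken from the article's index.
DOCS = {
    1: "shanghai sheraton hotel",
    2: "beijing hilton hotel",
    3: "beijing sheraton resort",
}


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (a real analyzer does far more)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, docs, user_input):
    """Step 1: analyze the input. Step 2: collect ids per term. Step 3: fetch the documents."""
    doc_ids = set()
    for term in analyze(user_input):
        doc_ids |= index.get(term, set())
    return {doc_id: docs[doc_id] for doc_id in sorted(doc_ids)}


INDEX = build_inverted_index(DOCS)
RESULT = search(INDEX, DOCS, "Beijing Sheraton")
print(RESULT)
```

A document is returned as soon as it contains any one of the terms, which is why a two-word input can hit documents that contain only one of the words.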

The main full-text search queries are match and multi_match: match queries a single field, while multi_match queries several fields, and a document matches as soon as any one of those fields satisfies the condition. Examples:

match:

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "喜来登"
    }
  }
}


multi_match:

GET /hotel/_search
{
  "query": {
    "multi_match": {
      "query": "上海喜来登",
      "fields": ["name", "business"]
    }
  }
}

Note: multi_match queries several fields, and the more fields are involved, the slower the query. Prefer a match query on a single field where you can.

1.3 Precise query

Precise queries also have many use cases, such as querying data within a date range or querying data for exactly one region:

  • term: queries by the exact value of a term. For example, when filtering Sheraton hotels by the Beijing area, only Sheraton hotels in Beijing are returned;
  • range: queries by a range of values, for example hotels priced above 500 yuan.

term query:

GET /hotel/_search
{
  "query": {
    "term": {
      "city": {
        "value": "上海"
      }
    }
  }
}

Note: the value must be an exact term and cannot be a phrase made up of several words. If the value is "Beijing Shanghai", for example, there are no search results.

range query:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 2000,
        "lte": 5000
      }
    }
  }
}

This returns hotels priced from 2,000 to 5,000 yuan, both bounds included.
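
To make the matching rules of term and range concrete, here is a minimal in-memory sketch. It assumes city is stored as a single exact value (as a keyword field would be); the hotel records and helper names are made up for illustration and say nothing about how ElasticSearch evaluates these queries internally.

```python
# Hypothetical hotel records; field names mirror the queries above, values are made up.
HOTELS = [
    {"name": "Hotel A", "city": "上海", "price": 2500},
    {"name": "Hotel B", "city": "北京", "price": 800},
    {"name": "Hotel C", "city": "上海", "price": 6000},
]


def term_match(doc, field, value):
    """term: the stored value must equal the query value exactly; the input is not analyzed."""
    return doc.get(field) == value


def range_match(doc, field, gte=None, lte=None):
    """range: keep documents whose field lies within the inclusive bounds that are given."""
    value = doc.get(field)
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True


IN_SHANGHAI = [h["name"] for h in HOTELS if term_match(h, "city", "上海")]
NO_HIT = [h["name"] for h in HOTELS if term_match(h, "city", "北京上海")]
MID_PRICE = [h["name"] for h in HOTELS if range_match(h, "price", gte=2000, lte=5000)]
print(IN_SHANGHAI, NO_HIT, MID_PRICE)
```

The second lookup finds nothing because no record stores the combined value, which mirrors the note on term queries above.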

1.4 Geographic coordinate query


When travelling, hailing a ride, or booking a hotel, you often need to find nearby cars or hotels. Geographic coordinate queries provide this. The first kind, geo_bounding_box, queries by latitude and longitude within a rectangular range:

GET /hotel/_search
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": {
          "lat": 31.35786,
          "lon": 121.59324
        },
        "bottom_right": {
          "lat": 31.35493,
          "lon": 121.59838
        }
      }
    }
  }
}

This requires first working out the coordinates of the top-left and bottom-right corners, which is relatively cumbersome. A simpler option is to query by distance with geo_distance, which returns all documents that lie within a given distance of a specified center point. In other words, with my current position as the center, everything inside the circle of that radius matches:

GET /hotel/_search
{
  "query": {
    "geo_distance": {
      "distance": "5km",
      "location": "39.94076,116.46099"
    }
  }
}

Here, with Sanlitun as the center point, the query finds all hotels within 5 km of it.
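
The two geographic conditions can be approximated in plain Python. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km; both are assumptions for illustration, and ElasticSearch's own distance calculation may differ in detail. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(center, point, distance_km):
    """geo_distance idea: is the point inside the circle around center?"""
    return haversine_km(center[0], center[1], point[0], point[1]) <= distance_km


def in_bounding_box(point, top_left, bottom_right):
    """geo_bounding_box idea: is the point inside the rectangle (ignores date-line wrap)?"""
    lat, lon = point
    return bottom_right[0] <= lat <= top_left[0] and top_left[1] <= lon <= bottom_right[1]


CENTER = (39.94076, 116.46099)  # the center point used in the geo_distance query above
NEARBY = (39.95076, 116.46099)  # hypothetical point 0.01 degrees of latitude to the north
print(round(haversine_km(*CENTER, *NEARBY), 2), within_distance(CENTER, NEARBY, 5))
```

The bounding-box check only needs four comparisons, while the distance check needs trigonometry per point; the distance form is simply easier to specify because it takes one center and one radius.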

1.5 Compound query

  Compound queries combine simple queries to implement more complex search logic. There are two main types:

  • function_score: a score-function query, which controls document ranking by adjusting the relevance score of documents;
  • bool query: a Boolean query, which uses logical relationships to combine several queries into complex search logic.

1.5.1 Relevance score

  When we query with match, each result document is scored by its relevance to the search terms, and the results are returned in descending order of score. Early versions of ElasticSearch used the TF-IDF algorithm for scoring; it was later replaced by BM25. BM25 puts an upper limit on the score that a single term can contribute, so its curve is smoother.
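
The difference between the two algorithms is easiest to see numerically. The sketch below uses textbook forms of both formulas, with the commonly quoted BM25 defaults k1 = 1.2 and b = 0.75; it is an illustration under those assumptions, not a copy of the scoring code ElasticSearch runs, and the function names are hypothetical.

```python
import math


def tf_idf(freq, doc_count, docs_with_term):
    """Textbook TF-IDF for one term: grows without bound as the term frequency grows."""
    idf = math.log(doc_count / docs_with_term)
    return freq * idf


def bm25(freq, doc_count, docs_with_term, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 for one term: idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + norm)


def bm25_upper_bound(doc_count, docs_with_term, k1=1.2):
    """Limit of bm25() as freq goes to infinity: idf * (k1 + 1)."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    return idf * (k1 + 1)


# Hypothetical corpus: 100 documents, 10 of which contain the term, all of average length.
for freq in (1, 10, 1000):
    print(freq, round(tf_idf(freq, 100, 10), 3), round(bm25(freq, 100, 10, 50, 50), 3))
```

As the term frequency rises, the TF-IDF column keeps growing linearly, while the BM25 column flattens out below bm25_upper_bound(100, 10).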

1.5.2 Syntax

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "all": "北京"
        }
      },
      "functions": [
        {
          "filter": {
            "term": {
              "id": "1"
            }
          },
          "weight": 10
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

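The results come back in descending order of score. In the syntax above, query is the original query that produces the query score; each entry in functions pairs a filter, which selects the documents the function applies to, with a score function (here weight, a constant); and boost_mode states how the function score and the query score are combined (multiply here). The process is as follows: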

  • Search for documents with the original query and calculate the relevance score, that is, the original score (query score);
  • Use the filter condition to select the documents that the score function applies to;
  • For those documents, compute the function score with the score function;
  • Combine the original score (query score) and the function score according to boost_mode to get the final relevance score.
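
The process above can be sketched as a small function. This is a simplified model with a single weight function: it assumes that a document which does not match the filter simply keeps its query score, which glosses over details of the real function_score query. The names final_score and combine are hypothetical.

```python
def combine(query_score, func_score, boost_mode):
    """Merge the query score and the function score according to boost_mode."""
    if boost_mode == "multiply":
        return query_score * func_score
    if boost_mode == "sum":
        return query_score + func_score
    if boost_mode == "replace":
        return func_score
    if boost_mode == "avg":
        return (query_score + func_score) / 2
    if boost_mode == "max":
        return max(query_score, func_score)
    if boost_mode == "min":
        return min(query_score, func_score)
    raise ValueError(f"unknown boost_mode: {boost_mode}")


def final_score(query_score, matches_filter, weight, boost_mode="multiply"):
    """Apply a constant weight function to documents that match the filter.

    Assumption of this sketch: documents that do not match keep their query score.
    """
    if not matches_filter:
        return query_score
    return combine(query_score, weight, boost_mode)


print(final_score(2.0, True, 10))          # filter matched, multiply
print(final_score(2.0, False, 10))         # filter not matched
print(final_score(2.5, True, 2, "sum"))    # filter matched, sum
```

Only the documents selected by the filter move in the ranking; everything else stays in the order the original query produced.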

For example, to push the Beijing Sheraton to the top of the ranking: with the original query its score is 2.6944847; after adding the score function, its score is 4.6944847. The increase of exactly 2 suggests that this run presumably added a weight of 2 to the query score (boost_mode sum), rather than using the weight of 10 with multiply from the syntax example.

1.5.3 Boolean queries


  Boolean queries fit many scenarios. When searching for Adidas on Taobao, for example, we can also filter by shoe size, gender, and so on. Each of these fields is different and needs its own kind of query condition, so several different queries are required, and a Boolean query is what combines them. In short, a Boolean query is a combination of one or more query clauses, each of which is a subquery. The clauses can be combined in these ways:

  • must: every subquery must match, similar to "and";
  • should: subqueries match optionally, similar to "or";
  • must_not: must not match, and does not take part in scoring, similar to "not";
  • filter: must match, but does not take part in scoring.
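
The four clause types can be modelled with plain predicates. The sketch below is a toy: it gives one point per matched must or should clause instead of a real relevance score, and it assumes that should clauses become mandatory only when there is no must or filter clause. The function name and sample documents are hypothetical.

```python
def evaluate_bool(doc, must=(), should=(), must_not=(), filters=()):
    """Return a toy score for doc, or None if the document does not match.

    Each clause is a function doc -> bool. Only must and should add to the score;
    filters and must_not decide matching but never change the score.
    """
    if not all(clause(doc) for clause in must):
        return None
    if not all(clause(doc) for clause in filters):
        return None
    if any(clause(doc) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(doc))
    # With no must/filter clause, at least one should clause has to match.
    if should and not must and not filters and should_hits == 0:
        return None
    return len(must) + should_hits


MUST = [lambda d: d["city"] == "北京"]
SHOULD = [lambda d: d["brand"] == "希尔顿", lambda d: d["brand"] == "喜来登"]
MUST_NOT = [lambda d: d["price"] <= 1200]

# Hypothetical documents, made up for this sketch.
DOC_A = {"city": "北京", "brand": "喜来登", "price": 1500}
DOC_B = {"city": "北京", "brand": "如家", "price": 1500}
DOC_C = {"city": "北京", "brand": "希尔顿", "price": 1000}

for doc in (DOC_A, DOC_B, DOC_C):
    print(doc["brand"], evaluate_bool(doc, MUST, SHOULD, MUST_NOT))
```

DOC_B still matches even though neither should clause hits, because the must clause is present; it just scores lower than DOC_A.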

example:

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "city": "北京" } }
      ],
      "should": [
        { "term": { "brand": "希尔顿" } },
        { "term": { "brand": "喜来登" } }
      ],
      "must_not": [
        { "range": { "price": { "lte": 1200 } } }
      ],
      "filter": [
        { "range": { "score": { "gte": 47 } } },
        {
          "geo_distance": {
            "distance": "50km",
            "location": {
              "lat": 39.91979,
              "lon": 116.41804
            }
          }
        }
      ]
    }
  }
}

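Code interpretation: the query returns hotels whose city is 北京 (must), ranks hotels of the 希尔顿 or 喜来登 brand higher (should), excludes hotels priced at 1200 or less (must_not), and keeps only hotels with a score of at least 47 that lie within 50 km of the point (39.91979, 116.41804) (filter).

Note that the more fields take part in scoring during a search, the worse the query performance. So when querying with multiple conditions, keep the following two points in mind: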

  • The keyword typed in the search box is a full-text search query; put it in must so that it takes part in scoring;
  • Put the other filter conditions in filter so that they do not take part in scoring.
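
Following these two points, a request body for a search page could be assembled like this. It is a minimal sketch: the function build_hotel_query and its parameters are hypothetical, and the field names (all, city, price) are borrowed from the queries earlier in this post.

```python
import json


def build_hotel_query(keyword, city=None, min_price=None, max_price=None):
    """Put the keyword in must (scored) and every other condition in filter (not scored)."""
    bool_query = {"must": [], "filter": []}

    if keyword:
        bool_query["must"].append({"match": {"all": keyword}})
    else:
        # No keyword typed: fall back to matching everything.
        bool_query["must"].append({"match_all": {}})

    if city:
        bool_query["filter"].append({"term": {"city": city}})

    price = {}
    if min_price is not None:
        price["gte"] = min_price
    if max_price is not None:
        price["lte"] = max_price
    if price:
        bool_query["filter"].append({"range": {"price": price}})

    return {"query": {"bool": bool_query}}


BODY = build_hotel_query("喜来登", city="北京", min_price=1200)
print(json.dumps(BODY, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted after GET /hotel/_search to try it out.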

② Processing of search results

③ RestClient query document

④ case


Origin blog.csdn.net/Onion_521257/article/details/129389228