Elasticsearch: DSL syntax for querying documents, plus the deep paging problem and its solutions

Table of contents

1. DSL query document syntax
  Preface
  1.1. Basic syntax of DSL Query
  1.2. Full text search query
    1.2.1. match query
    1.2.2. multi_match
  1.3. Exact query
    1.3.1. term query
    1.3.2. range query
  1.4. Geographical query
    1.4.1. geo_bounding_box
    1.4.2. geo_distance
  1.5. Compound query
    1.5.1. Relevance score calculation
    1.5.2. function_score
    1.5.3. Boolean query
  1.6. Search result processing
    1.6.1. Sorting
    1.6.2. Paging
      Deep paging problem
      Solution to deep paging problem
    1.6.3. Highlight
    1.6.4. Search syntax summary


1. DSL query document syntax


Preface

The examples in this article keep using the hotel data from the previous chapter.

1.1. Basic syntax of DSL Query

The basic query syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "<query-type>": {
      "<field>": "<value>"
    }
  }
}

For example, the following request matches all hotel documents. This is generally done only for testing: in a production environment the index may be very large, which makes such a query slow.

GET /hotel/_search
{
  "query": {
    "match_all": {
        // no conditions are needed, because this matches every document
    }
  }
}

1.2. Full text search query

A full-text search query runs the user's input through an analyzer, which splits it into terms, and then looks those terms up in the inverted index. It is typically used behind search boxes, such as the Baidu search box.
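
To make "split into terms, then look them up in the inverted index" concrete, here is a minimal Python sketch. It is a toy model and not how Elasticsearch is implemented: the whitespace analyzer and the sample hotel names are assumptions made up for illustration, and real analyzers (especially for Chinese text) are far more sophisticated.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase the text and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Analyze the query, then return the ids of documents containing any query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


# Hypothetical sample data, not taken from the hotel index.
docs = {
    1: "Home Inn Beijing",
    2: "Home Inn Shanghai",
    3: "Hilton Shanghai",
}
index = build_inverted_index(docs)
print(sorted(match(index, "home inn beijing")))
```

Document 2 is returned even though it does not contain "beijing", because a document only needs to contain one of the query terms to match; ranking the better matches first is the job of the relevance score discussed in 1.5.1.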

1.2.1. match query

match is one of the full-text search queries: the user's input is analyzed into terms, which are then looked up in the inverted index. The syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "match": {
      "<field>": "<text>"
    }
  }
}

For example, search for documents related to "北京如家酒店" (Home Inn hotels in Beijing).

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "北京如家酒店"
    }
  }
}

Note: "all" is a field into which several other fields, such as name and city, are copied with copy_to (see the previous chapter for details).

1.2.2. multi_match

Similar to the match query, except that it searches several fields at once.

Note: the more fields a query searches, the slower it tends to be. It is recommended to use copy_to to copy the values of those fields into a single field and search that field instead, which improves query efficiency.

The syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "multi_match": {
      "query": "<text>",
      "fields": ["<field-1>", "<field-2>"]
    }
  }
}

For example, search the hotel data on both the name and city fields.

GET /hotel/_search
{
  "query": {
    "multi_match": {
      "query": "上海如家酒店",
      "fields": ["name", "city"]
    }
  }
}

1.3. Exact query

Exact queries generally target keyword, numeric, date, boolean and similar fields, and the search condition is not analyzed.

For example, when shopping on Taobao you narrow the results with fixed options such as sales volume, reputation, or price in ascending order. Each option is a button with a fixed value, and clicking it filters or sorts the results for you.

1.3.1. term query

The term query matches on the exact value of a term and is generally used on keyword, numeric, boolean and date fields.

The syntax is as follows:

// term query
GET /<index-name>/_search
{
  "query": {
    "term": {
      "<field>": {
        "value": "<value>"
      }
    }
  }
}

Or, in shorthand:

GET /<index-name>/_search
{
  "query": {
    "term": {
      "<field>": "<value>"
    }
  }
}

For example, search for all hotels in Shanghai (city is a keyword field here).

GET /hotel/_search
{
  "query": {
    "term": {
      "city": {
        "value": "上海"
      }
    }
  }
}

Or:

GET /hotel/_search
{
  "query": {
    "term": {
      "city": "上海"
    }
  }
}

1.3.2. range query

Queries by a range of values; the range can be numeric or a range of dates.

The syntax is as follows:

// range query
GET /<index-name>/_search
{
  "query": {
    "range": {
      "<field>": {
        "gte": 10,  // greater than or equal to 10
        "lte": 20   // less than or equal to 20
      }
    }
  }
}

Or, with exclusive bounds:

GET /<index-name>/_search
{
  "query": {
    "range": {
      "<field>": {
        "gt": 10,  // greater than 10
        "lt": 20   // less than 20
      }
    }
  }
}

For example, query hotels whose price is at least 161 and at most 300.
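
The request for this example is not shown in the article, so here is a hedged sketch of how its body could be assembled in Python. The helper name range_query is made up for illustration; the price field is the one used in the later examples. The printed JSON would be sent as the body of GET /hotel/_search.

```python
import json


def range_query(field, gte=None, lte=None):
    """Build a range query body; bounds left as None are omitted."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"query": {"range": {field: bounds}}}


# Price between 161 and 300, both bounds inclusive.
body = range_query("price", gte=161, lte=300)
print(json.dumps(body, indent=2))
```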

1.4. Geographical query

Queries based on latitude and longitude, for example:

  • Didi: search for nearby taxis.
  • Ctrip: search for nearby hotels.
  • WeChat: search for people nearby.

1.4.1. geo_bounding_box

geo_bounding_box matches all documents whose geo_point value lies inside a given rectangle, which is defined by its top-left and bottom-right corners.

// geo_bounding_box query
GET /<index-name>/_search
{
  "query": {
    "geo_bounding_box": {
      "<field>": {
        "top_left": {
          "lat": 31.1, // latitude
          "lon": 121.5 // longitude
        },
        "bottom_right": {
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}

Note: this query is not used very often.

1.4.2. geo_distance

Matches all documents whose geo_point lies within a given distance of a center point.

The syntax is as follows:

// geo_distance query
GET /<index-name>/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km", // radius
      "<field>": "31.21,121.5" // latitude,longitude
    }
  }
}

For example, query the hotels within a circle of radius 10 km centered at latitude 31.21, longitude 121.5.
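
The request for this example is missing from the article, so the sketch below builds a possible body in Python. It assumes the geo_point field is called location (the name used in the summary in 1.6.4). The haversine_km helper is only an approximation, added to show what "within 10 km" means; it is not the exact distance calculation Elasticsearch performs, and the second coordinate pair is made up.

```python
import json
import math


def geo_distance_query(field, lat, lon, distance):
    """Build a geo_distance query body around the point (lat, lon)."""
    return {"query": {"geo_distance": {"distance": distance, field: f"{lat},{lon}"}}}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km (approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


# Body for GET /hotel/_search; "location" is an assumed field name.
body = geo_distance_query("location", 31.21, 121.5, "10km")
print(json.dumps(body, indent=2))

# Hypothetical hotel position slightly north of the center point.
print(haversine_km(31.21, 121.5, 31.25, 121.5) <= 10)
```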

1.5. Compound query

Compound queries can combine other simple queries to implement more complex search logic.

For example, function_score (a scoring-function query) controls the relevance score of documents and therefore their ranking. This is how paid results work: search Baidu for "infertility" and the advertised entries are pinned to the top, simply because the advertisers paid enough for that position.

1.5.1. Relevance score calculation

When we use match query, the query results will be scored according to the relevance to the search term (_score), and the returned results will be sorted in descending order by score.

Before Elasticsearch 5.0, scoring used the TF-IDF algorithm, whose score keeps growing as the term frequency increases.

Since Elasticsearch 5.0, scoring uses the BM25 algorithm: the score still increases with term frequency, but the growth curve flattens out.
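
The difference between the two curves is easy to see numerically. The sketch below compares only the term-frequency part of each formula (the IDF part is left out). It uses the square root of the frequency, as in Lucene's classic TF-IDF similarity, and the textbook form of the BM25 term-frequency factor with the default parameters k1 = 1.2 and b = 0.75. The document lengths are assumptions, and Lucene's actual implementation differs in constant factors.

```python
import math


def classic_tf(freq):
    """Term-frequency factor of classic TF-IDF: sqrt(freq), which grows without bound."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """Textbook BM25 term-frequency factor; it approaches k1 + 1 as freq grows."""
    norm = 1 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1) / (freq + k1 * norm)


for freq in (1, 2, 5, 10, 100, 1000):
    print(f"freq={freq:5d}  tf-idf factor={classic_tf(freq):7.2f}  bm25 factor={bm25_tf(freq):5.2f}")
```

With these assumptions the BM25 factor never exceeds k1 + 1 = 2.2, however often the term occurs, while the TF-IDF factor keeps growing.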

1.5.2. function_score

Using function_score, you can modify the relevance score of documents and sort them according to the new score.

Since the syntax here is relatively complicated, let’s give an example first, as follows:

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": {"match": {"city": "上海"}},
      "functions": [
        {
          "filter": {"term": {"id": "1"}},
          "weight": 10
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

  • "query": {"match": {"city": "上海"}}: the original query; documents are searched and scored by relevance (the query score).
  • "filter": {"term": {"id": "1"}}: the filter; only documents that satisfy it are re-scored.
  • "weight": 10: the scoring function; its result (the function score) is then combined with the query score to produce the new score. Common scoring functions include:
    • weight: uses a constant value as the function result.
    • field_value_factor: uses the value of a field in the document as the function result.
    • random_score: generates a random value as the function result.
    • script_score: a custom formula whose result is used as the function result.
  • "boost_mode": "multiply": the boost mode, which defines how the function score and the query score are combined:
    • multiply (default): multiply the two.
    • replace: replace the query score with the function score.
    • others: sum, avg, max, min.
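
The boost modes listed above amount to simple arithmetic on two numbers. Here is a small Python sketch of that combination step; the function name combine and the sample scores are made up for illustration, and it covers only the final combination, not how Elasticsearch computes either score.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query score with the function score according to boost_mode."""
    modes = {
        "multiply": lambda q, f: q * f,
        "replace": lambda q, f: f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": lambda q, f: max(q, f),
        "min": lambda q, f: min(q, f),
    }
    if boost_mode not in modes:
        raise ValueError(f"unknown boost_mode: {boost_mode}")
    return modes[boost_mode](query_score, function_score)


# Hypothetical numbers: the query scored the document 1.5 and the weight function returned 10.
for mode in ("multiply", "replace", "sum", "avg", "max", "min"):
    print(mode, combine(1.5, 10, mode))
```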

For example, rank hotels of the brand "如家" (Home Inn) higher.

Only the following points need to be settled:

  • Filter: the brand is "如家".
  • Scoring function: weight is enough.
  • Boost mode: sum.

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": {"match_all": {}},
      "functions": [ // scoring functions
        {
          "filter": {"term": { // condition: the brand must be 如家
            "brand": "如家"
          }},
          "weight": 3 // the function result is the constant 3
        }
      ],
      "boost_mode": "sum" // add the function score to the query score
    }
  }
}

1.5.3. Boolean query

A Boolean query combines one or more query clauses. The clauses can be combined in the following ways:

  • must: the clause must match, similar to "and".
  • should: the clause may match, similar to "or".
  • must_not: the clause must not match; it does not contribute to the score, similar to "not".
  • filter: the clause must match, but it does not contribute to the score.

For example, search for hotels whose city is Shanghai, whose brand is preferably "皇冠假日" (Crowne Plaza) or "华美达" (Ramada), whose price is not less than or equal to 500 (that is, above 500), and whose rating is at least 45. Because a must clause is present, the should clauses are optional here and only raise the score of the documents that match them.

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "city": "上海"
        }}
      ],
      "should": [
        {"term": {"brand": "皇冠假日"}},
        {"term": {"brand": "华美达"}}
      ],
      "must_not": [
        {"range": {
          "price": {
            "lte": 500
          }
        }}
      ],
      "filter": [
        {"range": {
          "score": {
            "gte": 45
          }
        }}
      ]
    }
  }
}
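
To see how the four clause types interact, here is a toy Python model of the request above. It is a sketch under simplifying assumptions: the four hotels are invented, each clause is a plain predicate, and the count of matched should clauses stands in for the score contribution, which Elasticsearch actually computes with its relevance algorithm.

```python
def bool_match(doc, must=(), should=(), must_not=(), filters=()):
    """Return (matches, number of should clauses matched) for one document."""
    required = all(clause(doc) for clause in (*must, *filters))
    excluded = any(clause(doc) for clause in must_not)
    bonus = sum(1 for clause in should if clause(doc))
    return required and not excluded, bonus


def search(docs, **clauses):
    """Keep the matching documents and rank those with more should matches first."""
    hits = []
    for doc in docs:
        matched, bonus = bool_match(doc, **clauses)
        if matched:
            hits.append((bonus, doc))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in hits]


# Hypothetical documents, not taken from the hotel index.
hotels = [
    {"name": "A", "city": "上海", "brand": "皇冠假日", "price": 800, "score": 47},
    {"name": "B", "city": "上海", "brand": "如家", "price": 600, "score": 45},
    {"name": "C", "city": "上海", "brand": "华美达", "price": 400, "score": 48},
    {"name": "D", "city": "北京", "brand": "华美达", "price": 900, "score": 49},
]

results = search(
    hotels,
    must=[lambda d: d["city"] == "上海"],
    should=[lambda d: d["brand"] == "皇冠假日", lambda d: d["brand"] == "华美达"],
    must_not=[lambda d: d["price"] <= 500],
    filters=[lambda d: d["score"] >= 45],
)
print([hotel["name"] for hotel in results])
```

Hotel B still matches although its brand satisfies neither should clause; it is merely ranked below hotel A. Hotel C is dropped by must_not and hotel D by must.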

1.6. Search result processing

1.6.1. Sorting

Elasticsearch supports sorting the search results. By default they are sorted by relevance score (_score). Fields that can be sorted on are the non-analyzed types: keyword, numeric, geographic coordinate (geo_point), and date.

The syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "match_all": {} // the search query
  },
  "sort": [
    {
      "<field>": "desc"  // sort field and sort order: asc or desc
    }
  ]
}

To sort by distance from a geographic point, the syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
          "<field>" : "<latitude>,<longitude>",
          "order" : "asc",
          "unit" : "km"
      }
    }
  ]
}

For example, sort hotels by user rating (score) in descending order, and sort hotels with the same rating by price in ascending order. Each sort field gets its own entry in the sort array, in order of priority.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "score": {
        "order": "desc"
      }
    },
    {
      "price": {
        "order": "asc"
      }
    }
  ]
}
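
The resulting order is the same as a two-key sort in ordinary code. A small Python sketch with invented hotels shows what to expect: score is the primary key in descending order, and price breaks ties in ascending order.

```python
def sort_hotels(hotels):
    """Sort by score descending, then by price ascending for equal scores."""
    return sorted(hotels, key=lambda h: (-h["score"], h["price"]))


# Hypothetical documents, not taken from the hotel index.
hotels = [
    {"name": "A", "score": 45, "price": 300},
    {"name": "B", "score": 48, "price": 500},
    {"name": "C", "score": 48, "price": 200},
    {"name": "D", "score": 45, "price": 250},
]
print([h["name"] for h in sort_hotels(hotels)])
```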

1.6.2. Paging

By default, Elasticsearch returns only the top 10 results. To fetch more, change the paging parameters.

The returned page is controlled by two parameters: from (the offset at which the page starts, 0 by default) and size (the number of documents to return). This is essentially the same as LIMIT offset, count in MySQL.

For example, fetch the 10 documents at positions 20 to 29.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 20, // offset at which the page starts, 0 by default
  "size": 10, // number of documents to return
  "sort": [
    {"price": "asc"}
  ]
}

 

Deep paging problem

Elasticsearch is distributed: the documents of an index are spread across several shards, and that is what makes deep paging a problem.

For example, to get the page from = 990, size = 10 after sorting by price:

1. Each shard sorts its own documents and returns its first 1000.

2. The results of all shards are gathered and sorted again to select the overall top 1000 documents.

3. Finally, the 10 documents starting at position 990 are taken from these 1000.

As an analogy, think of a school with 10 classes (10 shards) from which you want the top 100 students of the whole school. Exam rankings exist only per class, so you have to take the top 100 students of every class, 1,000 students in total, and rank those 1,000 again to find the overall top 100.

The deeper the page or the larger the result set, the more memory and CPU this costs. Elasticsearch therefore limits the result window: by default, from + size may not exceed 10000 (the index.max_result_window setting).
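
The three steps can be simulated in a few lines of Python. This is only a sketch of the idea: the shard count, the random prices and the function names are assumptions, and real shards hold full documents rather than bare numbers.

```python
import heapq
import random


def shard_top(shard_docs, n):
    """Step 1: a shard sorts its own documents and returns only its first n."""
    return sorted(shard_docs)[:n]


def paged_search(shards, from_, size):
    """Steps 2 and 3: merge the per-shard results, then cut out the requested page."""
    window = from_ + size
    per_shard = [shard_top(docs, window) for docs in shards]
    fetched = sum(len(part) for part in per_shard)
    merged = list(heapq.merge(*per_shard))[:window]
    return merged[from_:from_ + size], fetched


random.seed(0)
# Hypothetical data: 10 shards holding 5,000 prices each.
shards = [[random.randint(1, 1_000_000) for _ in range(5000)] for _ in range(10)]

page, fetched = paged_search(shards, from_=990, size=10)
print(len(page), "documents returned,", fetched, "documents sorted and collected")
```

To return 10 documents, every shard had to hand over 1,000, so the work grows with from + size multiplied by the number of shards. That growth is the cost the result window limit protects against.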

Solution to deep paging problem

There are two ways to page through data without this upper limit:

1. search_after (officially recommended): requires a sort. The data is sorted, and each request for the next page passes the sort values of the last document of the current page, so the query continues right after that position.

The drawback of this method is that you can only move forward page by page; jumping to an arbitrary page is not supported.

It suits searches that do not need random page access, such as scrolling down a list on a mobile phone.

2. scroll (no longer officially recommended since es 7.1): takes a snapshot of the sorted data and keeps it in memory.

The drawback is obvious: it consumes extra memory.
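
The search_after idea can be sketched in Python as follows. This is an in-memory simulation and not client code: the documents are invented, and using id as a tiebreaker after price is an assumption (some unique, sortable field is needed so that documents with equal prices are neither skipped nor repeated). In a real response, each hit carries its sort values in a "sort" array, and that array is what gets passed as search_after in the next request.

```python
def request_body(size, after=None):
    """Request body for one page; there is no "from", only the previous sort values."""
    body = {
        "query": {"match_all": {}},
        "size": size,
        "sort": [{"price": "asc"}, {"id": "asc"}],  # id as tiebreaker (assumption)
    }
    if after is not None:
        body["search_after"] = list(after)
    return body


def search_after_page(docs, size, after=None):
    """Simulate the server: return the first `size` docs strictly after the sort values."""
    ordered = sorted(docs, key=lambda d: (d["price"], d["id"]))
    if after is not None:
        ordered = [d for d in ordered if (d["price"], d["id"]) > tuple(after)]
    return ordered[:size]


# Hypothetical documents with many duplicate prices.
docs = [{"id": str(i), "price": 100 + (i % 7) * 10} for i in range(25)]

after, pages = None, []
while True:
    page = search_after_page(docs, size=10, after=after)
    if not page:
        break
    pages.append(page)
    last = page[-1]
    after = [last["price"], last["id"]]  # sort values of the last hit on this page

print([len(page) for page in pages])
print(request_body(10, after=after))
```

Each request only needs the first size documents after a known position, so no shard has to produce from + size documents. The price is that pages can only be visited in order.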

1.6.3. Highlight

Highlighting means emphasizing the search keywords wherever they appear in the search results. For example, if you search for "Java" on Baidu, every occurrence of the keyword "Java" in the results is shown in red.

The principle is as follows:

  1. Wrap the keywords in the search results in an HTML tag.
  2. Style that tag with CSS on the page.
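
Step 1 can be imitated with a few lines of Python. This is a toy sketch with a made-up function name and sample text: it wraps plain substring matches, whereas Elasticsearch highlights the analyzed terms that actually matched the query.

```python
import re


def highlight(text, keywords, pre_tag="<em>", post_tag="</em>"):
    """Wrap every case-insensitive occurrence of a keyword in pre_tag/post_tag."""
    if not keywords:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


print(highlight("Home Inn Shanghai Hongqiao", ["home", "inn"]))
```

Step 2 then happens in the page itself, for example with a CSS rule such as em { color: red; } applied to the returned fragments.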

The syntax is as follows:

GET /<index-name>/_search
{
  "query": {
    "match": {
      "<field>": "<text to search for>" // note: by default the queried field must be the same as the highlighted field, otherwise nothing is highlighted
    }
  },
  "highlight": {
    "fields": { // fields to highlight
      "<field>": {
        "pre_tags": "<em>",  // tag inserted before each highlighted term
        "post_tags": "</em>" // tag inserted after each highlighted term
      }
    }
  }
}

For example, search the brand field for "如家" (Home Inn) and highlight the matched text.

Note: by default, the field being searched must be the same as the field being highlighted, otherwise nothing is highlighted.

If you do want to highlight a field other than the one being searched, add "require_field_match": "false" to the highlight options of that field.
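
The requests for these two cases are not shown in the article, so here is a hedged Python sketch that builds both bodies. The helper name highlight_search is made up, and the choice of brand as the searched field and name as the other highlighted field follows the summary in 1.6.4 rather than anything confirmed for this example.

```python
def highlight_search(query_field, text, highlight_field=None):
    """Build a match query with a highlight section.

    When the highlighted field differs from the queried field,
    require_field_match is switched off so that highlighting still happens.
    """
    highlight_field = highlight_field or query_field
    options = {"pre_tags": "<em>", "post_tags": "</em>"}
    if highlight_field != query_field:
        options["require_field_match"] = "false"
    return {
        "query": {"match": {query_field: text}},
        "highlight": {"fields": {highlight_field: options}},
    }


# Case 1: search and highlight the same field.
same_field = highlight_search("brand", "如家")
# Case 2: search brand but highlight name (assumed field names).
other_field = highlight_search("brand", "如家", highlight_field="name")

print(same_field["highlight"])
print(other_field["highlight"])
```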

1.6.4. Search syntax summary

The syntax is as follows:

GET /hotel/_search
{
  "query": {
    "match": {
      "brand": "如家"
    }
  },
  "from": 20, // offset at which the page starts
  "size": 10, // number of documents to return
  "sort": [
    {  "price": "asc" }, // sort by price, ascending
    {
      "_geo_distance" : { // then sort by distance
          "location" : "31.04,121.61",
          "order" : "asc",
          "unit" : "km"
      }
    }
  ],
  "highlight": {
    "fields": { // fields to highlight
      "name": {
        "require_field_match": "false", // highlight even though name is not the queried field
        "pre_tags": "<em>",  // tag inserted before each highlighted term
        "post_tags": "</em>" // tag inserted after each highlighted term
      }
    }
  }
}

Origin blog.csdn.net/CYK_byte/article/details/133280895