1. DSL query document
Elasticsearch queries are written in a JSON-style DSL.
1.1. DSL query classification
Elasticsearch provides a JSON-based DSL (Domain Specific Language) for defining queries. Common query types include:
- Query all: returns all documents, generally used for testing. For example: match_all
- Full-text search (full text) query: an analyzer splits the user's input into terms, which are then matched against the inverted index. For example:
  - match
  - multi_match
- Exact query: finds documents by exact term values, generally on keyword, numeric, date, boolean and similar field types. For example:
  - ids
  - range
  - term
- Geographic (geo) query: queries based on latitude and longitude. For example:
  - geo_distance
  - geo_bounding_box
- Compound query: combines the query conditions above into a single, more complex query. For example:
  - bool
  - function_score
The query syntax is basically the same for all of them:
GET /indexName/_search
{
  "query": {
    "query type": {
      "query condition": "condition value"
    }
  }
}
Take the query-all case as an example, where:
- The query type is match_all
- There is no query condition
// Query all
GET /indexName/_search
{
  "query": {
    "match_all": {
    }
  }
}
Every other query only changes the query type and the query conditions.
1.2. Full-text search query
Usage scenarios
The basic process of a full-text search query is as follows:
1) Analyze the user's search input and split it into terms
2) Match the terms against the inverted index to get the document ids
3) Look up the documents by id and return them to the user
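The three steps above can be sketched in plain Java. This is only a toy model: the class InvertedIndexSketch, its analyze method (lowercase and split on whitespace), and the sample hotel names are all made up for illustration, and a real analyzer and index are far more sophisticated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of an inverted index, not how Elasticsearch is implemented.
class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // id -> original document text
    private final Map<Integer, String> documents = new HashMap<>();

    // Stand-in for a real analyzer: lowercase, then split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    public void add(int id, String text) {
        documents.put(id, text);
        for (String term : analyze(text)) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
    }

    public List<String> search(String input) {
        // Steps 1 and 2: analyze the input, then collect ids from the inverted index.
        Set<Integer> ids = new TreeSet<>();
        for (String term : analyze(input)) {
            ids.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        // Step 3: fetch the documents by id.
        List<String> result = new ArrayList<>();
        for (int id : ids) {
            result.add(documents.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(1, "Home Inn Hongqiao");
        idx.add(2, "Hilton Bund");
        idx.add(3, "Home Inn Bund");
        // Any document containing at least one of the input terms is returned.
        System.out.println(idx.search("hongqiao hilton"));
    }
}
```

Note that the input "hongqiao hilton" matches two different documents, because each term is looked up separately.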
Common scenarios include:
- The search box of an e-commerce site, for example JD.com
- The Baidu search box
Because matching is done on terms, the fields that take part in the search must be text-type fields that can be analyzed.
Basic syntax
Common full-text search queries include:
- match query: single-field query
- multi_match query: multi-field query; a document matches if any one of the fields matches
The match query syntax is as follows:
GET /indexName/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT"
    }
  }
}
The multi_match syntax is as follows:
GET /indexName/_search
{
  "query": {
    "multi_match": {
      "query": "TEXT",
      "fields": ["FIELD1", "FIELD2"]
    }
  }
}
Example
Example of a match query:
Example of a multi_match query:
The two queries return the same results. Why?
Because the brand, name, and business values were copied into the all field with copy_to. Searching the three fields therefore has the same effect as searching the all field.
However, the more fields a search covers, the greater the impact on query performance, so it is recommended to use copy_to and then query a single field.
Summary
What is the difference between match and multi_match?
- match: queries a single field
- multi_match: queries multiple fields; the more fields involved, the worse the query performance
1.3. Exact query
Exact queries generally target keyword, numeric, date, boolean and similar field types, so the search condition is not analyzed. The common ones are:
- term: query based on the exact value of a term
- range: query based on a range of values
term query
Because an exact query targets fields that are not analyzed, the query condition must also be a single, unanalyzed term. A document matches only when the user's input is exactly equal to the field value. If the user enters more than the exact value, nothing is found.
Syntax description:
// term query
GET /indexName/_search
{
  "query": {
    "term": {
      "FIELD": {
        "value": "VALUE"
      }
    }
  }
}
Example:
When the search input is an exact term, the query returns the expected results:
However, when the search input is not a single term but a phrase made of several words, nothing is found:
range query
A range query is generally used to filter numeric values by range, for example filtering by price range.
Basic syntax:
// range query
GET /indexName/_search
{
  "query": {
    "range": {
      "FIELD": {
        "gte": 10, // gte means greater than or equal to; gt means greater than
        "lte": 20  // lte means less than or equal to; lt means less than
      }
    }
  }
}
Summary
What are the common types of exact query?
- term query: exact match on a term, generally used on keyword, numeric, boolean, and date fields
- range query: query based on a range of values, which can be numbers or dates
1.4. Geographical coordinate query
A geographic coordinate query is a query based on latitude and longitude. See the official documentation: Geo queries, Elasticsearch Guide [8.7].
Common usage scenarios include:
- Ctrip: search for hotels near me
- Didi: find taxis near me
- WeChat: search for people near me
Rectangular range query
A rectangular range query, that is, a geo_bounding_box query, returns all documents whose coordinates fall within a given rectangle:
When querying, you specify the coordinates of the top-left and bottom-right corners of the rectangle. Every point that falls inside the rectangle they define matches.
The syntax is as follows:
// geo_bounding_box query
GET /indexName/_search
{
  "query": {
    "geo_bounding_box": {
      "FIELD": {
        "top_left": { // top-left corner
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": { // bottom-right corner
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}
Nearby query
A nearby query, also called a distance query (geo_distance), returns all documents whose distance from a specified center point is less than a given value.
In other words, pick a point on the map as the center, draw a circle with the given distance as the radius, and every coordinate that falls inside the circle matches:
Syntax description:
// geo_distance query
GET /indexName/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km", // radius
      "FIELD": "31.21,121.5" // center of the circle
    }
  }
}
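The "circle around a center point" idea can be sketched with the haversine great-circle formula. This is a minimal illustration, assuming a spherical Earth with a radius of 6371 km; the class GeoDistanceSketch and its methods are hypothetical, and Elasticsearch's own distance calculation may differ in detail.

```java
// Hypothetical sketch of a distance check, not Elasticsearch's implementation.
class GeoDistanceSketch {
    // Assumption: mean Earth radius in kilometres on a perfect sphere.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two lat/lon points, via the haversine formula.
    public static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // A point matches when it lies inside the circle around the center.
    public static boolean withinDistance(double centerLat, double centerLon,
                                         double lat, double lon, double radiusKm) {
        return haversineKm(centerLat, centerLon, lat, lon) <= radiusKm;
    }

    public static void main(String[] args) {
        // Center 31.21,121.5 as in the syntax above; the second point is made up.
        double d = haversineKm(31.21, 121.5, 31.21, 121.6);
        System.out.println("distance in km: " + d);
        System.out.println("within 15km: " + withinDistance(31.21, 121.5, 31.21, 121.6, 15));
        System.out.println("within 3km: " + withinDistance(31.21, 121.5, 31.21, 121.6, 3));
    }
}
```

The sample point is roughly 9.5 km east of the center, so it falls inside a 15 km circle but outside a 3 km one.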
Example:
Search for hotels within 15km of Lujiazui:
A total of 47 hotels are found.
Then shorten the radius to 3km:
The number of hotels found drops to 5.
1.5. Compound query
A compound query combines other, simpler queries to implement more complex search logic. There are two common ones:
- function score: a scoring-function query that controls how document relevance is calculated and therefore how documents are ranked
- bool query: a Boolean query that combines multiple other queries with logical relationships to implement complex searches
1.5.1. Relevance score
When we use a match query, each matching document is scored (_score) by its relevance to the search terms, and the results are returned in descending order of score.
For example, if we search for "Hongqiao Home Inn", the results are as follows:
[
  {
    "_score" : 17.850193,
    "_source" : {
      "name" : "虹桥如家酒店真不错"
    }
  },
  {
    "_score" : 12.259849,
    "_source" : {
      "name" : "外滩如家酒店真不错"
    }
  },
  {
    "_score" : 11.91091,
    "_source" : {
      "name" : "迪士尼如家酒店真不错"
    }
  }
]
Early versions of elasticsearch scored documents with the TF-IDF algorithm. Since version 5.0, the default scoring algorithm is BM25.
TF-IDF has a flaw: the score keeps rising as the term frequency rises, so a single term can dominate a document's score. BM25 puts an upper limit on the score contributed by a single term, so its curve is smoother:
Summary: elasticsearch scores documents by the relevance between the query terms and the document. There are two algorithms:
- TF-IDF algorithm
- BM25 algorithm, the default since elasticsearch 5.0
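The "upper limit for a single term" can be seen numerically. The sketch below evaluates only the term-frequency part of the textbook BM25 formula, f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl)), with the commonly quoted defaults k1 = 1.2 and b = 0.75. The class name and the choice of raw frequency as the unbounded comparison are assumptions made for illustration; the exact formula in a given Lucene or Elasticsearch version may differ by constant factors.

```java
// Hypothetical sketch of BM25 term-frequency saturation, textbook form only.
class Bm25SaturationSketch {
    static final double K1 = 1.2;  // commonly quoted default
    static final double B = 0.75;  // commonly quoted default

    // Term-frequency component of BM25 for a term occurring `freq` times
    // in a document of length docLen, given the average document length.
    public static double bm25Tf(double freq, double docLen, double avgDocLen) {
        double lengthNorm = 1 - B + B * (docLen / avgDocLen);
        return freq * (K1 + 1) / (freq + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Raw frequency grows without bound; the BM25 component approaches K1 + 1.
        for (int freq : new int[] {1, 2, 5, 10, 100, 1000}) {
            System.out.println("freq=" + freq
                    + " raw=" + freq
                    + " bm25=" + bm25Tf(freq, 100, 100));
        }
    }
}
```

However often the term repeats, the BM25 component never exceeds k1 + 1, which is the flattening curve described above.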
1.5.2. Function score query
Scoring by relevance is a reasonable requirement, but what is reasonable is not necessarily what the product manager wants.
Take Baidu as an example: in its search results, a higher relevance does not necessarily mean a higher ranking. Whoever pays more ranks higher.
If you want to control how the relevance score is calculated, you need the function score query in elasticsearch.
1) Syntax description
The function score query contains four parts:
- Original query condition: the query part. Documents are searched with this condition and scored with the BM25 algorithm, giving the original score (query score)
- Filter condition: the filter part. Only documents that meet this condition have their score recalculated
- Scoring function: documents that meet the filter condition are run through this function to obtain the function score. There are four functions:
  - weight: the function result is a constant
  - field_value_factor: uses the value of a field in the document as the function result
  - random_score: uses a random number as the function result
  - script_score: a custom scoring algorithm
- Boost mode: how the function score and the query score are combined, including:
  - multiply: multiply the two scores
  - replace: replace the query score with the function score
  - Others, such as: sum, avg, max, min
The function score query runs as follows:
1) Search documents with the original condition and calculate the relevance score, called the original score (query score)
2) Filter the documents with the filter condition
3) For documents that meet the filter condition, calculate the function score with the scoring function
4) Combine the original score (query score) and the function score according to the boost mode to get the final relevance score
So the key points here are:
- Filter condition: determines which documents have their score modified
- Scoring function: determines how the function score is calculated
- Boost mode: determines how the final score is calculated
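The four-step flow can be sketched as a small scoring routine. This is a simplified model that follows the description above (documents that fail the filter keep their query score); the class FunctionScoreSketch, the BoostMode enum, and the sample values are hypothetical and do not reproduce Elasticsearch's internal scoring code.

```java
import java.util.function.Predicate;

// Hypothetical model of the flow described above, not Elasticsearch internals.
class FunctionScoreSketch {
    enum BoostMode { MULTIPLY, REPLACE, SUM, AVG, MAX, MIN }

    // Step 4: combine the query score and the function score.
    public static double combine(double queryScore, double functionScore, BoostMode mode) {
        switch (mode) {
            case MULTIPLY: return queryScore * functionScore;
            case REPLACE:  return functionScore;
            case SUM:      return queryScore + functionScore;
            case AVG:      return (queryScore + functionScore) / 2;
            case MAX:      return Math.max(queryScore, functionScore);
            case MIN:      return Math.min(queryScore, functionScore);
            default:       throw new IllegalArgumentException("unknown mode " + mode);
        }
    }

    // Steps 2 and 3 with a `weight` function: the function score is a constant,
    // applied only to documents that pass the filter.
    public static double finalScore(String brand, double queryScore,
                                    Predicate<String> filter, double weight, BoostMode mode) {
        if (!filter.test(brand)) {
            return queryScore; // simplification: unmatched documents are left as-is
        }
        return combine(queryScore, weight, mode);
    }

    public static void main(String[] args) {
        Predicate<String> isHomeInn = brand -> brand.equals("Home Inn");
        System.out.println(finalScore("Home Inn", 10.0, isHomeInn, 2.0, BoostMode.SUM));      // 12.0
        System.out.println(finalScore("Hilton", 10.0, isHomeInn, 2.0, BoostMode.SUM));        // 10.0
        System.out.println(finalScore("Home Inn", 10.0, isHomeInn, 2.0, BoostMode.MULTIPLY)); // 20.0
    }
}
```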
2) Example
Requirement: rank hotels with the brand "Home Inn" higher
Translate this requirement into the four parts mentioned before:
- Original condition: not fixed, can be any query
- Filter condition: brand = "Home Inn"
- Scoring function: keep it simple and give a fixed result directly, that is, weight
- Boost mode: for example, sum
So the final DSL statement is as follows:
GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": { .... }, // original query, can be any condition
      "functions": [ // scoring functions
        {
          "filter": { // condition to satisfy: the brand must be 如家 (Home Inn)
            "term": {
              "brand": "如家"
            }
          },
          "weight": 2 // scoring weight is 2
        }
      ],
      "boost_mode": "sum" // boost mode: sum
    }
  }
}
Test: before the scoring function is added, Home Inn's score is as follows:
After the scoring function is added, Home Inn's score increases:
3) Summary
What are the three elements that define a function score query?
- Filter condition: which documents get extra points
- Scoring function: how the function score is calculated
- Boost mode: how the function score and the query score are combined
1.5.3. Boolean query
A Boolean query is a combination of one or more query clauses, each of which is a subquery. Subqueries can be combined in the following ways:
- must: every subquery must match, similar to "and"
- should: subqueries match optionally, similar to "or"
- must_not: must not match, does not participate in scoring, similar to "not"
- filter: must match, does not participate in scoring
For example, when searching for hotels, in addition to a keyword search we may also filter on fields such as brand, price, and city:
Each field needs its own kind of query condition, so several different queries are required. To combine them, use a bool query.
Note that the more fields participate in scoring, the worse the query performance. For multi-condition queries it is therefore recommended to:
- Treat the keyword search in the search box as a full-text search query, put it in must, and let it participate in scoring
- Put the other filter conditions in filter, so they do not participate in scoring
Syntax example
GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"city": "上海" }}
      ],
      "should": [
        {"term": {"brand": "皇冠假日" }},
        {"term": {"brand": "华美达" }}
      ],
      "must_not": [
        { "range": { "price": { "lte": 500 } }}
      ],
      "filter": [
        { "range": {"score": { "gte": 45 } }}
      ]
    }
  }
}
Example
Requirement: search for hotels whose name contains "Home Inn", whose price is not higher than 400, and which lie within 10km of the coordinates 31.21, 121.5.
Analysis:
- The name search is a full-text search query and should participate in scoring. Put it in must
- The price limit of 400 is a range query. It is a filter condition and does not participate in scoring. Put the opposite condition (price greater than 400) in must_not
- The 10km limit is a geo_distance query. It is a filter condition and does not participate in scoring. Put it in filter
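The way the three clauses cooperate can be sketched with plain predicates. This is only a model of the matching logic: the Hotel class, the precomputed distanceKm field, and the sample data are invented for illustration, and real scoring by the must clause is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of bool query matching for the requirement above.
class BoolQuerySketch {
    static class Hotel {
        final String name;
        final int price;
        final double distanceKm; // assumed: distance to 31.21,121.5, computed elsewhere

        Hotel(String name, int price, double distanceKm) {
            this.name = name;
            this.price = price;
            this.distanceKm = distanceKm;
        }
    }

    public static List<String> search(List<Hotel> hotels) {
        Predicate<Hotel> must = h -> h.name.contains("Home Inn"); // scored in a real query
        Predicate<Hotel> mustNot = h -> h.price > 400;            // excludes, never scored
        Predicate<Hotel> filter = h -> h.distanceKm <= 10;        // required, never scored

        List<String> names = new ArrayList<>();
        for (Hotel h : hotels) {
            if (must.test(h) && !mustNot.test(h) && filter.test(h)) {
                names.add(h.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hotel> hotels = Arrays.asList(
                new Hotel("Home Inn Lujiazui", 350, 2.0),
                new Hotel("Home Inn Bund", 450, 3.0),      // too expensive
                new Hotel("Home Inn Hongqiao", 300, 18.0), // too far away
                new Hotel("Hilton Lujiazui", 380, 1.0));   // name does not match
        System.out.println(search(hotels)); // [Home Inn Lujiazui]
    }
}
```

Expressing "not higher than 400" as a must_not on "greater than 400" gives the same matches as a filter on "less than or equal to 400".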
Summary
How many logical relationships does a bool query have?
- must: conditions that must match, can be understood as "and"
- should: conditions that match optionally, can be understood as "or"
- must_not: conditions that must not match, do not participate in scoring
- filter: conditions that must match, do not participate in scoring
2. Search result processing
The search results can be processed or displayed in a way specified by the user.
2.1. Sorting
Elasticsearch sorts by relevance score (_score) by default, but it also supports custom sorting of search results. Field types that can be sorted include keyword, numeric, geographic coordinate, and date types.
Ordinary field sorting
The syntax for sorting by keyword, numeric, and date fields is basically the same.
Syntax:
GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FIELD": "desc" // sort field and sort order: asc or desc
    }
  ]
}
The sort condition is an array, so multiple sort conditions can be written. They apply in the order declared: when documents are equal on the first condition, they are sorted by the second condition, and so on.
Example:
Requirement description: sort the hotel data in descending order of user rating (score); hotels with the same rating are sorted in ascending order of price (price)
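The "first condition, then second condition" rule behaves like a chained comparator. The sketch below mirrors the requirement in plain Java; the Hotel class and the sample ratings and prices are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory equivalent of sorting by score desc, then price asc.
class SortSketch {
    static class Hotel {
        final String name;
        final int score;
        final int price;

        Hotel(String name, int score, int price) {
            this.name = name;
            this.score = score;
            this.price = price;
        }
    }

    public static List<String> sortedNames(List<Hotel> hotels) {
        List<Hotel> copy = new ArrayList<>(hotels);
        // First condition: score descending. Second condition, used only on ties: price ascending.
        copy.sort(Comparator.comparingInt((Hotel h) -> h.score).reversed()
                .thenComparingInt(h -> h.price));
        List<String> names = new ArrayList<>();
        for (Hotel h : copy) {
            names.add(h.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hotel> hotels = Arrays.asList(
                new Hotel("A", 45, 300),
                new Hotel("B", 48, 500),
                new Hotel("C", 45, 200));
        System.out.println(sortedNames(hotels)); // [B, C, A]
    }
}
```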
Sort by geographic coordinates
Sorting by geographic coordinates is slightly different.
Syntax description:
GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
        "FIELD" : "latitude,longitude", // name of the geo_point field in the document, and the target coordinates
        "order" : "asc", // sort order
        "unit" : "km" // distance unit used for sorting
      }
    }
  ]
}
The meaning of this query is:
- Specify a coordinate as the target point
- For each document, calculate the distance from the coordinates in the specified field (which must be of type geo_point) to the target point
- Sort by that distance
Example:
Requirement description: sort the hotel data in ascending order of distance from your own location
Suppose my location is 31.034661, 121.612282, and I am looking for the nearest hotels around me.
2.2. Pagination
By default, elasticsearch returns only the top 10 documents. To fetch more, you need to change the paging parameters. In elasticsearch, the from and size parameters control which page of results is returned:
- from: the offset of the first document to return
- size: how many documents to return
This is similar to limit ?, ? in MySQL.
Basic pagination
The basic syntax of pagination is as follows:
GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0, // where the page starts, default 0
  "size": 10, // number of documents to return
  "sort": [
    {"price": "asc"}
  ]
}
Deep pagination problem
Suppose I want the documents ranked 990 to 1000. The query would be written as follows:
GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 990, // where the page starts, default 0
  "size": 10, // number of documents to return
  "sort": [
    {"price": "asc"}
  ]
}
This returns 10 documents starting at offset 990, that is, the documents ranked 990 to 1000.
Internally, however, elasticsearch must first fetch the top 1000 documents and then cut out the 10 documents ranked 990 to 1000:
If elasticsearch runs as a single node, fetching the TOP1000 has little impact.
In production, however, elasticsearch is almost always deployed as a cluster. Say the cluster has 5 nodes and I want the TOP1000. Fetching 200 documents from each node is not enough.
The reason is that the TOP200 of node A might rank beyond 10,000 when compared with the documents on another node.
Therefore, to obtain the TOP1000 of the entire cluster, you must first fetch the TOP1000 of every node, merge the results, re-rank them, and then cut out the TOP1000 again.
What if I want the documents ranked 9900 to 10000? Then every node has to return its TOP10000, and all of those results are merged in memory.
When the paging depth is large, the amount of merged data puts heavy pressure on memory and CPU. For this reason elasticsearch rejects requests where from + size exceeds 10,000.
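The merge step can be sketched as follows. This toy model uses lists of integers as stand-ins for the sort values held on each node; the class DeepPagingSketch and its methods are hypothetical and only illustrate why every node has to contribute from + size candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of from + size paging across several nodes.
class DeepPagingSketch {
    // Each node returns its own top (from + size) values, sorted ascending.
    public static List<Integer> nodeTop(List<Integer> nodeValues, int from, int size) {
        List<Integer> sorted = new ArrayList<>(nodeValues);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(from + size, sorted.size())));
    }

    // The coordinator merges all candidates, re-sorts them, then cuts out the page.
    public static List<Integer> page(List<List<Integer>> nodes, int from, int size) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> node : nodes) {
            merged.addAll(nodeTop(node, from, size));
        }
        Collections.sort(merged);
        int start = Math.min(from, merged.size());
        int end = Math.min(from + size, merged.size());
        return new ArrayList<>(merged.subList(start, end));
    }

    // Upper bound on the number of candidates the coordinator must hold in memory.
    public static int candidatesHeld(int nodeCount, int from, int size) {
        return nodeCount * (from + size);
    }

    public static void main(String[] args) {
        List<List<Integer>> nodes = Arrays.asList(
                Arrays.asList(1, 2, 3, 4),
                Arrays.asList(10, 20, 30, 40),
                Arrays.asList(5, 6, 7, 8));
        System.out.println(page(nodes, 2, 2));            // [3, 4]
        System.out.println(candidatesHeld(5, 990, 10));   // 5000
        System.out.println(candidatesHeld(5, 9990, 10));  // 50000
    }
}
```

In the sample data the whole page comes from the first node, which is exactly why taking an equal share from each node would give the wrong answer.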
For deep paging, ES provides two solutions (see the official documentation):
- search after: requires sorting when paging. It fetches the next page starting from the sort value of the last document on the previous page. This is the officially recommended approach.
- scroll: builds a snapshot of the sorted document ids and keeps it in memory. It is no longer officially recommended.
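The search after principle, continuing from the last sort value instead of skipping from documents, can be sketched like this. The class SearchAfterSketch is hypothetical, and the sketch assumes the sort values are unique; with duplicate sort values a tiebreaker field would be needed so that no document is skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of search after paging over an already sorted list.
class SearchAfterSketch {
    // Returns the next `size` values that come after the last sort value seen.
    // Pass null as lastSortValue to get the first page.
    public static List<Integer> nextPage(List<Integer> sortedPrices, Integer lastSortValue, int size) {
        List<Integer> page = new ArrayList<>();
        for (int price : sortedPrices) {
            if (lastSortValue != null && price <= lastSortValue) {
                continue; // already returned on an earlier page
            }
            page.add(price);
            if (page.size() == size) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(100, 150, 200, 250, 300);
        List<Integer> first = nextPage(prices, null, 2);
        System.out.println(first); // [100, 150]
        // The last sort value of one page is the starting point of the next.
        List<Integer> second = nextPage(prices, first.get(first.size() - 1), 2);
        System.out.println(second); // [200, 250]
    }
}
```

Because each page only needs the previous page's last sort value, the cost does not grow with the paging depth, but jumping straight to an arbitrary page is not possible.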
Summary
Common pagination approaches and their advantages and disadvantages:
- from + size:
  - Advantages: supports jumping to any page
  - Disadvantages: deep pagination problem; the default upper limit for from + size is 10000
  - Scenario: searches that need random page access, such as Baidu, JD.com, Google, and Taobao
- search after:
  - Advantages: no upper limit on the total (as long as the size of a single query does not exceed 10000)
  - Disadvantages: can only page forward one page at a time; does not support jumping to any page
  - Scenario: searches that do not need random page access, such as scrolling down to load more on a phone
- scroll:
  - Advantages: no upper limit on the total (as long as the size of a single query does not exceed 10000)
  - Disadvantages: extra memory consumption, and the search results are not real-time
  - Scenario: fetching and migrating massive amounts of data. No longer recommended since ES 7.1; use search after instead.
2.3. Highlight
Highlighting principle
What is highlighting?
When we search on Baidu or JD.com, the keywords in the results turn red so that they stand out. This is called highlighting:
Highlighting is implemented in two steps:
1) Wrap every keyword in the document in a tag, such as the <em> tag
2) On the page, write CSS styles for the <em> tag
Implementing highlighting
Highlight syntax:
GET /hotel/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT" // query condition; highlighting requires a full-text search query
    }
  },
  "highlight": {
    "fields": { // fields to highlight
      "FIELD": {
        "pre_tags": "<em>", // tag placed before the highlighted text
        "post_tags": "</em>" // tag placed after the highlighted text
      }
    }
  }
}
Note:
- Highlighting applies to keywords, so the search condition must contain keywords. It cannot be a range query.
- By default, the highlighted field must be the same as the field the search targets, otherwise nothing is highlighted
- To highlight a field that is not the search field, add the attribute require_field_match=false
Example:
Summary
The query DSL is one large JSON object with the following properties:
- query: query condition
- from and size: paging conditions
- sort: sorting conditions
- highlight: highlight condition
Example:
3. RestClient query documents
Document queries also use the RestHighLevelClient object introduced earlier. The basic steps are:
- 1) Prepare the Request object
- 2) Prepare the request parameters
- 3) Send the request
- 4) Parse the response
3.1. Quick start
Let's take the match_all query as an example.
Initiate a query request
Code interpretation:
- The first step is to create a SearchRequest object and specify the index name
- The second step is to use request.source() to build the DSL, which can include query, paging, sorting, highlighting, etc.
  - query(): represents the query condition; QueryBuilders.matchAllQuery() builds the DSL of a match_all query
- The third step is to use client.search() to send the request and get the response
There are two key APIs here. One is request.source(), which covers all functions such as query, sorting, paging, and highlighting:
The other is QueryBuilders, which contains the various queries such as match, term, function_score, and bool:
Parse the response
Analysis of the response result:
The result returned by elasticsearch is a JSON string with the following structure:
- hits: the hit results
  - total: the total number of hits, where value is the actual count
  - max_score: the relevance score of the highest-scoring document across all results
  - hits: the array of documents in the search results, each of which is a JSON object
    - _source: the original data of the document, also a JSON object
Parsing the response therefore means parsing the JSON string layer by layer. The process is as follows:
- SearchHits: obtained through response.getHits(); this is the outermost hits in the JSON and represents the hit results
  - SearchHits#getTotalHits().value: gets the total number of hits
  - SearchHits#getHits(): gets the SearchHit array, which is the document array
    - SearchHit#getSourceAsString(): gets the _source of a document result, which is the original JSON document data
Full code
The complete code is as follows:
@Test
void testMatchAll() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    request.source()
        .query(QueryBuilders.matchAllQuery());
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1. Get the total number of hits
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " documents");
    // 4.2. Document array
    SearchHit[] hits = searchHits.getHits();
    // 4.3. Iterate over the documents
    for (SearchHit hit : hits) {
        // Get the document source
        String json = hit.getSourceAsString();
        // Deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        System.out.println("hotelDoc = " + hotelDoc);
    }
}
Summary
The basic steps of a query are:
1) Create a SearchRequest object
2) Prepare request.source(), which is the DSL
  ① Use QueryBuilders to build the query condition
  ② Pass it to the query() method of request.source()
3) Send the request and get the result
4) Parse the result (follow the JSON result from the outside in, layer by layer)
3.2. match query
The API for the full-text match and multi_match queries is basically the same as for match_all. The difference lies in the query condition, that is, the query part.
In the Java code the difference is therefore mainly the argument passed to request.source().query(), which is again built with the methods provided by QueryBuilders:
The result-parsing code is exactly the same and can be extracted and shared.
The complete code is as follows:
@Test
void testMatch() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    request.source()
        .query(QueryBuilders.matchQuery("all", "如家"));
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}
3.3. Exact query
There are two main exact queries:
- term: exact term match
- range: range query
Compared with the previous queries, the only difference is again the query condition. Everything else stays the same.
The API for building the query condition is as follows:
3.4. Boolean query
A Boolean query combines other queries with must, must_not, filter, etc. The code example is as follows:
As you can see, the only difference from the other queries is how the query condition is built with QueryBuilders. The result parsing and the rest of the code are completely unchanged.
The complete code is as follows:
@Test
void testBool() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    // 2.1. Prepare the bool query
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.2. Add the term query
    boolQuery.must(QueryBuilders.termQuery("city", "杭州"));
    // 2.3. Add the range query
    boolQuery.filter(QueryBuilders.rangeQuery("price").lte(250));
    request.source().query(boolQuery);
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}
3.5. Sorting and paging
Sorting and paging of search results are parameters at the same level as query, so they are also set through request.source().
The corresponding APIs are as follows:
Full code example:
@Test
void testPageAndSort() throws IOException {
    // Page number and page size
    int page = 1, size = 5;
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    // 2.1. query
    request.source().query(QueryBuilders.matchAllQuery());
    // 2.2. Sorting: sort
    request.source().sort("price", SortOrder.ASC);
    // 2.3. Paging: from and size (use the size variable rather than a hard-coded 5)
    request.source().from((page - 1) * size).size(size);
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}
3.6. Highlight
The highlighting code differs from the previous code in two ways:
- Query DSL: in addition to the query condition, you need to add a highlight condition, which is at the same level as query
- Result parsing: in addition to parsing the _source document data, you also need to parse the highlight result
Building the highlight request
The API for building the highlight request is as follows:
The code above omits the query condition, but don't forget: a highlight query must use a full-text search with search keywords, otherwise there are no keywords to highlight.
The complete code is as follows:
@Test
void testHighlight() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    // 2.1. query
    request.source().query(QueryBuilders.matchQuery("all", "如家"));
    // 2.2. Highlighting
    request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}
Parsing the highlight result
By default, the highlight result is kept separate from the document result in the response.
Parsing the highlight therefore needs some extra handling:
Code interpretation:
- Step 1: get the source from the result with hit.getSourceAsString(). This is the non-highlighted result, a JSON string, and it still needs to be deserialized into a HotelDoc object
- Step 2: get the highlight result with hit.getHighlightFields(). The return value is a Map whose key is the highlighted field name and whose value is a HighlightField object representing the highlighted value
- Step 3: get the HighlightField object from the map by the highlighted field name
- Step 4: get the fragments from the HighlightField and convert them to a string. This is the actual highlighted string
- Step 5: replace the non-highlighted value in HotelDoc with the highlighted one
The complete code is as follows:
private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1. Get the total number of hits
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " documents");
    // 4.2. Document array
    SearchHit[] hits = searchHits.getHits();
    // 4.3. Iterate over the documents
    for (SearchHit hit : hits) {
        // Get the document source
        String json = hit.getSourceAsString();
        // Deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        // Get the highlight result
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        if (!CollectionUtils.isEmpty(highlightFields)) {
            // Look up the highlight result by field name
            HighlightField highlightField = highlightFields.get("name");
            if (highlightField != null) {
                // Get the highlighted value
                String name = highlightField.getFragments()[0].string();
                // Overwrite the non-highlighted value
                hotelDoc.setName(name);
            }
        }
        System.out.println("hotelDoc = " + hotelDoc);
    }
}