Optimization methods at the query syntax level
1. If only the document doc_id is needed, configure "_source": false
If we only need the doc_id of each document and none of the fields in _source, we can add the configuration "_source": false. ES then no longer has to load and return _source for every hit, which removes most of the work of the fetch phase and greatly speeds up the query.
Before modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  },
  "_source": ""
}
After modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  },
  "_source": false
}
2. Change the query clause to a filter clause
Use filter context instead of query context, because filter clauses perform better than query clauses. A filter clause does not need to compute a relevance score, whereas a query clause does. The results of filter clauses can also be cached.
Before modification:
GET /my-index-000001/_search?
{
  "query": {
    "term": {
      "field_name": "field_value"
    }
  }
}
After modification:
GET /my-index-000001/_search?
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "field_name": "field_value"
        }
      }
    }
  }
}
3. As long as requests do not time out, increase the number of records fetched by each scroll
We can make each scroll return more data by increasing size, which reduces the number of query, fetch, and response rounds and improves efficiency. Because each scroll then returns more data and takes longer, remember to increase the timeout as well.
Before modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  },
  "size": 1000
}
After modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  },
  "size": 10000
}
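The trade-off between batch size and timeout can be sketched with the Python client. This is a minimal sketch rather than the article's own code: the helper name scroll_all and its default values are hypothetical, and it assumes a 7.x-style elasticsearch-py client passed in by the caller (search, scroll, and clear_scroll are the client methods used). The scroll keep-alive is raised together with size so that each larger batch can be processed before the scroll context expires.

```python
def scroll_all(client, index, query, size=10000, scroll="5m"):
    """Yield every hit of `query`, fetching `size` documents per scroll round.

    `client` is assumed to be an elasticsearch.Elasticsearch instance
    (7.x-style API). `scroll` is the keep-alive of the scroll context; raise
    it together with `size`, since larger batches take longer to process.
    """
    response = client.search(
        index=index,
        scroll=scroll,
        size=size,
        body={"query": query, "sort": ["_doc"]},
    )
    scroll_id = response.get("_scroll_id")
    try:
        while response["hits"]["hits"]:
            for hit in response["hits"]["hits"]:
                yield hit
            # Each round trip renews the keep-alive and returns the next batch.
            response = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        if scroll_id:
            # Free the server-side scroll context as soon as we are done.
            client.clear_scroll(scroll_id=scroll_id)
```

Because the function is a generator, the caller decides how fast batches are consumed; slow per-document processing is exactly the case in which the keep-alive needs to be longer.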
4. Sort by _doc
The official Elasticsearch documentation explains that _doc sorts by index order, which is the most efficient sort order. If you do not care about the order in which documents are returned, you should sort by _doc to improve query performance. This works especially well when scrolling.
Document address: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/sort-search-results.html
Before modification:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "field_name": "field_value"
        }
      }
    }
  }
}
After modification:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "field_name": "field_value"
        }
      }
    }
  },
  "sort": ["_doc"]
}
5. Reduce unnecessary returned fields (use _source filtering)
Returning only the fields that are actually needed effectively reduces the time spent in the fetch phase of the query.
Before modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  }
}
After modification:
GET /my-index-000001/_search?
{
  "query": {
    "match_all": {}
  },
  "_source": ["need-field-1", "need-field-2"]
}
6. Avoid fuzzy matching
7. Use filter_path to filter the returned results
Adding filter_path reduces network IO. Note that if scroll is used, the _scroll_id field must be included in filter_path.
Document address: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/common-options.html#common-options-response-filtering
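As an illustration, the following sketch builds a filter_path value for a scroll request. The helper name scroll_filter_path is hypothetical, and the field names are the placeholders used earlier in this article. It always keeps _scroll_id, because without it in the filtered response the client cannot issue the next scroll request.

```python
def scroll_filter_path(source_fields=()):
    """Build a comma-separated filter_path value that is safe to use with scroll.

    "_scroll_id" is always kept, because the next scroll request needs it.
    """
    paths = ["_scroll_id", "hits.hits._id"]
    paths += ["hits.hits._source." + field for field in source_fields]
    return ",".join(paths)


# Value for a request such as
#   GET /my-index-000001/_search?scroll=1m&filter_path=<value>
print(scroll_filter_path(["need-field-1", "need-field-2"]))
```

Everything not listed (took, _shards, hits.total, and so on) is dropped from the response, which is where the saving in network IO comes from.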
8. scroll scan (before version 2.1)
It is no longer supported after version 2.1.
9. search after
In tests on version 7.10.2 without PIT: when sorting by _doc, the full query speed of search after is basically the same as that of scroll, but a small amount of data may be missed; when sorting by _id, the full query speed of search after appears to be significantly slower than that of scroll. (These test results are inconsistent with the results reported in some articles and need further analysis.)
Document address: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/paginate-search-results.html#search-after
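# es_client is assumed to be an already constructed elasticsearch.Elasticsearch client.
response = es_client.search(
    index="my-index-0001",
    size=10000,
    body={"query": {...}, "sort": ["_doc"]},  # query body elided in the original
)
while response["hits"]["hits"]:
    last = None
    for item in response["hits"]["hits"]:
        last = item["sort"]  # sort values of the most recently seen hit
        # ... processing logic
    # Fetch the next page, starting after the last hit of the current page.
    response = es_client.search(
        index="my-index-0001",
        size=10000,
        body={"query": {...}, "search_after": last, "sort": ["_doc"]},
    )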
10. search after + PIT (concurrent method)
Added to X-Pack in version 7.10; added in version 7.11.
Document address: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/point-in-time-api.html
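A minimal sketch of the request bodies involved, assuming a PIT has already been opened (for example with POST /my-index-000001/_pit?keep_alive=1m, which returns an id). The helper name pit_search_body, the placeholder "<pit-id>", and the sort field created_at are hypothetical; the sort is supplied by the caller and should order hits unambiguously.

```python
def pit_search_body(pit_id, query, sort, search_after=None, size=10000, keep_alive="1m"):
    """Build the body of one search_after page that reads from a point in time.

    The request is sent to /_search without an index name, because the PIT
    already identifies the target indices.
    """
    body = {
        "size": size,
        "query": query,
        "sort": sort,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body


sort = [{"created_at": "asc"}]  # hypothetical sort field
first_page = pit_search_body("<pit-id>", {"match_all": {}}, sort)
next_page = pit_search_body("<pit-id>", {"match_all": {}}, sort, search_after=[1672531200000])
```

Each page repeats the same PIT id, so all pages see the same snapshot of the index even if documents are written in between.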
11. slice scroll (concurrent method)
Concurrent scroll can be supported through slice scroll.
However, if there is no suitable field to slice on and the number of slices exceeds the number of shards in the index, ES needs O(N) time and space to complete the split, and the split only finishes after a considerable proportion of the query has been executed. In tests on version 7.10.2, when the number of slices exceeded the number of shards, ES had to work through about 60% to 70% of the query before the split completed. Until then, the combined scroll speed of all processes was basically the same as the speed of a single-process scroll.
Document address: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/paginate-search-results.html#slice-scroll
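The per-worker request bodies for a sliced scroll can be sketched as follows. The helper name sliced_scroll_bodies is hypothetical; the slice object with id and max is the documented request syntax. Given the observation above, it seems reasonable to keep num_slices at or below the number of shards unless a suitable slice field exists.

```python
def sliced_scroll_bodies(query, num_slices, size=10000):
    """Return one scroll request body per slice; each can be run by its own worker."""
    if num_slices < 2:
        raise ValueError("slicing needs at least two slices")
    return [
        {
            "slice": {"id": slice_id, "max": num_slices},
            "query": query,
            "sort": ["_doc"],
            "size": size,
        }
        for slice_id in range(num_slices)
    ]


# One body per worker process; each worker then scrolls its own slice.
bodies = sliced_scroll_bodies({"match_all": {}}, num_slices=3)
```

Each body would be sent as its own scroll request (for example GET /my-index-000001/_search?scroll=1m), and the slices together cover the full result set without overlap.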
Optimization methods in index design
1. Change string fields to number or date types
Because of how fields are indexed, range filtering is very inefficient on string fields but very efficient on number and date fields. Therefore, if a field semantically holds a number or a date, it should not be stored as a string.
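A small sketch of what this looks like in practice. The field names created_at and price are hypothetical; the point is that the mapping declares date and numeric types, so that a range filter operates on typed values rather than on strings.

```python
# Hypothetical field names; what matters is the declared field type.
mappings = {
    "properties": {
        "created_at": {"type": "date"},
        "price": {"type": "integer"},
    }
}


def range_filter(field, gte, lt):
    """Build a filter-context range query on a numeric or date field."""
    return {"bool": {"filter": {"range": {field: {"gte": gte, "lt": lt}}}}}


query = range_filter("created_at", "2023-01-01", "2023-02-01")
```

The mappings dictionary would be supplied when the index is created, and query would go under the "query" key of a search request.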
2. Lower the nesting level
The nesting level can be reduced by flattening deeply nested fields (for example, "field-1": {"field-1-1": 1, "field-1-2": 2}).
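One way to flatten documents before indexing them is sketched below. The function name flatten and the "_" separator are assumptions made for illustration; an underscore is used rather than a dot because Elasticsearch interprets dots in field names as object paths.

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


print(flatten({"field-1": {"field-1-1": 1, "field-1-2": 2}}))
# prints {'field-1_field-1-1': 1, 'field-1_field-1-2': 2}
```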
3. Reduce refresh frequency
If search results do not need to be near real time, you can lengthen the refresh interval to reduce the number of refreshes, but this also means higher memory usage.
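As a sketch, the refresh interval is controlled by the index setting index.refresh_interval (1s by default). The helper names and the value "30s" below are hypothetical, and the client is assumed to be a 7.x-style elasticsearch-py instance whose indices.put_settings method accepts a body keyword.

```python
def refresh_interval_settings(interval="30s"):
    """Settings body that makes new documents searchable every `interval`."""
    return {"index": {"refresh_interval": interval}}


def set_refresh_interval(client, index, interval="30s"):
    """Apply the setting; `client` is assumed to be an elasticsearch.Elasticsearch instance."""
    return client.indices.put_settings(
        index=index, body=refresh_interval_settings(interval)
    )
```

The equivalent REST request is PUT /my-index-000001/_settings with the same body.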
4. Reduce the number of replicas
Reducing the number of replicas increases indexing speed, at the cost of reduced availability.
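One possible way to limit the availability cost is to lower the replica count only while a heavy write job is running and restore it afterwards. The sketch below assumes that approach; the name replicas_reduced and its default values are hypothetical, and the client is assumed to be a 7.x-style elasticsearch-py instance. The setting itself, index.number_of_replicas, can be changed on a live index.

```python
from contextlib import contextmanager


@contextmanager
def replicas_reduced(client, index, during=0, after=1):
    """Temporarily lower number_of_replicas while a heavy write job runs."""
    client.indices.put_settings(
        index=index, body={"index": {"number_of_replicas": during}}
    )
    try:
        yield
    finally:
        # Restore the replica count even if the write job fails.
        client.indices.put_settings(
            index=index, body={"index": {"number_of_replicas": after}}
        )
```

Typical usage would wrap the bulk-indexing code in "with replicas_reduced(es_client, "my-index-000001"):". While the block is running, the index has less redundancy, which is the availability trade-off described above.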