Elasticsearch Search Optimization

1. Reserve enough memory for the filesystem cache

The bigger the filesystem cache, the better: Elasticsearch relies heavily on the filesystem cache to make searches fast. In general, leave at least half of the available memory to the filesystem cache so that Elasticsearch can keep the hot regions of the index in memory.
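
As a rough illustration (the 64 GB host and the heap size are hypothetical, not from the original article), the JVM heap can be capped in config/jvm.options so that the rest of the RAM stays available to the operating system's filesystem cache:

# config/jvm.options -- minimal sketch for a hypothetical 64 GB host:
# cap the heap (here ~31 GB, below the compressed-oops cutoff) and
# leave the remaining memory to the OS filesystem cache
-Xms31g
-Xmx31g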

 

2. Use faster hardware

Search is generally I/O bound, in which case you should:

 

Allocate more memory to the filesystem cache.

Use SSD drives.

Use local storage; do not use NFS, SMB, or other remote file systems.

3. Document modeling

Documents should be modeled so that search-time operations are as cheap as possible. Avoid joins: nested fields can make queries several times slower, and parent-child relations can make queries hundreds of times slower.
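
A minimal sketch of this idea, using a hypothetical orders index (none of these names come from the original article): instead of modeling the customer as a nested object or a parent document, the customer fields are denormalized into each order, so a plain bool query can be used (the term query relies on the .keyword subfield created by default dynamic mapping):

PUT orders/order/1
{
  "order_id": "1001",
  "customer_name": "Alice",
  "customer_city": "Beijing",
  "item": "laptop",
  "price": 5999
}

GET orders/order/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "customer_city.keyword": "Beijing" } },
        { "range": { "price": { "gte": 5000 } } }
      ]
    }
  }
}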

 

4. Pre-index data

Optimize how data is indexed for certain query patterns. For example, if all documents have a price field and most queries run range aggregations over a fixed set of ranges, those ranges can be "pre-indexed" into the documents and a terms aggregation used instead to speed up the aggregation.

 

For example, suppose a document originally looks like this:

 

PUT index/type/1
{
  "name": "apple",
  "price": 13
}

 

And searches run like this:

 

GET index/type/_search
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 150 }
        ]
      }
    }
  }
}

 

To optimize this, we enrich the document at index time with an additional price_range field, mapped as keyword:

 

PUT index/_mapping/type
{
  "properties": {
    "price_range": {
      "type": "keyword"
    }
  }
}

 

PUT index/type/1
{
  "name": "apple",
  "price": 13,
  "price_range": "10-50"
}

Search requests can then run a terms aggregation on this new field instead of a range aggregation on the price field.

 

GET index/type/_search
{
  "aggs": {
    "price_ranges": {
      "terms": {
        "field": "price_range"
      }
    }
  }
}

5. Field Mapping

Some fields hold numeric content, but that does not mean they must be mapped as numeric types. In general, a field that stores an identifier is better mapped as keyword.
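
For instance, a sketch of an index whose document IDs are only ever used for exact lookups (the index and field names are made up for illustration, and the mapping follows the single-type style used elsewhere in this article):

PUT articles
{
  "mappings": {
    "doc": {
      "properties": {
        "article_id": { "type": "keyword" },
        "word_count": { "type": "integer" }
      }
    }
  }
}

A term query or terms aggregation on article_id is then served by the keyword field, which is better optimized for identifier-style lookups than a numeric field (numeric types are optimized for range queries).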

 

6. Avoid using scripts

In general, you should avoid using scripts. If you must use them, prefer painless and expressions.
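
If a script really is needed, a minimal painless sketch might look like the following (the price field and the 0.9 factor are illustrative assumptions; the syntax assumes Elasticsearch 5.6 or later, where scripts use the source key):

GET index/type/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "doc['price'].value * 0.9"
        }
      }
    }
  }
}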

 

7. Search Optimization

When searching over date ranges, queries that use now usually cannot benefit from the query cache, because the matched range changes constantly. From a user-experience point of view, however, rounding the dates is usually acceptable, and it lets the query cache do its job.

 

For example, index a document and query it as follows:

 

PUT index/type/1
{
  "createAt": "2020-01-04 15:31:23.369"
}

 

GET index/type/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "createAt": {
            "gte": "now-1h",
            "lte": "now"
          }
        }
      }
    }
  }
}

 

This can be replaced with the following query:

 

GET index/type/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "createAt": {
            "gte": "now-1h/m",
            "lte": "now/m"
          }
        }
      }
    }
  }
}

 

In this example the dates are rounded to the minute, so if the current time is 15:32:23, the range query matches every createAt value between 14:32 and 15:32. If several users run queries containing this range at about the same time, the query cache can speed them up. The longer the interval used for rounding, the more the query cache helps; note, however, that rounding that is too coarse can hurt the user experience.

 

8. Force-merge read-only indices

For indices that have become read-only, running a force merge to combine the Lucene segments into a single segment can improve query speed. When an index has multiple Lucene segments, each segment is searched separately and the results are then merged; force-merging a read-only index down to a single segment streamlines this process, and index recovery is faster as well.

 

Old data in date-based rolling indices is generally no longer updated. Avoid writing continuously to one fixed index; instead, roll to new indices and tie them together with an alias or index wildcards. That way you can pick a quiet point in time to run force-merge, shrink, and similar operations on the cold daily indices (see the alias sketch below).
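
As a small sketch (the index and alias names are hypothetical), the _aliases API can keep one search alias pointing at the daily indices, so applications keep querying the alias while force-merge or shrink runs against the individual cold indices:

POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-2020.01.03", "alias": "logs_search" } },
    { "add": { "index": "logs-2020.01.04", "alias": "logs_search" } }
  ]
}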

 

8.1 force-merge operation

$ curl -X POST "http://127.0.0.1:9200/index/_forcemerge?max_num_segments=1"

To force-merge all indices:

 

$ curl -X POST "http://127.0.0.1:9200/_forcemerge?max_num_segments=1"

max_num_segments: sets the maximum number of segments. The smaller the number, the more noticeable the query-speed improvement, but the longer the merge takes.

 

8.2 Shrink operation

PUT source_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}

index.blocks.write: makes the index read-only (write operations are blocked).

 

Then shrink the index by executing _shrink:

 

POST source_index/_shrink/target_index
{
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  },
  "aliases": {
    "my_search_indices": {}
  }
}

 

9. Warm up global ordinals

Global ordinals are a data structure used to run terms aggregations on keyword fields. They represent each string value of the field with a number and assign a number to every bucket, which requires a build step that maps global ordinals to buckets. By default they are built lazily, because Elasticsearch does not know which fields will be used in terms aggregations and which will not. You can tell Elasticsearch to preload global ordinals at refresh time by configuring the mapping:

 

PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "city": {
          "type": "keyword",
          "eager_global_ordinals": true
        }
      }
    }
  }
}

10. execution_hint

There are two different mechanisms for running a terms aggregation:

by using the field values directly to aggregate data per bucket (map)

by using global ordinals of the field and assigning one ordinal to each bucket (global_ordinals)

Elasticsearch uses global_ordinals as the default option for keyword fields. It allocates bucket ordinals dynamically, so memory usage is linear in the number of field values that take part in the aggregation. In most cases this approach is fast. When a query matches only a small number of documents, consider using map instead. By default, map is only used when the aggregation runs on script-generated values, because those have no ordinals.

 

GET index/type/_search
{
  "aggs": {
    "city": {
      "terms": {
        "field": "city",
        "execution_hint": "map"
      }
    }
  }
}

11. Warm up the filesystem cache

If the host running Elasticsearch restarts, the filesystem cache will be empty and searches will be slower. You can use the index.store.preload setting to tell the operating system, by file extension, which files should be eagerly loaded into memory.

 

For example, in the elasticsearch.yml configuration file:

 

index.store.preload: ["nvd","dvd"]

Or in the settings when creating an index:

 

PUT index
{
  "settings": {
    "index.store.preload": ["nvd","dvd"]
  }
}

If the filesystem cache is not large enough to hold all the data, preloading too many files into it will slow searches down, so use this setting with caution.

 

12. Use preference to optimize cache utilization

Several caches help improve search performance, such as the filesystem cache, the request cache, and the query cache.

 

However, all of these caches are maintained at the node level. This means that if you run the same request twice, the index has one or more replicas, and round-robin (the default routing algorithm) is used, the two requests may be forwarded to different shard copies, which prevents the node-level caches from helping.

 

Since the users of a search application tend to issue similar requests one after another, routing similar search requests to the same nodes helps make better use of these caches.
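
For example, a sketch using the preference parameter with a hypothetical per-user value, so that requests from the same user keep hitting the same shard copies:

GET index/type/_search?preference=user_123
{
  "query": {
    "match": { "name": "apple" }
  }
}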

 

13. Tune batched_reduce_size in search requests

This is a parameter of the search request. By default, the aggregation step on the coordinating node has to wait until all shards have returned their results. With the batched_reduce_size parameter the coordinating node does not have to wait for every shard: as soon as the specified number of shard results has arrived, it can already perform a partial reduce. This keeps the coordinating node from holding all shard results in memory while it waits, which in extreme cases could lead to OOM. The default value is 512, and the parameter is supported since Elasticsearch 5.4.
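
For example, a sketch that lets the coordinating node reduce shard results in batches of 256 (the value 256 is only illustrative; the aggregation reuses the earlier price_range example):

GET index/type/_search?batched_reduce_size=256
{
  "aggs": {
    "price_ranges": {
      "terms": { "field": "price_range" }
    }
  }
}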

 

14. Replicas help throughput

Besides improving resilience, replicas can also help improve throughput, because search requests are distributed across the shard copies in round-robin fashion.
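
As a sketch, the replica count can be raised on an existing index (2 is an illustrative value; extra replicas only help if the cluster has enough nodes and memory to host them):

PUT index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}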

 

15. Enable adaptive replica selection (ARS) to improve response times

ARS ranks each shard copy with a formula based on the C3 algorithm:

Ψ(s) = R(s) - 1/µ(s) + (q̂(s))³ / µ(s), where q̂(s) = 1 + os(s) × n + q(s)

The terms in the formula are:

 

os(s): the number of outstanding (not yet completed) search requests to the node

n: the number of data nodes in the system

R(s): the EWMA of the response time, in milliseconds

q(s): the EWMA of the number of search tasks waiting in the search thread-pool queue

µ(s): the EWMA of the search service time on the data node, in milliseconds

With this information Elasticsearch can roughly assess the load on each node and the health of each shard replica.

 

ARS is supported since version 6.1 and is disabled by default; starting with Elasticsearch 7.0 it is enabled by default. You can enable it dynamically with the following command:

 

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}

 

 
