Elasticsearch tuning in practice

1. Memory

Both Elasticsearch and Lucene are written in Java, which means that we must pay attention to the heap memory settings.

The more heap memory Elasticsearch has available, the more it can use for filters and other caches, which can further improve query performance.

Note, however, that an overly large heap can cause long garbage collection pauses. Do not set the maximum heap size above the threshold at which the JVM stops using compressed object pointers (compressed oops). The exact cutoff varies, but it should not exceed 32 GB.

Common memory configuration pitfall 1: setting the heap too large

Example: on an Elasticsearch host with 64 GB of memory, the heap should not be set to 64 GB.

Doing so ignores the other big consumer of memory on the machine: the OS file cache.

Lucene is designed to let the underlying operating system cache its data structures in memory. Lucene segments are stored in individual files.

Since segments are immutable, these files will never change. This makes them very easy to cache, and the underlying operating system is happy to keep the hot segments in memory to speed up access.

These segment files include the inverted index (for full-text search) and doc values (for aggregations). Lucene's performance depends on this interaction with the OS file cache.

If you allocate all available memory to Elasticsearch's heap, there will be no free space left in the OS file cache. This can severely affect performance.

The official standard recommendation is to allocate 50% of the available memory (and no more than 32 GB; the generally recommended maximum is 31 GB) to the Elasticsearch heap, and leave the remaining 50% for the OS file cache used by Lucene.

The Elasticsearch heap can be configured in the following ways:

  • Method 1: heap memory configuration file jvm.options

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms16g
-Xmx16g
  • Method 2: set the heap at startup via an environment variable

ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch
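
To confirm the heap size that actually took effect after startup, one option is the cat nodes API; a quick sketch (the column list is just a convenient selection):

GET _cat/nodes?v&h=name,heap.max,heap.percent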

2. CPU

Running complex cached queries and writing data intensively both require a lot of CPU, so choosing the right query types and a gradual write strategy is very important.

A node uses multiple thread pools to manage memory consumption. The queues associated with these thread pools hold pending requests (acting as a buffer) instead of discarding them.

Because Elasticsearch sizes its thread pools dynamically, changing thread pool and queue sizes is not recommended unless there is a very specific requirement.
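
If you suspect requests are backing up or being rejected, the thread pool cat API gives a quick view; a sketch (the pools and columns shown are just an example selection):

GET _cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected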

Recommended reading:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html

3. Number of shards

Sharding is the unit by which Elasticsearch distributes data within the cluster. How quickly the cluster rebalances after a failure depends on shard size, shard count, network, and disk performance.

In Elasticsearch, each query is executed in a single thread for each shard. However, multiple shards can be processed in parallel. Multiple queries and aggregations for the same shard can also be processed in parallel.

This means that without involving caching, the minimum query latency will depend on three factors: data, query type, and shard size.

3.1 Many small shards vs. a few large shards?

  • With many small shards, each shard responds quickly, but the requests still have to be queued and the results processed in order, so the query is not necessarily faster than one against a small number of large shards.

  • With multiple concurrent queries, a large number of small shards will also reduce query throughput.

This raises the question addressed below: how should the number of shards be set?

3.2 Setting the number of shards

Choosing the correct number of shards is a complicated problem, because the number of documents is generally not known exactly during the cluster planning stage and before data writing begins.

As the number of shards grows, indexing and shard management can overload the master node, possibly making the cluster unresponsive or even bringing it down.

Recommendation: allocate sufficient resources to the master node to deal with problems that may be caused by too many shards.

It must be emphasized that the number of primary shards is fixed when the index is created; unlike the number of replicas, it cannot be changed dynamically through the update settings API. After the index is created, the only way to change the number of primary shards is to create a new index and reindex the original data into it.
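
A minimal sketch of that workflow (index names and settings are purely illustrative):

PUT my_index_v2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

POST _reindex
{
  "source": { "index": "my_index_v1" },
  "dest": { "index": "my_index_v2" }
}

Pointing an alias at the index from the start makes this kind of swap transparent to client applications.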

The official rule-of-thumb recommendation: keep each shard between roughly 30 GB and 50 GB of data.

Recommended reading 1: How many shards should an Elasticsearch index have?

https://elastic.blog.csdn.net/article/details/78080602

Recommended reading 2: How to allocate index shards reasonably in Elasticsearch

https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index

4. Replicas

Elasticsearch uses replicas to achieve high availability. Data is replicated between data nodes so that each primary shard has a backup; even if some nodes go offline unexpectedly, data is not lost.

By default, the number of replicas is 1, but it can be increased according to the product's high-availability requirements. The more replicas, the better the data can tolerate failures.

Another advantage of more replicas is that replica shards can also serve searches, which can help query performance.

Reminders from Ming Yi:

  • Whether increasing the replica count actually improves query performance should be verified with a test on your own cluster; in my tests the effect was not obvious.

  • Increasing the replica count multiplies disk usage (one extra full copy of the data per replica), which puts pressure on disk space and budget.

Recommendation: set the replica count based on the actual business situation. For ordinary scenarios (not requiring extreme availability), one replica is sufficient.
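
Unlike the primary shard count, the replica count can be changed at any time. A minimal sketch (the index name is illustrative):

PUT my_index/_settings
{
  "index": { "number_of_replicas": 1 }
}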

5. Hot-cold cluster architecture configuration

Depending on the characteristics and requirements of the business data, we can divide it into hot data and cold data; this is the premise of a hot-cold cluster architecture.

Indices with higher access frequency can be allocated to more, higher-spec data nodes (e.g. with SSDs), while indices with lower access frequency can be allocated to lower-spec data nodes (e.g. with mechanical disks).

The hot-cold architecture is particularly useful for data such as application logs or other time-series data collected in real time.

Data migration strategy: scheduled tasks can move indices to the appropriate node type on a regular basis.

Specific implementation: the Curator tool, or ILM (index lifecycle management).
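
Both approaches ultimately rely on shard allocation filtering. A minimal sketch, assuming a custom node attribute named box_type (the attribute values and index name are illustrative): tag each node in elasticsearch.yml, then retarget an index as it cools down.

# elasticsearch.yml on a hot node
node.attr.box_type: hot

# elasticsearch.yml on a warm/cold node
node.attr.box_type: warm

# Move an aging index onto the warm/cold tier
PUT logs-2021.03.01/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}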

5.1 Hot Node

A hot node is a specific type of data node whose indices hold the most recent, newest, hottest data.

Because hot data tends to be queried most frequently, operations on it consume a lot of CPU and IO resources, so the corresponding servers need to be powerful (high spec) and backed by SSD storage.

For large-scale cluster scenarios, it is recommended to run at least 3 hot nodes to achieve high availability.

Of course, this is also related to the amount of data written and queried by your actual business. If the amount of data is very large, you may need to increase the number of hot nodes.

5.2 Cold node (or warm node)

Cold nodes are the counterpart of hot nodes: data nodes designed to hold large volumes of read-only index data that is queried infrequently.

Since these indexes are read-only, cold nodes tend to use ordinary mechanical disks instead of SSD disks.

As with hot nodes, at least 3 cold nodes are recommended for high availability.

It should also be noted that if the cluster size is very large, more nodes may be required to meet the performance requirements.

You may even need more tiers, such as hot nodes, warm nodes, and cold nodes.

To emphasize: the final CPU and memory allocation should be determined by performance testing against a production-like environment with a tool such as esrally, rather than by blindly copying whatever "best practice" numbers you can find.

For more detailed information about hot and warm nodes, see:

https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

6. Node role division

Elasticsearch nodes can be divided into three core categories: master nodes, data nodes, and coordinating nodes.

6.1 Master node

Master node: if a node is only master-eligible (a candidate master) and does not also take on the data role, its hardware requirements are modest, because it does not store any index data.

As mentioned earlier, if there are many shards, it is recommended to give the master node a better hardware configuration.

Responsibilities of the master node: maintaining cluster state, managing shard allocation, and so on.

Also note that a cluster should have multiple master-eligible nodes to avoid split-brain problems.

6.2 Data Node

Data node responsibilities: CRUD, search, and aggregation operations.

These operations are generally IO, memory, and CPU intensive.

6.3 Coordination node

Coordinating node responsibilities: similar to a load balancer; the main job is to distribute search tasks to the relevant data nodes, collect all the results, merge them, and return them to the client application.

6.4 Node Configuration Reference

The following table is adapted from an official blog presentation:

Role | Description | Storage | RAM | Compute | Network
Data node | Store and retrieve data | Extremely high | High | High | Medium
Master node | Manage cluster state | Low | Low | Low | Low
Ingest node | Transform incoming data | Low | Medium | High | Medium
Machine learning node | Machine learning | Low | Extremely high | Extremely high | Medium
Coordinating node | Forward requests and merge search results | Low | Medium | Medium | Medium

6.5 Configuring the different node roles

Configure the following in elasticsearch.yml:

  • Master node

node.master: true
node.data: false
  • Data node

node.master: false
node.data: true
  • Coordinating node

node.master: false
node.data: false
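
Note that these boolean flags apply to older releases; from version 7.9 onward they are deprecated in favor of node.roles. A rough equivalent, for illustration:

# Dedicated master-eligible node
node.roles: [ master ]

# Data node
node.roles: [ data ]

# Coordinating-only node
node.roles: [ ]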

7. Troubleshooting tips

The performance of Elasticsearch depends to a large extent on the resources of the host.

CPU, memory usage, and disk IO are the basic indicators of each Elasticsearch node.

It is recommended that you check the Java Virtual Machine (JVM) metrics when the CPU usage surges.

7.1 High heap memory usage

High heap memory pressure affects cluster performance in two ways:

7.1.1 Heap memory pressure rises to 75% or higher

There is less remaining free memory, and the cluster now needs to spend some CPU resources to reclaim the memory through garbage collection.

While garbage collection is running, those CPU cycles are not available for processing user requests, so as the system becomes more resource-constrained, response times for user requests increase.

7.1.2 Heap memory pressure keeps rising and approaches 100%

A more aggressive form of garbage collection will be used, which in turn will greatly affect the cluster response time.

Index response time metrics indicate that high heap memory pressure can severely affect performance.
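
A quick way to watch per-node heap pressure (filter_path merely trims the response; the fields are standard node-stats paths):

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent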

7.2 Growth of non-heap memory usage

Growth of the JVM's off-heap memory usage eats into the memory available for the page cache and may even trigger the kernel's OOM killer.

7.3 Monitoring Disk IO

Because Elasticsearch makes heavy use of the storage layer, disk IO monitoring is the foundation for all other optimizations. Spotting disk IO problems and adjusting the related business operations can prevent bigger issues.

The countermeasures should be evaluated based on the situation that caused the disk IO. Common practical strategies for optimizing disk IO are as follows:

  • Optimize the number of shards and their size

  • Segment merge strategy optimization

  • Replace ordinary disk with SSD disk

  • Add more nodes

7.5 Set up alerting sensibly

For applications that rely on search, the user experience is related to the waiting time for search requests.

There are many factors that affect query performance, such as:

  • Unreasonable way of constructing query

  • Unreasonable Elasticsearch cluster configuration

  • JVM memory and garbage collection issues

  • Disk IO, etc.

Query latency is a metric that directly affects user experience, so make sure to set up alerts on it.

This kind of problem has come up in real production incidents.

How can such problems be avoided? The following two core settings are a useful reference:

PUT _cluster/settings
{
  "transient": {
    "search.default_search_timeout": "50s",
    "search.allow_expensive_queries": false
  }
}

Note that "search.allow_expensive_queries" is only available from version 7.7 onward; earlier versions will report an error.

Recommended reading:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html

7.6 Reasonable configuration of cache

By default, most filters in Elasticsearch are cached.

This means that when a filtering query is executed for the first time, Elasticsearch will find documents that match the filter and use that information to build a structure called "bitset".

The bitset records, for each document identifier, whether that document matches the filter.

Subsequent execution of the query with the same filter will reuse the information stored in the bitset, thereby speeding up the execution of the query by saving IO operations and CPU cycles.

It is recommended to use filter clauses (filter context) in queries wherever scoring is not needed.
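
A minimal sketch of a query that puts non-scoring conditions into filter context (the index and field names are purely illustrative):

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ],
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "publish_date": { "gte": "2021-01-01" } } }
      ]
    }
  }
}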

7.7 Set the refresh frequency reasonably

The refresh frequency (refresh_interval) and segment merge frequency are closely tied to indexing performance, and they also affect the performance of the whole cluster.

The refresh frequency needs to be set reasonably according to business needs, especially business scenarios that are frequently written.
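
The default is 1s; for write-heavy indices a longer interval trades search freshness for indexing throughput. A minimal sketch (index name and value are illustrative; "-1" disables periodic refresh entirely, e.g. during a bulk load):

PUT my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}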

7.8 Enable the slow query log

Enabling slow query logging will help identify which queries are slow and what steps can be taken to improve them. This is particularly useful for wildcard queries.
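
Slow logs are configured per index via threshold settings; a sketch with example thresholds (tune them to your own latency expectations, the index name is illustrative):

PUT my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}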

7.9 Increase ulimit size

Increasing the ulimit for the maximum number of open files is a very common and necessary setting.

Set under /etc/profile:

ulimit -n 65535
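
Whether the limit took effect can be checked from Elasticsearch itself (filter_path only trims the response):

GET _nodes/stats/process?filter_path=**.max_file_descriptors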

7.10 Configure swapping (memory locking) sensibly

When the operating system decides to swap out unused application memory, Elasticsearch performance can suffer badly.

Configure the following in elasticsearch.yml (in versions before 5.0 this setting was named bootstrap.mlockall):

bootstrap.memory_lock: true
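
Whether the lock was actually acquired can be verified through the nodes info API:

GET _nodes?filter_path=**.mlockall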

7.11 Disallow deleting indices via wildcards

Deleting all indices with a wildcard should be forbidden.

To ensure that no one can issue a DELETE against all indices (* or _all), set the following:

PUT /_cluster/settings
{
  "persistent": {
    "action.destructive_requires_name": true
  }
}

Now, if we try to delete indices with a wildcard, for example:

DELETE join_*

The error will be as follows:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Wildcard expressions or all indices are not allowed"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Wildcard expressions or all indices are not allowed"
  },
  "status" : 400
}

8. Common indicator monitoring API

8.1 Cluster Health Status API

GET _cluster/health?pretty

8.2 Index Information API

GET _cat/indices?pretty&v

8.3 Node Status API

GET _nodes?pretty

8.4 Master node information API

GET _cat/master?pretty&v

8.5 Shard allocation and index statistics API

GET _stats?pretty

8.6 Node Status Information Statistics API

Returns per-node statistics, such as JVM, HTTP, and IO metrics.

GET _nodes/stats?pretty

Most monitoring tools (such as Kibana, cerebro, etc.) can collect and aggregate these Elasticsearch metrics.

It is recommended to use such tools to continuously monitor cluster status information.

9. Summary

Elasticsearch ships with good defaults that let newcomers get up and running quickly. However, once you move to a real production environment, you must spend time adjusting the settings to meet the actual functional and performance requirements of the business.

It is recommended that you refer to the suggestions in this article and modify the relevant configuration in conjunction with official documents to optimize the overall deployment of the cluster.

Source: blog.csdn.net/weixin_42073629/article/details/115038559