Segment merging in Elasticsearch

From the previous article, we already know that in Elasticsearch each shard is refreshed every second by default, and every refresh creates a new segment. At this rate the number of segments explodes over time, and too many segments is a real problem: each segment consumes file handles, memory, and CPU, and, more importantly, every search request must visit every segment, so the more segments there are, the slower searches become.
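To see why merging keeps the segment count manageable, here is a toy simulation (not Elasticsearch internals; all names are illustrative): every refresh adds one small segment, and a Lucene-style tiered merge combines groups of equal-sized segments into one larger segment.

```python
def simulate_segments(refreshes, merge_factor=10):
    """Return the segment sizes left after `refreshes` refreshes,
    merging whenever `merge_factor` same-sized segments accumulate."""
    segments = []  # each entry is one segment's size, in documents
    for _ in range(refreshes):
        segments.append(1)  # a refresh creates one small segment
        merged = True
        while merged:
            merged = False
            for size in sorted(set(segments)):
                if segments.count(size) >= merge_factor:
                    # merge merge_factor small segments into one big one
                    for _ in range(merge_factor):
                        segments.remove(size)
                    segments.append(size * merge_factor)
                    merged = True
                    break

    return segments

# One hour of one-second refreshes would mean 3600 segments without
# merging, but only a handful with merging.
print(len(simulate_segments(3600)))
```

With a merge factor of 10 this behaves like carrying in base 10, so an hour of refreshes leaves only a few segments instead of thousands.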

So how does Elasticsearch solve this problem? It runs a background process responsible for merging segments: small segments are merged into larger ones, and those larger segments are merged again in turn. During a merge, documents that have been marked as deleted are not copied into the new, larger segment. None of this requires our intervention; Elasticsearch completes it automatically while indexing and searching. The segments selected for merging can be committed segments already on disk or segments still in memory that have not yet been committed:

(1) While indexing, the refresh process creates a new segment every second and opens it so that it is visible to search.

(2) The merge process selects some small segments in the background and merges them into a larger segment; this does not interrupt the current indexing and search functions.


(3) Once the merge completes, the old segments are deleted. The process is as follows:

3.1 The new segment is flushed to disk.

3.2 A new commit point file is written that includes the new segment and excludes the old, smaller segments that were merged into it.

3.3 The new segment is opened for search.

3.4 The old segments are deleted.
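The steps above can be sketched as follows (made-up names, not real Elasticsearch code). Each segment is modeled as a list of (doc_id, deleted) pairs, and the merge copies only live documents into the new segment:

```python
segments_on_disk = {
    "seg_1": [("doc1", False), ("doc2", True)],   # doc2 marked deleted
    "seg_2": [("doc3", False)],
    "seg_3": [("doc4", True), ("doc5", False)],   # doc4 marked deleted
}
commit_point = {"segments": ["seg_1", "seg_2", "seg_3"]}

# 3.1 The new, merged segment is flushed to disk; deleted docs are dropped.
segments_on_disk["seg_4"] = [
    doc
    for name in commit_point["segments"]
    for doc in segments_on_disk[name]
    if not doc[1]
]
# 3.2 A new commit point lists the new segment and excludes the old ones.
old_segments = commit_point["segments"]
commit_point = {"segments": ["seg_4"]}
# 3.3 The new segment is now the one opened for search.
searchable = [doc for name in commit_point["segments"]
              for doc in segments_on_disk[name]]
# 3.4 The old segments are deleted.
for name in old_segments:
    del segments_on_disk[name]

print(searchable)  # the deleted doc2 and doc4 are gone for good
```

Note how the deletes are only reclaimed at merge time: until then they merely sit in the old segments marked as deleted.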


At this point, the documents that were originally only marked as deleted are actually cleaned up. Left uncontrolled, merging large segments would consume a lot of I/O and CPU and would also hurt search performance, so by default Elasticsearch throttles the merge process so that it does not impact search too much.

The API is as follows:

PUT /_cluster/settings
{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb"
    }
}

Or to remove the restriction entirely:

PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "none" 
    }
}

The Elasticsearch API also provides a command to force-merge segments from the outside: optimize. It can force the segments of a shard to be merged down to a specified number, controlled by the max_num_segments parameter. The fewer the segments, the better the search performance, so an index is usually optimized down to a single segment.

Note that optimize should not be used on a frequently updated index; for such indexes the default background merge process is the optimal strategy. optimize is intended for static indexes, i.e. indexes that receive no more writes and only serve queries. For example, log indexes are typically created per day, week, or month; once that day, week, or month has passed, the index receives essentially no more writes. At that point we can use optimize to force each shard of the index down to a single segment, which greatly improves query performance. The API is as follows:

POST /logstash-2014-10/_optimize?max_num_segments=1 
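As a small sketch of how you might target such a time-based index, the snippet below builds the optimize request for last month's log index, which no longer receives writes. The logstash-YYYY-MM naming follows the example above; adjust the pattern to your own indexes.

```python
from datetime import date, timedelta

# Find last month's index: step back one day from the first of this month.
first_of_month = date.today().replace(day=1)
last_month = first_of_month - timedelta(days=1)

index = f"logstash-{last_month:%Y-%m}"
request = f"POST /{index}/_optimize?max_num_segments=1"
print(request)
```

You would then send this request to the cluster with your HTTP client of choice, ideally during off-peak hours.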

Note that an optimize triggered from the outside is not resource-throttled: it will use as much I/O as the system can provide, which may leave searches unresponsive for a period of time. So if you plan to optimize a large index, you should use the shard allocation feature to move the index to a designated node first, to ensure that the merge does not affect other services or the performance of the cluster itself.
