Elasticsearch write optimization

Contents
Elasticsearch optimization: write optimization
1. Adjusting the translog flush interval
2. Index refresh interval (refresh_interval)
3. Segment merge optimization
4. Indexing buffer
5. Using bulk requests
5.1 Bulk thread pool and queue
5.2 Executing bulk requests concurrently
6. Balancing tasks across disks
7. Balancing tasks across nodes
8. Indexing process adjustments and optimization
8.1 Automatically generated doc IDs
8.2 Adjusting field mappings
8.3 _source field adjustments
8.4 Disabling the _all field
8.5 Disabling norms on analyzed fields
8.6 index_options settings
9. Reference configuration

Elasticsearch's default settings balance write speed against factors such as data reliability and search real-timeness. When you deviate from the defaults in pursuit of maximum write speed, reliability and search real-timeness are often the price paid. Sometimes the business has low requirements for real-time search and data reliability but demanding write requirements; in that case, some strategies can be adjusted to maximize write speed.
The optimizations below assume a normally running cluster. If you are bulk-importing data into a cluster for the first time, you can set the number of replicas to 0 and restore it after the import finishes, saving the cost of replicating data during indexing.

In summary, write speed can be improved from the following angles:

Increase the translog flush interval, to reduce iops and write blocking.
Increase the index refresh interval; besides reducing I/O, more importantly this reduces the frequency of segment merges.
Tune bulk requests.
Balance tasks across disks, distributing shards as evenly as possible over the host's physical disks.
Balance tasks across nodes, distributing work as evenly as possible over the nodes.
Optimize the Lucene-level indexing process to reduce CPU usage and I/O, for example by disabling the _all field.
1. Adjusting the translog flush interval
Starting with ES 2.x, the default translog persistence policy is to flush on every request. The corresponding configuration item is:

index.translog.durability: request
This is the biggest factor affecting ES write speed, but it is also the only way to make writes reliable. If the system can accept some probability of data loss (for example, data is written successfully to the primary shard but has not yet been replicated to the replica shards when the host loses power; since the data was neither flushed to Lucene nor had its translog flushed to disk, there is no translog to recover it from, and the data is lost), then change the translog persistence policy to flush periodically and after a certain amount of data, for example:

# "async" means the translog is flushed to disk periodically, at the interval given by sync_interval
index.translog.durability: async
# increase the translog flush interval; the default is 5s, and it must not be less than 100ms
index.translog.sync_interval: 120s
# a translog larger than this triggers a flush, generating a new Lucene segment; the default is 512MB
index.translog.flush_threshold_size: 1024MB
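For an existing index, the same settings can also be applied through the index settings API; a sketch (my_index is a placeholder, and depending on the ES version some of these settings may only be settable at index creation):

```
PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "120s",
    "index.translog.flush_threshold_size": "1024mb"
}
```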

 
2. Index refresh interval (refresh_interval)
By default refresh_interval is 1 second, which means written data becomes searchable within one second. Every index refresh generates a new Lucene segment, which leads to frequent segment merge activity. If such real-time search is not required, the refresh frequency should be lowered, for example:

index.refresh_interval: 120s
or

PUT /my_index
{
    "settings": {
        "refresh_interval": "120s"
    }
}
 
3. Segment merge optimization
Segment merging is expensive in system I/O and memory. Starting with ES 2.0, merge behavior is no longer throttled by ES but is controlled by Lucene. The following configuration items were therefore removed:

indices.store.throttle.type 
indices.store.throttle.max_bytes_per_sec 
index.store.throttle.type 
index.store.throttle.max_bytes_per_sec
 
and tuning has shifted to the following settings:

index.merge.scheduler.max_thread_count: 4
index.merge.policy.*
The default value of max_thread_count is:

max_thread_count = Math.max(1, Math.min(4, Runtime.getRuntime().availableProcessors() / 2))
This default is ideal for SSDs. If you are using spinning disks rather than SSDs, set it to 1: because of seek overhead, concurrent writes to rotating media only lower write speed.
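The default formula above can be sketched in Python as follows (the same computation, shown for a couple of core counts):

```python
def default_max_thread_count(processors):
    # Same formula as above: at least 1 thread, at most 4,
    # otherwise half the available processors.
    return max(1, min(4, processors // 2))

print(default_max_thread_count(16))  # 4 (capped even on many-core hosts)
print(default_max_thread_count(2))   # 1
```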

index.merge.policy supports three merge strategies:

tiered (default policy);
log_byte_size;
log_doc.
Detailed descriptions of each policy can be found in the official documentation.

We currently use the default policy, with some of its parameters adjusted.

The merge policy is fixed when the index is created and cannot be changed afterwards, but the policy parameters can be updated dynamically. If segments frequently pile up and merge heavily, try adjusting the following settings:

index.merge.policy.segments_per_tier

This setting specifies the number of segments allowed per tier: the smaller the value, the fewer segments there are and the more merging is required, so consider increasing it. The default is 10, and it should be greater than or equal to index.merge.policy.max_merge_at_once.

index.merge.policy.max_merged_segment

This specifies the maximum size of a single segment; the default is 5GB, and you can consider lowering it.

4. Indexing buffer
The indexing buffer is used while indexing docs; when it fills up, its contents are flushed to disk and a new segment is generated. This is the other occasion, besides refresh_interval refreshes, on which new segments are created. Each shard has its own indexing buffer, and the sizes below are divided by the number of shards on the node:

indices.memory.index_buffer_size: defaults to 10% of the whole heap.

indices.memory.min_index_buffer_size: defaults to 48MB.

indices.memory.max_index_buffer_size: unlimited by default.

Under heavy indexing load, the default indices.memory.index_buffer_size may be insufficient. This value is related to the available heap memory and the number of shards on the node, so consider increasing it.
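As a rough illustration (plain arithmetic, not an ES API), the effective buffer per shard can be estimated from the heap size and the shard count on the node:

```python
def per_shard_buffer_mb(heap_mb, shards, pct=0.10, min_mb=48, max_mb=None):
    # index_buffer_size (default 10% of heap), clamped by the
    # min/max settings, then divided among the shards on the node.
    buf = max(heap_mb * pct, min_mb)
    if max_mb is not None:
        buf = min(buf, max_mb)
    return buf / shards

# 31 GB heap, default 10% buffer, 20 shards -> ~158.7 MB per shard
print(round(per_shard_buffer_mb(31 * 1024, 20), 1))
```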

5. Using bulk requests
Bulk writes are far more efficient than indexing documents one at a time, but note that the total byte size of a bulk request should not be too large; oversized requests put memory pressure on the cluster. It is best to keep each request under a few tens of megabytes, even if larger requests appear to execute better.
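A minimal sketch (function and parameter names are illustrative, not a client API) of splitting documents into NDJSON bulk bodies capped by size, so no single request grows too large:

```python
import json

def bulk_bodies(index, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk request bodies, each capped at roughly max_bytes."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        entry = len(action) + len(source) + 2  # plus two trailing newlines
        if lines and size + entry > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines += [action, source]
        size += entry
    if lines:
        yield "\n".join(lines) + "\n"

# Three tiny docs with an 80-byte cap -> three separate bodies
bodies = list(bulk_bodies("my_index", [{"n": i} for i in range(3)], max_bytes=80))
print(len(bodies))  # 3
```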

5.1 Bulk thread pool and queue
Indexing is a compute-intensive task, so a fixed-size thread pool should be used, with tasks that cannot be processed immediately placed in a queue. The maximum number of threads should be configured as the number of CPU cores + 1, which is the default configuration of the bulk thread pool; this avoids excessive context switching. The queue size can be increased, but it must be strictly bounded: an oversized queue raises GC pressure and can cause frequent full GCs.

5.2 Executing bulk requests concurrently
A bulk write request is a long-running task. To put enough write pressure on the system, writes should come from multiple clients, each running multiple threads in parallel. If you want to find the system's write capacity limit, the goal is to saturate the CPU; disk util, memory, and so on are usually not the bottleneck. If the CPU is not saturated, increase the number of concurrent writers. But watch for bulk rejections from the thread pool queue: a rejection means the bulk queue is full and a client request has been refused, in which case the client receives a 429 error (TOO_MANY_REQUESTS), and the client's strategy should be delayed retry. This error must not be ignored, or less data than expected will be written into the system. Even if the client handles 429 correctly, rejections should still be avoided. Therefore, when assessing the write capacity limit, the maximum client write concurrency should be the largest value that produces no rejections.
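The delayed-retry handling for 429 responses can be sketched as follows (send_bulk and the status codes are stand-ins, not a real client API):

```python
import time

def bulk_with_retry(send_bulk, body, max_retries=5, base_delay=0.5):
    """Retry a bulk request with exponential backoff when it is rejected (429)."""
    for attempt in range(max_retries + 1):
        status = send_bulk(body)
        if status != 429:  # TOO_MANY_REQUESTS
            return status
        # Delayed retry: back off before resending, never silently drop the data.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("bulk request still rejected after retries")

# Stub transport that rejects twice, then succeeds:
responses = iter([429, 429, 200])
print(bulk_with_retry(lambda body: next(responses), "...", base_delay=0))  # 200
```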

6. Balancing tasks across disks
If multiple disks are used by configuring multiple data paths in path.data, ES balances writes across the disks with one of two strategies:

Simple round-robin: at the system initialization stage, simple round-robin gives the most uniform result.
Available-space weighted round-robin: round-robin between disks, using available space as the weight.
7. Balancing tasks across nodes
To keep tasks as balanced as possible across nodes, the client writing the data should send bulk requests to each node in round-robin fashion. When sending data through the REST API's bulk interface, the client round-robins across the cluster nodes that were added when the client object was created.
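The round-robin distribution across nodes can be sketched like this (the hostnames are hypothetical; real clients usually do this internally):

```python
from itertools import cycle

# Hypothetical node addresses; real code would use the client's node list.
nodes = cycle(["es-node1:9200", "es-node2:9200", "es-node3:9200"])

def next_node():
    # Each bulk request targets the next node in turn, spreading
    # coordination work evenly across the cluster.
    return next(nodes)

print([next_node() for _ in range(4)])
# ['es-node1:9200', 'es-node2:9200', 'es-node3:9200', 'es-node1:9200']
```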

8. Indexing process adjustments and optimization
8.1 Automatically generated doc IDs
As can be seen from the ES write process, when a write specifies an external doc id, ES tries to read the original doc's version number to decide whether this is an update. That involves a disk read; letting ES generate the doc ID automatically avoids this step.
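In a bulk body this simply means omitting _id from the action line (a sketch; the index and id values are placeholders):

```python
import json

# With an explicit id, ES must first look up the existing doc's version:
with_id = json.dumps({"index": {"_index": "my_index", "_id": "user-42"}})
# Without an id, ES auto-generates one and skips the version lookup:
auto_id = json.dumps({"index": {"_index": "my_index"}})

print("_id" in with_id, "_id" in auto_id)  # True False
```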

8.2 Adjusting field mappings
Reduce the number of fields; do not write fields that need no indexing to ES.
For fields that need no indexing, set the index property to not_analyzed or no. Not analyzing a field, or not indexing it at all, saves a lot of computation and lowers CPU usage. This applies especially to binary content, which is very CPU-intensive to analyze by default, and analyzing it is usually meaningless.
Reduce the length of field content; if the original data contains large blocks that do not need indexing, cut down what is written.
Choose analyzers deliberately; different analyzers differ greatly in computational cost during indexing.
8.3 _source field adjustments
The _source field stores the document's original data. For fields that do not need to be stored, filter them with includes and excludes, or disable _source entirely; the latter is generally done when index and data are stored separately.
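For example, filtering stored fields with includes/excludes might look like this in the mapping (6.x mapping syntax with a type; the field names are placeholders):

```
PUT /my_index
{
    "mappings": {
        "my_type": {
            "_source": {
                "includes": ["title", "summary"],
                "excludes": ["raw_content"]
            }
        }
    }
}
```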

This reduces I/O pressure, but in real scenarios _source is rarely disabled, and even filtering out some fields does little for write speed: under full write load the CPU is basically maxed out, and CPU is the bottleneck.

8.4 Disabling the _all field
Starting with ES 6.0 the _all field is disabled by default, while in earlier versions it was enabled by default. The _all field contains the analyzed tokens of all other fields, so a search can cover every field without naming a specific one. ES 6.0 disables _all by default mainly for these reasons:

The _all field takes up a lot of space, because it must copy the values of all other fields.
_all has its own analyzer; for some queries (e.g., synonyms) the results do not match expectations, because the analyzers differ.
The duplicated data causes extra indexing overhead.
Its content is hard to inspect when debugging.
Some users do not even know the field exists, which leads to confusing queries.
There are alternatives (such as the copy_to parameter).
In versions before ES 6.0, you can disable the _all field in the mapping by setting enabled to false:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "_all": {
                "enabled": false
            }
        }
    }
}
 
Disabling the _all field noticeably reduces CPU and I/O pressure.

8.5 Disabling norms on analyzed fields
Norms are used to compute doc scores during search; if scoring is not needed, they can be disabled:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type": "keyword",
            "norms": false
        }
    }
}
 
8.6 index_options settings
index_options controls what information is added to the inverted index during indexing, such as doc numbers, term frequencies, positions, and offsets. Tuning these settings can to some extent reduce the computation done while indexing and save CPU.
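For example, a text field used only for match filtering, with no scoring and no phrase or proximity queries, can index only doc IDs (same 6.x mapping style; the field name is a placeholder):

```
PUT /my_index
{
    "mappings": {
        "my_type": {
            "properties": {
                "status": {
                    "type": "text",
                    "index_options": "docs"
                }
            }
        }
    }
}
```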

In practice, however, it is often hard to know whether the business will need this information later, unless it was clearly part of the design from the start.

9. Reference configuration
Index-level settings need to be written into an index template, or specified when the index is created.

{
    "template": "*",
    "order": 0,
    "settings": {
        // maximum size of a single segment
        "index.merge.policy.max_merged_segment": "2GB",
        // number of segments per tier; the smaller the value, the more merging; default 10,
        // must not be less than max_merge_at_once
        "index.merge.policy.segments_per_tier": "24",
        // interval at which the index is refreshed into the OS cache, default 1 second
        "index.refresh_interval": "120s",
        // translog durability policy, default "request", changed to periodic flushing
        "index.translog.durability": "async",
        // translog size threshold for flushing to disk (commit point), default 512MB
        "index.translog.flush_threshold_size": "512MB",
        // translog fsync interval, default 5s
        "index.translog.sync_interval": "120s",
        // delay before reassigning shards after a node leaves, default 1 minute
        "index.unassigned.node_left.delayed_timeout": "5d"
    }
}
 
Configuration in elasticsearch.yml:

# indexing buffer size; the default is 10% of the heap
indices.memory.index_buffer_size: 30%
 
Source: https://blog.csdn.net/dwjf321/article/details/103836211

 



Origin: blog.csdn.net/ailiandeziwei/article/details/104578417