ElasticSearch read/write internals and performance tuning

## 1. The underlying principles of reading and writing

How Elasticsearch writes data

1) The client sends a request to any node; that node becomes the coordinating node.
2) The coordinating node routes the document and forwards the request to the node that holds the corresponding primary shard.
3) The primary shard on that node processes the request, then synchronizes the data to the replica shards.
4) Once the coordinating node sees that the primary shard and all replica shards have completed the write, it returns the response to the client.

How Elasticsearch reads data

1) The client sends a request to any node; that node becomes the coordinating node.
2) The coordinating node routes the document and forwards the request to the corresponding node. A round-robin algorithm randomly picks one of the primary shard and its replicas, so that read requests are load-balanced.
3) The node that receives the request returns the document to the coordinating node.
4) The coordinating node returns the document to the client.

1. When a document is written, it is automatically assigned a globally unique id, the doc id, and it is routed to the corresponding primary shard by hashing that doc id. You can also specify the doc id manually, for example an order id or a user id.

2. When reading a document, you can query by doc id: the doc id is hashed to determine which shard the document was allocated to, and the query is sent to that shard, as sketched below.
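As a minimal sketch of this routing behavior (the index name, id and fields below are made up for illustration), a write with an explicit doc id and the subsequent read both hash the same id to the same shard; by default ES picks the shard as hash(_routing) % number_of_primary_shards, where _routing defaults to the doc id:

```
# index a document with a manually specified doc id (hypothetical index/fields)
PUT order_index/_doc/order-1001
{
  "user_id": "u-42",
  "amount": 99.5
}

# the get is routed to the same shard via the hash of the doc id
GET order_index/_doc/order-1001
```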

How Elasticsearch searches data

Full-text search is where es is most powerful.

1) The client sends a request to a coordinating node.
2) The coordinating node forwards the search request to every shard of the index; either the primary shard or a replica shard will do.
3) Query phase: each shard returns its own results (really just doc ids and sort values) to the coordinating node, which merges, sorts and paginates them to produce the final result.
4) Fetch phase: the coordinating node then pulls the actual documents from each node according to those doc ids and returns them to the client.
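As a small illustration of the two phases (index and field names are made up), an ordinary paginated search makes every shard return its top from + size candidates in the query phase before the winning documents are fetched:

```
GET order_index/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match": { "user_id": "u-42" }
  }
}
```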

The underlying principle of search: the inverted index.

The underlying principles of how Elasticsearch writes data

1) Data is first written to the in-memory buffer; while it sits in the buffer it cannot be searched. At the same time the data is written to the translog log file.
2) When the buffer is nearly full, or after a fixed interval, the buffered data is refreshed into a new segment file. At this point the data does not go straight to a segment file on disk; it first enters the os cache. This process is called refresh. By default, every second es writes the data in the buffer to a new segment file, so every second a new segment file is produced, containing the data written to the buffer during that second. If the buffer happens to be empty, no refresh is performed and no empty segment file is created; if the buffer does contain data, a refresh (by default every second) flushes it into a new segment file. Inside the operating system, disk files sit behind something called the os cache, the operating system's page cache: before data is written to a disk file it first enters this operating-system-level memory cache. As soon as buffered data has been flushed into the os cache by a refresh, it can be searched.

Why is es called near real-time? NRT stands for near real-time. The default refresh interval is one second, so es is "quasi real-time": data becomes visible about one second after it is written.

You can also trigger a refresh manually through the RESTful API or the Java API, flushing the buffered data into the os cache so that it becomes searchable immediately.
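For example, a manual refresh of a (hypothetical) index, or forcing a single write to be visible immediately via the refresh parameter:

```
# flush the in-memory buffer of this index into the os cache right now
POST order_index/_refresh

# or make one write searchable as soon as the call returns
PUT order_index/_doc/order-1001?refresh=true
{
  "user_id": "u-42",
  "amount": 99.5
}
```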

Once the data has entered the os cache, the buffer is cleared: the buffer no longer needs to retain the data, because a copy has already been persisted to disk in the translog.


## 2. Performance tuning

System-level tuning

System-level tuning mainly means setting the memory properly and avoiding swap. After installation, ES defaults to a 1GB heap, which is obviously not enough. So how much memory should we give ES? That depends on how much memory the cluster nodes have and on whether other services are deployed on the same servers. If memory is plentiful, say 64GB or more, and no other services share the machines with the ES cluster, it is recommended to set the ES heap to 31GB-32GB, because 32GB is a performance threshold: giving the ES cluster more than 32GB of heap does not necessarily perform better, and can actually perform worse than 31GB-32GB. Another point when setting the cluster memory is to make the minimum heap size (Xms) and the maximum (Xmx) the same, to prevent the JVM from resizing the heap at runtime, which is a waste of resources.
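On ES 5.x and later the heap is configured in config/jvm.options; a minimal sketch for a dedicated 64GB node as described above would be:

```
# config/jvm.options: keep the minimum and maximum heap identical
-Xms31g
-Xmx31g
```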

Disable swap. Once memory is allowed to swap to disk, it causes fatal performance problems. Swap space is a piece of disk space that the operating system uses to hold memory pages it does not use often, so that more memory can be used as page cache. This usually improves system throughput and IO performance, but it also creates problems: pages that are frequently swapped in and out cause extra IO and operating-system interrupts, all of which hurt system performance. The higher the swappiness value, the more aggressively the operating system uses swap space. Setting bootstrap.memory_lock: true in elasticsearch.yml keeps the JVM memory locked and protects ES performance.
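A sketch of the setting in elasticsearch.yml; note that for memory_lock to take effect, the user running ES usually also needs an unlimited memlock ulimit (for example via /etc/security/limits.conf or the systemd unit):

```
# elasticsearch.yml: lock the JVM heap so it cannot be swapped out
bootstrap.memory_lock: true
```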

Shards and replicas

Shard: ES is a distributed search engine, so an index is usually split into several parts, and those parts (the shards) are distributed across different nodes. ES manages and organizes shards automatically and rebalances the shard data when necessary, so users generally do not need to worry about the details of sharding. The default number of shards when creating an index is 5, and it cannot be changed once the index is created.

Replica: by default ES creates one replica, i.e. on top of the five primary shards, each primary has one corresponding replica shard. Replicas have pros and cons: with replicas we get stronger failure recovery, but each replica also takes up a corresponding amount of extra disk space.

So when we create an index, how many shards and how many replicas should we create?

The number of replicas is easier to decide: it depends on how many nodes the cluster has and how much storage is available. If the cluster has many servers and plenty of storage space, you can set more replicas, usually one to three. If the cluster's storage space is tight, one replica is enough to guarantee disaster recovery (the number of replicas can be adjusted dynamically).

The number of shards is harder to determine, because once the shard count of an index is fixed it cannot be changed. So before creating the index we have to think carefully about how much data the index will store; an inappropriate shard count will badly hurt performance.

As for the magnitude of the shard count, the industry rule of thumb ties it to memory: 20-25 shards per 1GB of heap, and a single shard should not exceed 50GB; this configuration helps keep the cluster healthy. Personally I find this too rigid. When tuning ES clusters I first set the shard count according to the total data volume, keeping each shard under 50GB (around 40GB), but query performance did not improve noticeably compared with before. Later I tried increasing the number of shards, and query speed improved significantly, with the data per shard kept at roughly 10GB.

With many small shards, each shard processes its data faster, but that does not mean that the more shards we have, the faster queries get and the better ES performs. During a query there is a merge step across shards: if the shard count keeps growing, the merge time grows too, and more tasks have to be queued and processed in order, so many small shards are not necessarily faster than a smaller number of larger shards. With multiple concurrent queries, many small shards also reduce query throughput.

If your current shard count is inappropriate and you do not know how to adjust it, a good approach is to create indices by time and query them with wildcards: if a large amount of data arrives every day, create one index per day; if the data only grows large over a month, create one index per month. If you want to reindex the existing data, you need to rebuild the index; the shard count of the new index can be set by considering the data volume, the write pressure, the number of nodes and so on, and afterwards you should periodically check, as the data grows, whether the shard count is still reasonable.
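A sketch of the time-based pattern (index names are hypothetical): write each day into its own index and query across them with a wildcard:

```
# one index per day
PUT logs-2019.12.01
PUT logs-2019.12.02

# search all daily indices at once
GET logs-2019.12.*/_search
{
  "query": { "match_all": {} }
}
```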

The recommendation from Tencent Cloud's CES team is: for indices with a small data volume (100GB or less) where write and query pressure is relatively low, 3 to 5 shards are usually enough, with number_of_replicas set to 1 (one primary plus one replica, two copies of the data in total). For indices with a large data volume (100GB or more): keep the data per shard in the 20GB~50GB range so that indexing pressure is spread across multiple nodes; the index.routing.allocation.total_shards_per_node parameter can be used to force a limit on the number of shards of the index per node, so that the shards are spread across different nodes as much as possible. If the number of shards (not counting replicas) exceeds 50, the rejection rate is likely to rise; in that case consider splitting the index into several independent indices to share the data volume, and combine that with routing so that each query needs to touch fewer shards.
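A sketch of index settings following these guidelines (the index name and the exact numbers are illustrative, not a recommendation for any specific workload):

```
PUT logs_large_index
{
  "settings": {
    "number_of_shards": 30,
    "number_of_replicas": 1,
    "index.routing.allocation.total_shards_per_node": 2
  }
}
```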

Next I will introduce some key ES tuning parameters. A common scenario: the ES cluster's CPU usage is very high; how do we tune it? High CPU usage may be caused by writes or by queries, so how do we tell which? Start with GET _nodes/{node}/hot_threads to look at the thread stacks and see which threads are using the CPU: elasticsearch[{node}][search][T#10] means queries are the cause, elasticsearch[{node}][bulk][T#1] means writes are. In practice, when CPU usage is high, replace mechanical disks with SSDs (Solid State Disks); compared with mechanical disks, SSDs have higher read/write speed and better stability. If SSDs are not available, it is suggested to set index.merge.scheduler.max_thread_count: 1, limiting the maximum number of index merge threads to 1; this parameter can effectively help write performance, because due to seek overhead, writing to a spinning disk concurrently does not improve write performance, it only degrades it.
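A sketch of the diagnostic call and the merge-thread setting (the index name is illustrative; in recent versions the merge setting can be updated as a dynamic index setting, in older versions it may need to be set at index creation):

```
# which threads are burning CPU on the nodes?
GET _nodes/hot_threads

# limit merging to a single thread per shard on spinning disks
PUT my_index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
```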

There are several other important parameters that can be set; everyone can adjust them according to their own cluster and their requirements on data availability.

index.refresh_interval: this parameter controls how many seconds after data is written it becomes searchable; the default is 1s. Every refresh produces a new lucene segment, which leads to frequent merges. If the business does not demand such high real-time visibility, you can turn this parameter up; in my actual tuning this parameter really made a difference, and CPU usage dropped sharply.
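A sketch of relaxing the refresh interval on an existing (hypothetical) index:

```
PUT my_index/_settings
{
  "index.refresh_interval": "30s"
}
```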

indices.memory.index_buffer_size: if we have very heavy, highly concurrent write loads, it is best to turn indices.memory.index_buffer_size up. The index buffer is shared by all shards on the node, and each shard benefits from at most about 512MB, beyond which there is no significant performance gain. ES treats it as a buffer shared by every shard, and particularly active shards use more of it. The default value of this parameter is 10%, i.e. 10% of the jvm heap.

translog: to make sure data is not lost, ES flushes the translog to disk every time an index, bulk, delete or update request completes. This improves data safety but of course lowers performance somewhat. If you can accept a small risk of loss and want to prioritize performance, you can set the following parameters:

"index.translog": {
 "sync_interval": "120s",     #sync间隔调高
 "durability": "async",      # 异步更新
 "flush_threshold_size":"1g" #log文件大小
        }
复制代码
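A sketch of applying these translog settings to an existing (hypothetical) index through the settings API; durability and flush_threshold_size are dynamic, while depending on the ES version sync_interval may only be settable at index creation:

```
PUT my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "120s",
  "index.translog.flush_threshold_size": "1gb"
}
```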

This setting enables asynchronous flushing to disk and configures the flush interval and size threshold, which helps improve write performance.

Number of replicas

To make the shards of an es index spread evenly across the data nodes, the number of shards of the same index on a single data node should be no more than three. Formula: (number_of_shards * (1 + number_of_replicas)) < 3 * number_of_datanodes. The number of shards allocated per node can be limited with "index.routing.allocation.total_shards_per_node": "2".
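A quick worked example with made-up numbers: with number_of_datanodes = 10 and number_of_replicas = 1, the formula requires number_of_shards * (1 + 1) < 3 * 10, so the index should have at most 14 primary shards; index.routing.allocation.total_shards_per_node then enforces the per-node cap at allocation time.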

Disk cache parameters

vm.dirty_background_ratio: this parameter specifies the percentage of system memory that dirty pages in the file-system cache may reach (e.g. 5%) before the pdflush/flush/kdmflush background processes kick in and asynchronously write some of the cached dirty pages back to external storage.

vm.dirty_ratio

This parameter specifies the percentage of system memory that dirty pages in the file-system cache may reach (e.g. 10%) before the system is forced to start processing them (because there are so many dirty pages that some must be flushed to external storage to avoid data loss); during this process, many application processes may be blocked because the system is busy handling file IO.

Set this parameter to a suitably small value, following a similar principle to vm.dirty_background_ratio above. If the percentage of cached dirty data (as a proportion of MemTotal here) exceeds this setting, the system stops all application-layer write IO and waits until the data has been flushed before resuming IO. So if this threshold is triggered, the impact on users is very large.

```
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
```

To make these settings permanent, write the configuration entries above into /etc/sysctl.conf:

```
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
```
Merge-related parameters

```
"index.merge.policy.floor_segment": "100mb",
"index.merge.scheduler.max_thread_count": "1",
"index.merge.policy.min_merge_size": "10mb"
```
There are also some timeout parameters that can be set:

```
discovery.zen.ping_timeout      # timeout for discovering live nodes during master election
discovery.zen.fd.ping_interval  # how often nodes are pinged to check whether they are alive
discovery.zen.fd.ping_timeout   # how long to wait for a liveness response, default 30s; can be increased if the network is unreliable
discovery.zen.fd.ping_retries   # how many ping failures/timeouts mark a node as failed, default 3
```
Linux system configuration parameters
File handle

On Linux, the default maximum number of open file handles per process is 1024, which is clearly too small for a server process; increase the maximum number of open handles by modifying /etc/security/limits.conf:

```
* - nofile 65535
```
Read optimization

① Avoid large result sets and deep paging. As discussed in the cluster query process above, to return the page of results starting at offset from with page size size, each shard has to produce its top from + size documents by score. The coordinating node collects n × (from + size) documents, sorts them all, and then returns size documents starting at offset from. When from, size or n is large, the number of documents involved in the sort grows, and such queries consume a lot of CPU resources and become inefficient. To improve query efficiency, ES provides the Scroll and Scroll-Scan query modes. Scroll is designed for retrieving large numbers of results. For example, suppose we need to fetch documents 1 to 10,000, 100 per page. With ordinary search queries, each page requires every shard to return its top from + 100 documents by score; the coordinating node then collects and re-sorts n × (from + 100) documents and returns 100 documents starting at from, and this has to be repeated 100 times. With a Scroll query, 10,000 documents are queried on each shard, the coordinating node merges and sorts the n × 10,000 documents and takes a snapshot of the top 10,000; subsequent pages are served from that snapshot. The benefit is far fewer query and sort rounds.
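A sketch of the Scroll API (index name, scroll window and batch size are illustrative):

```
# open a scroll context kept alive for 1 minute between round-trips
POST my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

# fetch the next batch with the _scroll_id returned by the previous call
POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}
```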

Other suggestions

Use automatically generated ids when inserting: when writing data with a user-specified id, ES has to check whether a document with the same id already exists in the index, and this check gets more expensive as the number of documents grows. So unless the business has a hard requirement, it is recommended to let ES generate the id automatically, which speeds up writes.
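A sketch of the two write styles (index and field are made up); the first lets ES generate the id, the second forces the existence check described above:

```
# auto-generated id: no duplicate-id lookup is needed
POST my_index/_doc
{ "message": "hello" }

# caller-specified id: ES must first check whether id 42 already exists
PUT my_index/_doc/42
{ "message": "hello" }
```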

Avoid sparse indices: sparsity makes the index files larger. ES stores keyword and array types using the doc_values structure, so even when a field is null, every document still takes up some space; a sparse index therefore increases disk usage and reduces query and write efficiency.

Tuning parameters

```
index.merge.scheduler.max_thread_count: 1    # maximum number of index merge threads
indices.memory.index_buffer_size: 30%        # index buffer memory
index.translog.durability: async             # write the translog to disk asynchronously to speed up writes
index.translog.sync_interval: 120s           # translog sync interval
discovery.zen.ping_timeout: 120s             # heartbeat timeout
discovery.zen.fd.ping_interval: 120s         # node liveness check interval
discovery.zen.fd.ping_timeout: 120s          # ping timeout
discovery.zen.fd.ping_retries: 6             # heartbeat retry count
thread_pool.bulk.size: 20                    # number of write threads (our query threads are configured in code, so only the write threads are tuned here)
thread_pool.bulk.queue_size: 1000            # write thread queue size
index.refresh_interval: 300s                 # index refresh interval
bootstrap.memory_lock: true                  # keep the JVM memory locked to protect ES performance
```
On rebuilding an index

Before rebuilding an index, think carefully about whether it really needs to be rebuilt, because reindexing is very time consuming. ES's reindex api does not try to configure the target index and does not copy the settings of the source index, so we should set up the target index before running the _reindex operation, including the mapping, the shard count, the replicas and so on.

Step one: create the new index just like any ordinary index. When the data volume is large, adjust the refresh interval: set refresh_interval to -1, i.e. no refresh, and set number_of_replicas to 0 (since the number of replicas can be adjusted dynamically later, this helps speed things up).

```
{
  "settings": {
    "number_of_shards": "50",
    "number_of_replicas": "0",
    "index": { "refresh_interval": "-1" }
  },
  "mappings": {
  }
}
```

Step two: call the reindex interface. It is recommended to add the wait_for_completion=false parameter, so that reindex returns a taskId directly.

```
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "old_index",   // source index
    "size": 5000            // batch size per round
  },
  "dest": {
    "index": "new_index"    // target index
  }
}
```

Step three: wait. You can query the progress of the rebuild with GET _tasks?detailed=true&actions=*reindex. To cancel the task, call _tasks/node_id:task_id/_cancel.

Step four: delete the old index and free the disk space. One more note about rebuilding with a timestamp: suppose our data is 100GB, but that 100GB keeps growing while we reindex. In that case, when we rebuild the index we should record the timestamp at which the reindex started. The point of recording the timestamp is that the next reindex task does not have to rebuild everything; it only needs to reindex documents written after that timestamp. Iterate like this until the old and new indices hold essentially the same amount of data, then switch the data flow over to the new index name.

```
POST /_reindex
{
  "conflicts": "proceed",            // skip version conflicts (keeping the old index's data) instead of throwing an exception and stopping the task
  "source": {
    "index": "old_index",            // old index
    "query": {
      "constant_score": {
        "filter": {
          "range": {
            "data_update_time": {
              "gte": 123456789       // millisecond timestamp from just before the previous reindex started
            }
          }
        }
      }
    }
  },
  "dest": {
    "index": "new_index",            // new index
    "version_type": "external"       // treat the old index's data as authoritative
  }
}
```


Source: juejin.im/post/5de0c453f265da05aa65d8b1