The Road to Elasticsearch Optimization

I had never worked with Elasticsearch before, but I had admired the name for a long time and looked forward to seeing its elegance and power for myself one day. As it happens, the usage scenario of a recent project requirement was a good fit for Elasticsearch, so I started to lift its veil...

Data reaches es from two sources. The first is a scheduled task that periodically imports the latest collected data (the hive 2 es mode). The second is a Spark job that loads the huge volume of historical data (the sparkSql 2 es mode). Below is a summary of some of the problems we ran into along the way and how we solved them.
The optimizations below were made against the following configuration:
5 nodes
5 shards
1 replica
4 cores
8 GB memory
ordinary hard drive
1. From product requirements to solution design
    Performance problems are often caused by mistakes in the original storage design. Storage design here means planning the cluster, machines, indexes, shards, replicas, and storage space sensibly. One of our guiding principles is to size the number of indexes, and the number of docs in each index, deliberately. The es cluster our company provides for general business use has 5 nodes, each with 4 cores and 8 GB of memory.
(1) Estimate the number of docs per index before creating indexes; indexes can be split by business type and by month.
(2) A single index has 5 shards by default. A very large data volume calls for more shards, but the more shards there are, the more resources the merging and aggregation steps of a query require.
(3) The number of replicas defaults to 1. Replica shards provide disaster recovery; the larger the number, the greater the IO spent on data synchronization.
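
As a rough sketch of points (1) to (3), the snippet below builds a per-business, per-month index name and the settings body that would be sent when creating that index. The naming scheme (`orders-2018.03`) and the helper names are made up for illustration; only `number_of_shards` and `number_of_replicas` are actual Elasticsearch index settings.

```python
import json
from datetime import date


def monthly_index_name(business: str, day: date) -> str:
    """Build a name like 'orders-2018.03' (hypothetical naming scheme)."""
    return f"{business}-{day:%Y.%m}"


def index_settings(shards: int = 5, replicas: int = 1) -> dict:
    """Body for creating an index with an explicit shard and replica count."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


name = monthly_index_name("orders", date(2018, 3, 14))
print(name)  # orders-2018.03
print(json.dumps(index_settings(), indent=2))
```

Splitting by month keeps the doc count of each index bounded, so the shard count can stay at the value the cluster was sized for.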
 
2. Write performance optimization
(1) Set the number of replicas to 0 to remove the IO spent synchronizing data between primary and replica shards (suitable for business scenarios where writes are relatively concentrated).
(2) Lower the index refresh frequency. The default is one refresh every 1s; raising the interval to 10s reduces the system resources consumed by refreshing (suitable for scenarios with low real-time query requirements).
(3) Do not write too much data to es in a single request; 5-15MB per request is recommended. Otherwise an EsRejectedExecutionException may occur, which means the node has reached its bottleneck, and you need to reduce concurrency, upgrade the hardware, or add nodes.
(4) Use bulk to import large amounts of data.
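
The four points above can be sketched together as follows. This is only an illustration, not the import job we actually ran: the helper `bulk_payloads`, the 10MB default, and the idea of restoring the settings after the load are assumptions. The snippet builds the settings body for a heavy write window and splits documents into newline-delimited bulk bodies that stay under a byte limit.

```python
import json
from typing import Iterable, Iterator

# Settings body for the write window: no replicas, refresh every 10s.
BULK_LOAD_SETTINGS = {"index": {"number_of_replicas": 0, "refresh_interval": "10s"}}
# Assumed values to restore afterwards (the defaults mentioned in the text).
AFTER_LOAD_SETTINGS = {"index": {"number_of_replicas": 1, "refresh_interval": "1s"}}


def bulk_payloads(index: str, docs: Iterable[dict],
                  max_bytes: int = 10 * 1024 * 1024) -> Iterator[str]:
    """Yield newline-delimited bulk bodies, each kept under max_bytes.

    A single document larger than max_bytes is still sent, alone in its body.
    """
    lines = []
    size = 0
    for doc in docs:
        # Each document is an action line followed by its source line.
        # Older es releases may also expect a type in the action line.
        pair = (json.dumps({"index": {"_index": index}}) + "\n"
                + json.dumps(doc) + "\n")
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)


docs = [{"n": i} for i in range(5)]
for body in bulk_payloads("demo", docs, max_bytes=80):
    print(len(body.encode("utf-8")), "bytes")
```

Each yielded body would be sent as one bulk request; if rejections still appear, lowering `max_bytes` or the number of concurrent writers is the first thing to try.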
3. Query optimization
(1) Routing settings
The system decides which shard stores a document with the formula shard_num = hash(_routing) % num_primary_shards. By default, _routing is the document ID. If the business scenario allows routing by a certain field, then searches on that field can go straight to the corresponding shard instead of scanning every shard of the index, which greatly improves performance. Note, however, that once documents have been routed, num_primary_shards cannot easily be changed: the formula would start returning different shards for the same routing values, and the routing strategy would fail.
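
To make the formula concrete, here is a small simulation. It is an assumption-laden sketch: `zlib.crc32` stands in for the hash that Elasticsearch uses internally, and the `user-N` routing values are invented, so the shard numbers printed will not match a real cluster. What it does show is why changing num_primary_shards breaks routing.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """shard_num = hash(_routing) % num_primary_shards (CRC32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


user_ids = ["user-1", "user-2", "user-3", "user-4"]

# Placement with the original 5 primary shards.
with_5 = {u: shard_for(u, 5) for u in user_ids}
# Placement the same formula would give if the shard count became 6.
with_6 = {u: shard_for(u, 6) for u in user_ids}

print(with_5)
print(with_6)
# Routing values whose documents could no longer be found on the expected shard.
print([u for u in user_ids if with_5[u] != with_6[u]])
```

The same routing value always maps to the same shard as long as the shard count is fixed, which is what lets a search that passes that routing value touch one shard instead of all of them.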
 
(2) Use filter
Filters matter a great deal for exact-match queries because they are very efficient: a filter does not compute relevance (it skips the entire scoring stage) and its result is easy to cache.
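
A minimal sketch of such a query body, assuming hypothetical field names (`status`, `created_at`) and a hypothetical helper `exact_match_query`. Both conditions sit in the `filter` clause of a `bool` query, so neither contributes to scoring.

```python
import json


def exact_match_query(field: str, value, date_field: str, since: str) -> dict:
    """Search body whose conditions all run in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},                 # exact value match
                    {"range": {date_field: {"gte": since}}},  # lower bound on a date
                ]
            }
        }
    }


body = exact_match_query("status", "paid", "created_at", "2018-03-01")
print(json.dumps(body, indent=2))
```

Conditions that only answer yes or no belong in `filter`; conditions that should influence ranking would go in `must` instead.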
 
(3) Use segment merge
To support near-real-time queries, lucene generates many small segments as data is inserted and updated, and a search has to visit every one of them. Merging reduces the number of segments each query must visit and so speeds up searching. Recommended: max_num_segments=1
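
As a sketch, the merge can be requested through the force merge endpoint. The helper name and the index name below are illustrative, and the snippet only builds the request rather than sending it. It is best suited to indexes that are no longer being written to, such as a finished monthly index.

```python
from urllib.parse import urlencode


def force_merge_request(index: str, max_num_segments: int = 1):
    """Return the (method, path) pair for force-merging one index."""
    query = urlencode({"max_num_segments": max_num_segments})
    return "POST", f"/{index}/_forcemerge?{query}"


print(force_merge_request("orders-2018.03"))
# ('POST', '/orders-2018.03/_forcemerge?max_num_segments=1')
```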
(4) Narrow the query scope according to the business situation
4. Aggregation and sorting
Elasticsearch searches through an inverted index and analyzes through DocValues columnar storage, which unifies the search and analysis scenarios in one distributed system. By setting DocValues
 
