21 recommendations for tuning Elasticsearch indexing and query performance

Elasticsearch deployment recommendations

1. Choose a sound hardware configuration: use SSDs whenever possible

The biggest bottleneck in Elasticsearch is often disk read and write performance, especially random reads. Searches on SSDs (PCI-E card SSDs or SATA SSDs) are typically 5 to 10 times faster than on mechanical hard drives (SATA or SAS disks); the gain in write performance is less noticeable.
For document retrieval workloads with high query performance requirements, consider SSD storage and provision memory and disk at a ratio of 1:10. For log analysis workloads with less demanding query requirements, mechanical hard disks are an option, with memory and disk provisioned at a ratio of 1:50. Keep the data stored on a single node within 2 TB where possible, and never above 5 TB, to avoid slow queries and system instability.
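The memory-to-disk ratios above can be turned into a quick sizing check. The following is a minimal Python sketch, not an official formula: the function name `max_disk_gb` is hypothetical, and treating the 5 TB limit as a hard cap of 5000 GB is an assumption made for illustration.

```python
# Memory-to-disk ratios from the text: 1:10 for SSD-backed retrieval workloads,
# 1:50 for HDD-backed log analysis workloads.
DISK_PER_MEMORY = {"ssd": 10, "hdd": 50}
NODE_DATA_CAP_GB = 5000  # "not more than 5 TB" per node, assumed here to mean 5000 GB


def max_disk_gb(memory_gb: int, storage: str) -> int:
    """Return the largest data volume (in GB) a node with this much RAM should hold."""
    ratio = DISK_PER_MEMORY[storage]
    return min(memory_gb * ratio, NODE_DATA_CAP_GB)


print(max_disk_gb(64, "ssd"))   # 640
print(max_disk_gb(128, "hdd"))  # 5000 (128 * 50 = 6400, capped at the per-node limit)
```

A result above roughly 2000 GB is still within the hard limit, but the text recommends staying under 2 TB per node where possible.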

2. Give the JVM half of the machine's memory, but no more than 32 GB

Edit config/jvm.options and set -Xms and -Xmx to the same value, about half of the machine's memory, leaving the other half for the operating system cache. The JVM heap should be at least 2 GB, otherwise Elasticsearch may fail to start or may run out of memory. It should not exceed 32 GB, because above that size the JVM disables compressed object pointers and memory is wasted. On machines with more than 64 GB of memory, -Xms30g -Xmx30g is recommended. Large heaps lead to long garbage collection pauses, so consider configuring the ZGC or G1 garbage collector.
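The heap sizing rule can be sketched as a small calculation. This is only an illustration under stated assumptions: the helper names are hypothetical, and the 30 GB cap is applied to every machine (the text recommends 30 GB for machines above 64 GB and "no more than 32 GB" in general).

```python
def recommended_heap_gb(machine_memory_gb: int) -> int:
    """Half of the machine's memory, at least 2 GB and capped at 30 GB."""
    half = machine_memory_gb // 2
    return max(2, min(half, 30))


def jvm_heap_options(machine_memory_gb: int) -> list:
    """Build the two jvm.options lines; -Xms and -Xmx get the same value."""
    heap = recommended_heap_gb(machine_memory_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


print(jvm_heap_options(16))   # ['-Xms8g', '-Xmx8g']
print(jvm_heap_options(128))  # ['-Xms30g', '-Xmx30g']
```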

3. In large clusters, configure dedicated master nodes to avoid split brain

The Elasticsearch master node manages cluster metadata, handles index creation and deletion, tracks nodes joining and leaving, and regularly broadcasts the latest cluster state to every node. In larger clusters, configure dedicated master nodes that are responsible only for cluster management: they store no data and carry no read or write load.

# Dedicated master node configuration (config/elasticsearch.yml):
node.master: true
node.data: false
node.ingest: false

# Data node configuration (config/elasticsearch.yml):
node.master: false
node.data: true
node.ingest: true

By default, every Elasticsearch node is both a master-eligible node and a data node. Set minimum_master_nodes to a majority (more than half) of the master-eligible nodes. This tells Elasticsearch not to hold a master election while too few master-eligible nodes are available, and to wait until enough of them are present. This setting applies to Elasticsearch 6.x and earlier; from 7.0 onward it is ignored because the cluster manages its own election quorum.
For example, in a 3-node cluster, change the minimum number of master nodes from the default value of 1 to 2.

# Minimum number of master nodes (config/elasticsearch.yml):
discovery.zen.minimum_master_nodes: 2
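The "more than half" rule is the usual majority quorum. A minimal Python sketch of the arithmetic; the function name is hypothetical and simply mirrors the setting name.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Smallest node count that is strictly more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
```

With an even number of master-eligible nodes the quorum does not tolerate any more failures than with one node fewer (4 nodes need 3, just as 5 nodes do), which is why odd counts are the common choice.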

4. Linux operating system tuning

Disable the swap partition so that memory swapping does not degrade performance.

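# Comment out the lines in /etc/fstab that contain swap
sed -i '/swap/s/^/#/' /etc/fstab
swapoff -a

# Maximum number of open files per user; use the officially recommended 65536 or a larger value
echo "* - nofile 655360" >> /etc/security/limits.conf

# Raise the per-user thread limit
echo "* - nproc 131072" >> /etc/security/limits.conf

# Maximum number of memory map areas a single process may use
echo "vm.max_map_count = 655360" >> /etc/sysctl.conf

# Apply the parameter changes immediately
sysctl -p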

Index performance tuning recommendations

1. Set a reasonable number of index shards and replicas

Set the number of index shards to an integer multiple of the number of cluster nodes. Set the number of replicas to 0 during the initial data import, and to 1 in production (with one replica, no data is lost if any single node goes down; more replicas take up more storage space, lower the operating system cache hit rate, and do not necessarily improve search performance). Avoid placing more than 3 shards of a single index on one node, and keep each shard between 10 and 40 GB. The number of shards cannot be changed once the index is created, but the number of replicas can.
Elasticsearch 6.x and earlier default to 5 shards and 1 replica per index; starting with Elasticsearch 7.0 the default is 1 shard and 1 replica.
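The sizing rules above (a multiple of the node count, 10 to 40 GB per shard, at most 3 shards of an index per node) can be combined into a rough planning helper. This is a sketch under assumptions: the function name `plan_shards` and the 30 GB target shard size are illustrative choices, not values prescribed by Elasticsearch.

```python
import math


def plan_shards(total_data_gb: float, node_count: int, target_shard_gb: float = 30):
    """Return (shard_count, shards_per_node) with shard_count a multiple of node_count."""
    # Shards needed so that each one stays near the target size.
    needed = max(1, math.ceil(total_data_gb / target_shard_gb))
    # Round up to the next integer multiple of the node count.
    shards_per_node = math.ceil(needed / node_count)
    return shards_per_node * node_count, shards_per_node


# Hypothetical cluster: 600 GB of primary data on 7 data nodes.
print(plan_shards(600, 7))  # (21, 3)
```

If the second value comes out above 3, the rule of thumb in the text suggests adding nodes or splitting the data across several indices rather than packing more shards onto each node.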

# Index settings
curl -XPUT 'http://localhost:9200/fulltext001?pretty' -H 'Content-Type: application/json' \
-d '{
    "settings": {
        "refresh_interval": "30s",
        "merge.policy.max_merged_segment": "1000mb",
        "translog.durability": "async",
        "translog.flush_threshold_size": "2gb",
        "translog.sync_interval": "100s",
        "index": {
            "number_of_shards": "21",
            "number_of_replicas": "0"
        }
    }
}'

# Mapping settings
curl -XPOST 'http://localhost:9200/fulltext001/doc/_mapping?pretty' -H 'Content-Type: application/json' \
-d '{
    "doc": {
        "_all": {
            "enabled": false
        },
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word"
            },
            "id": {
                "type": "keyword"
            }
        }
    }
}'

# Example: index a document
curl -XPUT 'http://localhost:9200/fulltext001/doc/1?pretty' -H 'Content-Type: application/json' \
-d '{
    "id": "https://www.huxiu.com/article/215169.html",
    "content": "“娃娃机,迷你KTV,VR体验馆,堪称商场三大标配‘神器’。”一家地处商业中心的大型综合体负责人告诉懂懂笔记,在过去的这几个月里,几乎所有的综合体都“标配”了这三种“设备”…"
}'

# Example: change the number of replicas
curl -XPUT "http://localhost:9200/fulltext001/_settings" -H 'Content-Type: application/json' \
-d '{
    "number_of_replicas": 1
}'

2. Use bulk requests

Bulk requests perform much better than single-document indexing requests. When writing data through the bulk API, submit 5 to 15 MB of data per batch. For example, with records of about 1 KB each, batches of about 10,000 records perform well; with records of about 5 KB each, batches of about 2,000 records perform well.

# Bulk request API: every action line and every source line must be on its own line,
# and the body must end with a newline
curl -XPOST "http://localhost:9200/_bulk" -H 'Content-Type: application/json' \
-d'
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
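To keep each bulk request within the suggested size range, documents can be grouped by encoded size rather than by a fixed count. The following Python sketch shows one way to do that; the function name `bulk_batches` and the 10 MB default are assumptions, and the action line omits `_type`, which older 6.x clusters may still require.

```python
import json


def bulk_batches(docs, index_name, max_bytes=10 * 1024 * 1024):
    """Yield newline-delimited bulk bodies, each at most max_bytes when possible."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc, ensure_ascii=False)
        pair = action + "\n" + source + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Start a new batch when adding this document would exceed the limit.
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)


docs = [{"n": i} for i in range(5)]
batches = list(bulk_batches(docs, "test", max_bytes=100))
print(len(batches))  # 3
```

Each yielded string ends with a newline and can be sent as the body of a `POST /_bulk` request. A single document larger than `max_bytes` is still sent, alone in its own batch.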

3. Send data from multiple processes or threads

A single thread sending bulk requests often cannot make full use of the server's CPU. Try increasing the number of writer threads, or send write requests to the Elasticsearch servers from several clients. As with the bulk request size, the optimal number of workers can only be found by testing: increase the number of workers gradually until I/O or CPU on the cluster is saturated.
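One way to experiment with the number of writers is a thread pool whose size is a parameter. A minimal Python sketch: `send_batch` is a hypothetical callable supplied by the caller (for example, a function that posts one bulk body to the cluster), not part of any Elasticsearch client.

```python
from concurrent.futures import ThreadPoolExecutor


def send_in_parallel(batches, send_batch, workers=4):
    """Send every batch using `workers` threads and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_batch, batches))


# Stand-in sender that only measures the batch; a real one would call the bulk API.
print(send_in_parallel(["a", "bb", "ccc"], len, workers=2))  # [1, 2, 3]
```

Rerunning the same load with a larger `workers` value while watching CPU and I/O on the cluster is the kind of stepwise test the paragraph above describes.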

4. Increase the refresh interval

In Elasticsearch, the lightweight process of writing and opening a new segment is called a refresh. By default, every shard refreshes automatically once per second. This is why Elasticsearch is described as near real-time search: changes to a document are not visible to search immediately, but become visible within one second.
Not every use case needs a refresh every second. If you are using Elasticsearch to index large volumes of log files, you may prefer indexing speed over near real-time search. In that case, set refresh_interval on each index to reduce the refresh frequency.

# Set the refresh interval
curl -XPUT "http://localhost:9200/index" -H 'Content-Type: application/json' \
-d'{
    "settings": {
        "refresh_interval": "30s"
    }
}'

refresh_interval can be updated dynamically on an existing index. In production, when you are building a large new index, you can turn automatic refresh off first and turn it back on when you start using the index.

curl -XPUT "http://localhost:9200/index/_settings" -H 'Content-Type: application/json' \
-d'{ "refresh_interval": -1 }'

curl -XPUT "http://localhost:9200/index/_settings" -H 'Content-Type: application/json' \
-d'{ "refresh_interval": "1s" }'

5. Configure the transaction log parameters

The transaction log (translog) prevents data loss when a node fails. It is designed to help a shard recover operations that would otherwise be lost if a failure occurred before the data was flushed from memory to disk. Elasticsearch writes the translog to disk (fsync) automatically in the background: by default every 5 seconds, or when the translog file grows beyond 512 MB, or after every successful index, delete, update, or bulk request.
When creating an index, you can change the default 5-second sync interval, for example to 60 seconds with index.translog.sync_interval: "60s". After the index is created, translog parameters can be adjusted dynamically. Setting "index.translog.durability": "async" turns off the synchronous fsync after each index, bulk, or similar operation, so the translog is only written out by the periodic sync and the file size threshold.

# Dynamically update translog settings on an existing index
curl -XPUT "http://localhost:9200/index/_settings" -H 'Content-Type: application/json' \
-d'{
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "2gb"
}'

6. Design mappings with appropriate field types

When a document is written to an index that does not exist, Elasticsearch automatically creates the index and infers the field types from the document content. The inferred mapping is often not the most efficient one, so design the field types to suit the application scenario instead.

# For example, index one record
curl -XPUT "http://localhost:9200/twitter/doc/1?pretty" -H 'Content-Type: application/json' \
-d'{
    "user": "kimchy",
    "post_date": "2009-11-15T13:12:00",
    "message": "Trying out Elasticsearch, so far so good?"
}'

If you query the mapping that Elasticsearch created automatically, you will find that post_date was recognized as a date type, but message and user were both mapped as text with a redundant keyword sub-field, which slows down writes and takes up more disk space.

{
    "twitter": {
        "mappings": {
            "doc": {
                "properties": {
                    "message": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "post_date": {
                        "type": "date"
                    },
                    "user": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        },
        "settings": {
            "index": {
                "number_of_shards": "5",
                "number_of_replicas": "1"
            }
        }
    }
}

Configure the number of shards, the number of replicas, the field types, and the analyzers to suit the business scenario. If you do not need a field that merges all other fields, disable the _all field and merge selected fields with copy_to instead.

curl -XPUT "http://localhost:9200/twitter?pretty" -H 'Content-Type: application/json' \
-d'{
    "settings": {
        "index": {
            "number_of_shards": "20",
            "number_of_replicas": "0"
        }
    }
}'

curl -XPOST "http://localhost:9200/twitter/doc/_mapping?pretty" -H 'Content-Type: application/json' \
-d'{
    "doc": {
        "_all": {
            "enabled": false
        },
        "properties": {
            "user": {
                "type": "keyword"
            },
            "post_date": {
                "type": "date"
            },
            "message": {
                "type": "text",
                "analyzer": "cjk"
            }
        }
    }
}'

Query tuning recommendations

1. Use the filter cache and the shard request cache

By default, Elasticsearch computes a relevance score for every document a query returns. In scenarios other than full-text search, users do not care how relevant a result is to the query conditions; they only want to locate the target data precisely. In that case, use a filter so that Elasticsearch skips scoring and caches the filter's result set where possible, which makes subsequent queries containing the same filter more efficient.

# Regular query
curl -XGET "http://localhost:9200/twitter/_search" -H 'Content-Type: application/json' \
-d'{
    "query": {
        "match": {
            "user": "kimchy"
        }
    }
}'

# Filter query
curl -XGET "http://localhost:9200/twitter/_search" -H 'Content-Type: application/json' \
-d'{
    "query": {
        "bool": {
            "filter": {
                "match": {
                    "user": "kimchy"
                }
            }
        }
    }
}'

The shard request cache stores aggregations, suggestions, and hit counts. It does not cache the returned documents, so it only takes effect for requests that return no hits, that is, size=0 (older versions expressed this as search_type=count).
The following parameter sets the size of the shard request cache. The default is 1% of the JVM heap, and you can also set it manually in config/elasticsearch.yml.

indices.requests.cache.size: 1%

To view cache memory usage (name is the node name, query_cache is the filter cache, request_cache is the shard request cache, fielddata is the field data cache, and segments is the index segments):

curl -XGET "http://localhost:9200/_cat/nodes?h=name,query_cache.memory_size,request_cache.memory_size,fielddata.memory_size,segments.memory&v" 

2. Use routing

When Elasticsearch writes a document, it uses a formula to route the document to one shard of the index. The default formula is:

shard_num = hash(_routing) % num_primary_shards

The value of _routing defaults to the _id field. Depending on the business scenario, you can use a frequently queried field as the routing field instead. For example, using the user ID or the region as the routing field lets queries skip unnecessary shards, which speeds them up.
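The routing formula can be illustrated with a few lines of Python. This is only a sketch of the idea: Elasticsearch uses its own Murmur3-based hash internally, and CRC32 is substituted here as an assumption because it is available in the standard library and deterministic, so the shard numbers will not match a real cluster.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """shard_num = hash(_routing) % num_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same routing value always maps to the same shard, so a query that
# supplies routing=user1 only needs to visit that one shard.
print(shard_for("user1", 5) == shard_for("user1", 5))  # True
print(0 <= shard_for("user2", 5) < 5)                  # True
```

The formula also shows why the number of primary shards cannot be changed after index creation: a different modulus would send existing routing values to different shards.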

# Specify routing when indexing
curl -XPUT "http://localhost:9200/my_index/my_type/1?routing=user1" -H 'Content-Type: application/json' \
-d'{
    "title": "This is a document",
    "author": "user1"
}'

# Query without routing: all shards must be searched
curl -XGET "http://localhost:9200/my_index/_search" -H 'Content-Type: application/json' \
-d'{
    "query": {
        "match": {
            "title": "document"
        }
    }
}'

# Response
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    }
    ... ...
}

# Query with routing: only 1 shard is searched
curl -XGET "http://localhost:9200/my_index/_search?routing=user1" -H 'Content-Type: application/json' \
-d'{
    "query": {
        "match": {
            "title": "document"
        }
    }
}'

# Response
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    }
    ... ...
}

3. Force merge read-only indices and close indices that hold historical data

Read-only indices benefit from being merged into a single large segment, which reduces index fragmentation and the amount of memory permanently held on the JVM heap. A force merge consumes a lot of disk I/O, so schedule it for off-peak hours (for example, early in the morning). If the business no longer needs to query historical indices, consider closing them to reduce the JVM memory footprint.

# Force merge API
curl -XPOST "http://localhost:9200/abc20180923/_forcemerge?max_num_segments=1"

# Close index API
curl -XPOST "http://localhost:9200/abc2017*/_close"
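When indices are named by day, as in the examples above, the candidates for a force merge or a close can be picked by date. A minimal Python sketch, assuming a hypothetical naming convention of a fixed prefix followed by a yyyymmdd suffix; the function name and the cutoff dates are illustrative only.

```python
from datetime import date, datetime


def indices_older_than(index_names, prefix, cutoff):
    """Return names of the form <prefix><yyyymmdd> whose date is before `cutoff`."""
    selected = []
    for name in index_names:
        suffix = name[len(prefix):]
        if not name.startswith(prefix) or len(suffix) != 8 or not suffix.isdigit():
            continue  # not a daily index with this prefix
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a valid date
        if day < cutoff:
            selected.append(name)
    return selected


names = ["abc20171231", "abc20180923", "abc20180924", "other20170101"]
print(indices_older_than(names, "abc", date(2018, 1, 1)))  # ['abc20171231']
```

The returned names could then be passed to the `_forcemerge` or `_close` calls shown above, for instance from a job scheduled during off-peak hours.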

4. Configure a suitable analyzer

Elasticsearch has many built-in analyzers, including standard, cjk, and nGram, and you can also install in-house or open-source analyzers. Choose an analyzer that fits the business scenario rather than using the default standard analyzer everywhere.

Common analyzers:

  • standard: the default analyzer; splits English text on whitespace and Chinese text into individual characters.
  • cjk: indexes CJK text as two-character tokens (bigrams), which guarantees recall.
  • nGram: can split text down to individual letters; use it together with phrase queries (match_phrase).
  • IK: a popular Chinese analyzer that segments text according to Chinese semantics and supports custom dictionaries.
  • pinyin: lets users type pinyin letters to find the matching keywords.
  • aliws: developed in-house by Alibaba; supports several segmentation models and algorithms, has a rich dictionary, and segments accurately, which suits scenarios with high precision requirements such as e-commerce.

# Analyzer test API
curl -XPOST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' \
-d'{
    "analyzer": "ik_max_word",
    "text": "南京市长江大桥"
}'

5. Configure dedicated query aggregation (coordinating) nodes

A query aggregation node forwards search requests to other nodes, collects and merges their results, and responds to the client that issued the query. Giving these nodes higher CPU and memory specifications speeds up queries and improves the cache hit rate.

# Query aggregation node configuration (config/elasticsearch.yml):
node.master: false
node.data: false
node.ingest: false

6. Limit the number of records and the fields a query reads

By default, a query returns the first 10 sorted records, and at most 10,000 records can be read through from and size. Use the from and size parameters to control the range of records read and avoid reading too many at once. Use the _source parameter to control which fields are returned and avoid reading large fields.

# Example query request
curl -XGET 'http://localhost:9200/fulltext001/_search?pretty' -H 'Content-Type: application/json' \
-d '{
    "from": 0,
    "size": 10,
    "_source": "id",
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "content": "虎嗅"
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "id": {
                "order": "asc"
            }
        }
    ]
}'

7. Set terminate_after so that queries return quickly

If a query does not need an exact count of matching records, use terminate_after to cap the number of records each shard collects at N, and set a query timeout with timeout. The "terminated_early" field in the response indicates whether the query ended early.

# terminate_after query syntax example
curl -XGET "http://localhost:9200/twitter/_search" -H 'Content-Type: application/json' \
-d'{
    "from": 0,
    "size": 10,
    "timeout": "10s",
    "terminate_after": 1000,
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "user": "elastic"
                }
            }
        }
    }
}'

8. Avoid deep pagination

By default, Elasticsearch only allows access to the first 10,000 sorted results, and response times generally grow the further you page into them. search_after queries are more lightweight: if each page needs only 10 results, each shard returns only the 10 results that follow search_after. The total amount of data returned therefore depends only on the number of shards and the page size, not on how many records have already been read.

# search_after query syntax example
curl -XGET "http://localhost:9200/twitter/_search" -H 'Content-Type: application/json' \
-d'{
    "size": 10,
    "query": {
        "match": {
            "message": "Elasticsearch"
        }
    },
    "sort": [
        {
            "_score": {
                "order": "desc"
            }
        },
        {
            "_id": {
                "order": "asc"
            }
        }
    ],
    "search_after": [
        0.84290016,     // score of the last doc in the previous response
        "1024"          // id of the last doc in the previous response
    ]
}'
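Paging with search_after means copying the sort values of the last hit on one page into the request for the next page. The following Python sketch shows that step on plain dictionaries; the function name `next_page_body` is hypothetical, and the response shape assumed here is the usual `hits.hits[].sort` array that a sorted search returns.

```python
import copy


def next_page_body(previous_body, previous_response):
    """Build the request body for the next page, or return None when there are no more hits."""
    hits = previous_response["hits"]["hits"]
    if not hits:
        return None
    body = copy.deepcopy(previous_body)  # leave the caller's body untouched
    body.pop("from", None)               # search_after replaces offset-based paging
    body["search_after"] = hits[-1]["sort"]
    return body


first_body = {"size": 10, "sort": [{"_score": "desc"}, {"_id": "asc"}]}
fake_response = {"hits": {"hits": [
    {"_id": "7", "sort": [0.9, "7"]},
    {"_id": "1024", "sort": [0.84290016, "1024"]},
]}}
print(next_page_body(first_body, fake_response)["search_after"])  # [0.84290016, '1024']
```

The sort must include a tiebreaker that is unique per document (the example above uses _id), otherwise documents with equal sort values can be skipped or repeated between pages.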

9. Avoid leading-wildcard fuzzy matching

Elasticsearch supports fuzzy matching with the * and ? wildcards and with regular expressions by default. Running fuzzy matches against an index that holds a large amount of data, especially with a leading wildcard, usually takes a long time and can even cause out-of-memory errors. Avoid such operations in production environments that serve highly concurrent queries.
One customer needed fuzzy queries on license plate numbers. Queries of the form "license plate number: *A8848*" often drove up the load on the whole cluster. The problem was solved by preprocessing the data: a redundant field "license plate number.keyword" was added, and every license plate number was split in advance into 1-gram, 2-gram, 3-gram, and so on up to 7-gram tokens stored in that field. Example field content: Shanghai, A, 8, 4, Shanghai A, A8, 88, 84, 48, Shanghai A8, ... Shanghai A88488 (here "Shanghai" stands for the single-character province prefix on the plate). Querying "license plate number.keyword: A8848" then resolves the original performance problem.
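The preprocessing step described above amounts to generating every substring of length 1 to 7 for each plate. A minimal Python sketch; the function name `plate_ngrams` is hypothetical, and the sample plate 沪A88488 is reconstructed from the example tokens (沪 is the one-character prefix for Shanghai).

```python
def plate_ngrams(plate: str, max_n: int = 7) -> list:
    """Return the distinct substrings of length 1..max_n, shortest first, in order of appearance."""
    grams = []
    for n in range(1, min(max_n, len(plate)) + 1):
        for start in range(len(plate) - n + 1):
            grams.append(plate[start:start + n])
    # dict.fromkeys keeps the first occurrence of each token and drops repeats.
    return list(dict.fromkeys(grams))


tokens = plate_ngrams("沪A88488")
print(tokens[:9])          # ['沪', 'A', '8', '4', '沪A', 'A8', '88', '84', '48']
print("A8848" in tokens)   # True
```

Because "A8848" is stored as an exact token, the query becomes a plain term lookup instead of a wildcard scan, at the cost of extra index size.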

10. Avoid sparse indices

Versions before Elasticsearch 6.x allow multiple types under one index by default, Elasticsearch 6.x allows only one type per index, and Elasticsearch 7.x only allows the type name "_doc". Creating several types with different fields under one index, or merging hundreds of indices with different fields into a single index, makes the index sparse.
Create only one type per index, and store data with different fields in separate indices rather than merging it into one large index. Each query then reads only the index it needs, which avoids scanning every record in one large index and speeds up the query.

11. Add cluster nodes and upgrade node specifications

In general, the more server nodes a cluster has and the higher their hardware specifications, the greater the processing power of the Elasticsearch cluster.

Reprinted from: https://mp.weixin.qq.com/s/pAuYJxAeJuO_lTNKX4LUpg

Origin www.cnblogs.com/sanduzxcvbnm/p/12624662.html