Recently, while storing data in ES at large scale, I ran into a problem: with a small data volume, storage works normally; with a large data volume, some data nodes leave the cluster partway through the storage process. The log is as follows:
[2018-04-09T21:08:48,481][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][young][4312117][1523595] duration [11.4s], collections [1]/[1.6s], total [11.4s]/[12.9m], memory [27.7gb]->[15.3gb]/[31.8gb], all_pools {[young] [17.4gb]->[13.4mb]/[1.4gb]}{[survivor] [46mb]->[191.3mb]/[191.3mb]}{[old] [10gb]->[14.6gb]/[30.1gb]}
[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]
[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312123] overhead, spent [412ms] collecting in the last [1.1s]
Obviously, the JVM's GC pauses are too long (a single collection took 11.4s in the log above), which has a lot to do with the heap size being set to 32GB:
ES memory usage and GC metrics: by default, the master node checks the status of the other nodes every 30 seconds. If the garbage collection duration on any node exceeds 30 seconds, the master node will conclude that the node has left the cluster.
Setting a heap that is too large can lead to long GC times, and these long stop-the-world pauses can make the cluster mistakenly think that the node has detached.
However, if the heap size is set too small, the GC will be too frequent, which will affect the efficiency of ES storage and search.
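To judge whether GC is the culprit, it can help to measure how much of each reporting window a node spent collecting. The following is a minimal sketch that parses the `overhead` lines from the log shown earlier. The function name `gc_overhead` and the regular expression are my own illustration, and the sketch assumes that only the units `ms`, `s`, and `m` appear in these lines.

```python
import re

# Matches the "overhead" lines written by JvmGcMonitorService, e.g.
# "... overhead, spent [11.4s] collecting in the last [12s]"
OVERHEAD_RE = re.compile(
    r"overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

# Conversion factors to seconds (assumption: only these units occur).
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead(line):
    """Return (gc_seconds, window_seconds, fraction), or None if no match."""
    match = OVERHEAD_RE.search(line)
    if match is None:
        return None
    spent = float(match["spent"]) * UNIT_SECONDS[match["spent_unit"]]
    window = float(match["window"]) * UNIT_SECONDS[match["window_unit"]]
    return spent, window, spent / window
```

Applied to the two overhead lines in the log, this gives roughly 95% (11.4s of 12s) and 37% (412ms of 1.1s) of the window spent in GC.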
According to the official documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html and the blog post http://www.cnblogs.com/bonelee/p/8063915.html, fault detection is controlled by the following settings:
Setting | Description |
---|---|
discovery.zen.fd.ping_interval | How often a node gets pinged. Defaults to 1s. |
discovery.zen.fd.ping_timeout | How long to wait for a ping response, defaults to 30s. |
discovery.zen.fd.ping_retries | How many ping failures / timeouts cause a node to be considered failed. Defaults to 3. |
Therefore, increasing ping_timeout and ping_retries prevents a node from being removed from the cluster by mistake and gives it enough time to finish a long GC:
discovery.zen.fd.ping_timeout: 1000s
discovery.zen.fd.ping_retries: 10
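To get a feel for what these values mean, here is a rough back-of-the-envelope sketch. It assumes that a node is only declared failed after `ping_retries` consecutive pings have each waited the full `ping_timeout`, and it ignores `ping_interval`; the helper name `max_unresponsive_seconds` is hypothetical and this is an approximation, not how Elasticsearch itself computes anything.

```python
def max_unresponsive_seconds(ping_timeout_s, ping_retries):
    """Rough upper bound: every retry may wait the full timeout."""
    return ping_timeout_s * ping_retries


# Documented defaults for Zen fault detection: ping_timeout 30s, ping_retries 3.
default_window = max_unresponsive_seconds(30, 3)

# Values used in this post: ping_timeout 1000s, ping_retries 10.
tuned_window = max_unresponsive_seconds(1000, 10)

print(default_window)  # 90
print(tuned_window)    # 10000
```

Under this estimate the tolerance grows from about 90 seconds to about 10000 seconds (close to 2.8 hours). The trade-off is that a node that has really died would also take that long to be noticed, so smaller values may be worth trying first.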