Elasticsearch bulk indexing error: GC overhead causes data nodes to leave the cluster

Recently, while bulk-loading large volumes of data into ES, I ran into a problem: with small data volumes, indexing works fine; with large volumes, some data nodes leave the cluster after indexing has been running for a while. The logs are as follows:

[2018-04-09T21:08:48,481][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][young][4312117][1523595] duration [11.4s], collections [1]/[1.6s], total [11.4s]/[12.9m], memory [27.7gb]->[15.3gb]/[31.8gb], all_pools {[young] [17.4gb]->[13.4mb]/[1.4gb]}{[survivor] [46mb]->[191.3mb]/[191.3mb]}{[old] [10gb]->[14.6gb]/[30.1gb]}             

[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]
[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312123] overhead, spent [412ms] collecting in the last [1.1s]

Clearly, the JVM's garbage collection pauses are far too long (11.4s spent collecting in a 12s window), and this has a lot to do with the heap size being set to 32GB:

ES memory usage and GC metrics: by default, the master node checks the status of the other nodes every 30 seconds. If a node's garbage collection pause exceeds 30 seconds, the master considers it unresponsive and removes it from the cluster.

Setting the heap too large leads to long GC times, and these long stop-the-world pauses can make the cluster mistakenly conclude that the node has left.

However, if the heap is set too small, GC runs too frequently, which hurts ES indexing and search performance.
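As a rule of thumb from the official heap sizing guidance: set the minimum and maximum heap to the same value, give the heap no more than half of physical RAM, and stay below the compressed-oops cutoff of roughly 32GB. A minimal jvm.options sketch, assuming a machine with 64GB of RAM (the machine size and exact values are illustrative):

# jvm.options: keep min and max heap equal to avoid resize pauses
# 31g stays just under the ~32GB compressed-oops threshold
-Xms31g
-Xmx31g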

From the official documentation ( https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html ) and this blog post ( http://www.cnblogs.com/bonelee/p/8063915.html ), the relevant fault-detection settings are:

Setting        Description
ping_interval  How often a node gets pinged. Defaults to 1s.
ping_timeout   How long to wait for a ping response. Defaults to 30s.
ping_retries   How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.
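In elasticsearch.yml these settings live under the discovery.zen.fd prefix; spelled out with their default values for reference, they would read:

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3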

Therefore, increasing ping_timeout and ping_retries gives a node enough time to finish a full GC without being mistakenly removed from the cluster: a node is only dropped after ping_retries consecutive ping timeouts, so the tolerated pause grows to roughly ping_retries × ping_timeout.

# elasticsearch.yml, on each node: relax fault detection so long GC pauses
# are not mistaken for node failure
discovery.zen.fd.ping_timeout: 1000s
discovery.zen.fd.ping_retries: 10
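To verify that GC pauses are actually under control after tuning, keep watching the JvmGcMonitorService lines in the logs shown above, or poll the JVM garbage collection statistics through the node stats API (shown here in Kibana Dev Tools console syntax; any HTTP client against port 9200 works the same way):

GET /_nodes/stats/jvm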

 
