Elasticsearch principle analysis: node startup and shutdown

This chapter analyzes the startup and shutdown process of a single node: how the node parses its configuration, checks the environment, and initializes its internal modules, as well as what happens when a node is "killed".

1. What did the startup process do

In general, the task of the node startup process is to do the following types of work:

  • Parse the configuration, including configuration files and command line parameters.
  • Check the external and internal environments, such as JVM version, operating system kernel parameters, etc.
  • Initialize internal resources, create internal modules, and initialize detectors.
  • Start each submodule and keepalive thread.

2. Startup process analysis

2.1 Start script

When we start ES through the startup script bin/elasticsearch, the script launches the Java program via exec. The code is as follows:

exec \ #execute the command
    "$JAVA" \ #path to the Java binary
    $ES_JAVA_OPTS \ #JVM options
    -Des.path.home="$ES_HOME" \ #set the path.home path
    -Des.path.conf="$ES_PATH_CONF" \ #set the path.conf path
    -Des.distribution.flavor="$ES_DISTRIBUTION_FLAVOR" \
    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
    -cp "$ES_CLASSPATH" \ #set the Java classpath
    org.elasticsearch.bootstrap.Elasticsearch \ #class containing the main function
    "$@" #command line arguments passed to main

The ES_JAVA_OPTS variable holds the JVM parameters; its content comes from parsing the config/jvm.options configuration file.

If the -d parameter is added when executing the startup script:

bin/elasticsearch -d

Then the startup script adds <&- & to the exec command. <&- closes standard input (fd 0 of the process), and & makes the process run in the background.

2.2 Parsing command line parameters and configuration files

The currently supported command line parameters are listed in the following table; none of them is used at startup by default:

parameter          meaning
-E                 Set a configuration item, e.g. the cluster name: -E "cluster.name=my_cluster". Usually set in the configuration file rather than on the command line
-V, --version      Print version information
-d, --daemonize    Start in the background (daemonize)
-h, --help         Print help information
-p, --pidfile      Create a pid file at the given path at startup, holding the pid of the current process; the process can later be stopped using this pid file
-q, --quiet        Turn off the console's standard output and standard error
-s, --silent       Print minimal information to the terminal (the default is normal)
-v, --verbose      Print detailed information to the terminal

In actual engineering applications, it is recommended to add -d and -p to the startup parameters, for example:

bin/elasticsearch -d -p es.pid
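Once the node runs with a pid file, it can be stopped gracefully by sending SIGTERM to the recorded pid. A minimal sketch (stop_node is a hypothetical helper written for illustration, not part of ES):

```shell
#!/bin/sh
# Stop a node using the pid file written via `bin/elasticsearch -d -p es.pid`.
stop_node() {
    pidfile="$1"
    [ -f "$pidfile" ] || { echo "pid file not found: $pidfile" >&2; return 1; }
    kill -TERM "$(cat "$pidfile")"   # SIGTERM triggers the ES shutdown hook
}
# usage: stop_node es.pid
```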

Two configuration files are parsed at this stage (jvm.options was already parsed by the startup script):

  • elasticsearch.yml #main configuration file
  • log4j2.properties #log configuration file

2.3 Load security configuration

What is the security configuration? Essentially it is ordinary configuration information, which would normally be written in a configuration file. However, the ES configuration files mentioned in earlier chapters are stored in plain text, and although the file system protects them with user permissions, that is not enough for sensitive information. Therefore ES encrypts such sensitive settings and stores them in a separate file: config/elasticsearch.keystore, and provides commands to view, add, and delete entries.
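The keystore is managed with the bundled elasticsearch-keystore tool (paths relative to the ES home directory; the secure setting name below is only an illustration):

```shell
bin/elasticsearch-keystore create    # create config/elasticsearch.keystore
bin/elasticsearch-keystore list      # view the settings it contains
bin/elasticsearch-keystore add xpack.security.http.ssl.keystore.secure_password
bin/elasticsearch-keystore remove xpack.security.http.ssl.keystore.secure_password
```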

What kind of configuration information is suitable for the security configuration file? For example, security-related settings in X-Pack, such as LDAP base_dn (comparable to the user name and password for logging in to a server).

2.4 Check the internal environment

The internal environment refers to the integrity and correctness of the ES software package itself. This includes:

  • Checking the Lucene version. Each ES version requires a specific Lucene version; checking it here prevents someone from replacing the jar with an incompatible one.
  • Checking for jar conflicts, and exiting the process if a conflict is found.

2.5 Check the external environment

The "node" in ES is encapsulated as a Node module when it is implemented. Call other internal components in the Node class, and provide startup and shutdown methods to the outside. The inspection of the external environment is performed in Node.start().

The external environment refers to JVM and operating system parameters at runtime; ES calls these checks "Bootstrap Checks". In early ES versions, ES detected unreasonable configurations, logged them, and kept running, but users often missed those logs. To avoid problems being discovered too late, ES now checks these important parameters during the startup phase and marks some performance-critical misconfigurations as errors, forcing users to pay attention to them.

All of these checks are individually encapsulated in the BootstrapChecks class. The current check items are as follows:

2.5.1 Heap size check

If the JVM initial heap size (Xms) differs from the maximum heap size (Xmx), the JVM may pause while resizing the heap at runtime. Therefore the two should be set to the same value.

If bootstrap.memory_lock is enabled, the JVM locks the initial heap size at startup. If the initial heap size differs from the maximum, then after the heap grows there is no guarantee that the entire JVM heap is locked in memory.

To pass this check, the heap size must be configured with Xms equal to Xmx.
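For example, a config/jvm.options fragment that satisfies the check (8g is an illustrative value; size the heap for your machine):

```
# config/jvm.options — initial and maximum heap set to the same value
-Xms8g
-Xmx8g
```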

2.5.2 File Descriptor Check

In UNIX-like systems, a "file" can be a regular physical file or a virtual file, and network sockets are file descriptors too. The ES process needs a large number of file descriptors: for example, each shard has many segments, each segment has many files, and the node also keeps many network connections to other nodes.

To pass this check, you need to adjust the system defaults. Under Linux, execute ulimit -n 65536 (valid only for the current session), or configure "* - nofile 65536" in the /etc/security/limits.conf file (valid for all users). Under Ubuntu, limits.conf is ignored by default, and the pam_limits.so module needs to be enabled.

Since Ubuntu versions are updated relatively quickly, and a production environment is not suitable for frequent upgrades, we recommend CentOS as the server operating system.

* soft nofile 131072
* hard nofile 131072

2.5.3 Memory Lock Check

ES can require the process to use only physical memory and avoid using the swap partition. In fact, we recommend disabling the operating system's swap partition entirely in production. The days when machines had so little memory that swapping to disk was necessary are behind us; for a server, swapping to disk when memory truly runs out causes more problems than it solves.

Enable the bootstrap.memory_lock option to let ES lock its memory. When this check is enabled and locking fails, the check fails and the node does not start.
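A minimal sketch of the related configuration, assuming the node runs as the user elasticsearch (the user name is an assumption; adjust it for your deployment):

```
# config/elasticsearch.yml — ask ES to lock the heap in memory
bootstrap.memory_lock: true

# /etc/security/limits.conf — allow the ES user to lock unlimited memory
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
```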

2.5.4 Maximum number of threads check

ES decomposes each request into tasks executed on individual nodes, and each stage runs in a different thread pool, so the ES process needs to create many threads. This check ensures that the ES process is allowed to create enough threads; it is performed only on Linux. You need to raise the maximum number of threads a process may create to at least 2048.

To pass this check, you can modify the nproc entries in the /etc/security/limits.conf file:

* soft nproc 131072
* hard nproc 131072

and in /etc/security/limits.d/90-nproc.conf:

* soft nproc 131072
* hard nproc 131072
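The thread-count check can be reproduced by hand with ulimit. A small shell sketch (2048 is the minimum mentioned above; the function name is ours, not ES code):

```shell
#!/bin/sh
# Mimic the "maximum number of threads" bootstrap check using ulimit.
check_max_threads() {
    min=2048
    current=$(ulimit -u)    # per-user limit on processes/threads
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$min" ]; then
        echo "max threads check passed: $current"
    else
        echo "max threads check failed: $current < $min"
    fi
}
check_max_threads
```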

2.5.5 Maximum virtual memory check

Lucene uses mmap to map part of the index to the process address space. The maximum virtual memory check ensures that the ES process has enough address space. This check is only performed on Linux.

To pass this check, you can modify the /etc/security/limits.conf file and set as (address space) to unlimited:

* soft as unlimited
* hard as unlimited

2.5.6 Maximum file size check

Segment files and transaction log files are stored on the local disk and can become very large. On an operating system with a maximum file size limit, this could cause writes to fail. It is recommended to set the maximum file size to unlimited.

To pass this check, you can modify the /etc/security/limits.conf file and set fsize to unlimited:

* soft fsize unlimited
* hard fsize unlimited

2.5.7 Checking the maximum number of virtual memory areas

The ES process needs to create a lot of memory mapped areas. This check is to ensure that the kernel allows at least 262144 memory mapped areas to be created. This check is only performed on Linux.

To pass this check, you can execute the following command (temporarily effective, invalid after restart):

sysctl -w vm.max_map_count=262144

Or add the line vm.max_map_count=262144 to the /etc/sysctl.conf file, and then execute the following command (effective immediately and permanently):

sysctl -p
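The same check can be scripted. A sketch (meaningful on Linux only; on other systems it simply reports failure):

```shell
#!/bin/sh
# Mimic the vm.max_map_count bootstrap check.
check_max_map_count() {
    required=262144
    current=$(sysctl -n vm.max_map_count 2>/dev/null \
              || cat /proc/sys/vm/max_map_count 2>/dev/null \
              || echo 0)
    if [ "$current" -ge "$required" ]; then
        echo "max_map_count check passed: $current"
    else
        echo "max_map_count check failed: $current < $required"
    fi
}
check_max_map_count
```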

2.5.8 OnError and OnOutOfMemoryError check

The JVM options OnError and OnOutOfMemoryError allow arbitrary commands to be executed when the JVM encounters a fatal error (OnError) or an OutOfMemoryError (OnOutOfMemoryError).

However, the ES system call filter (seccomp) is enabled by default, and it blocks fork. Therefore OnError and OnOutOfMemoryError are incompatible with the system call filter.

To pass this check, do not enable OnError or OnOutOfMemoryError; instead, upgrade to Java 8u92 or later and use ExitOnOutOfMemoryError.

This prevents an ES node from hanging in an unrecoverable state after an out-of-memory error and affecting the whole cluster: when an OOM occurs, the process shuts down and leaves the ES cluster (triggering an alarm), after which it can be restarted.
Add the JVM startup parameter in config/jvm.options:

-XX:+ExitOnOutOfMemoryError

2.6 Start internal modules

After the environment checks are completed, each submodule is started. The submodules are created in the Node class, and their respective start() methods are called at startup, for example:

  • discovery.start();
  • clusterService.start();
  • nodeConnectionsService.start();

The start() method of a submodule typically initializes internal data and creates and starts its thread pools.

2.7 Start keepalive thread

Calling the keepAliveThread.start() method starts the keepalive thread; the thread itself does no actual work. The main thread exits after executing the startup process. The keepalive thread is the only user (non-daemon) thread, and its role is to keep the process alive: a Java process exits when the number of user threads drops to zero.

3. Node shutdown process

Now we discuss the shutdown process of a single node. Imagine that when we update the configuration or upgrade the version of an ES cluster, we need to "kill" the ES process to shut down the node. But is the kill operation safe? What happens if the node is performing read and write operations at that moment? If the node is the master, what does the cluster do? How is the shutdown process implemented? What risks does killing a node bring?

The answer is: the ES process catches the SIGTERM signal (the default signal of the kill command), calls the stop method of each module, and gives them a chance to stop their services and exit safely.

  1. The master node is shut down

    If the master node is shut down during a cluster restart, the cluster re-elects a master, and for a short period the cluster has no master. If master nodes are deployed separately (dedicated master nodes), the new master can skip the gateway and recovery process once elected; otherwise the new master needs to reallocate the shards held by the old master: promoting replica shards to primaries and allocating new replicas.

  2. A data node is shut down

    If a data node is shut down, the TCP connections carrying read and write requests are also closed, and write operations fail from the client's point of view. However, if a write has already reached the Engine stage, it completes normally even though the client cannot see the result. If the client then retries using an auto-generated ID, the data will be duplicated.

In summary, the impact of a rolling upgrade is the interruption of in-flight write requests and the shard reallocation that a master restart may trigger. Promoting a new primary shard is generally fast, so the effect on the cluster's write availability is small.

While some primary shards of an index are unallocated, if writes continue with auto-generated IDs, a client retry after a failure may succeed (when the request lands on a primary shard that has been allocated), but this causes data skew between shards; the degree of skew depends on how long this situation lasts.

4. Close process analysis

During the node startup process, a shutdown hook is added in the Bootstrap#setup method. When the process receives SIGTERM (the default signal of kill) or SIGINT, the node shutdown process is invoked.
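Conceptually, the hook works like a shell trap handler: register a cleanup routine for the termination signals so the process can exit gracefully. A minimal shell analogy (the echoed messages are placeholders, not ES output):

```shell
#!/bin/sh
# Shell analogy of the ES shutdown hook: a cleanup routine registered
# for SIGTERM (kill default) and SIGINT runs before the process exits.
cleanup() {
    echo "stopping modules"     # analogous to calling each module's stop
    echo "releasing resources"  # analogous to closing each module
}
trap 'cleanup; exit 0' TERM INT
# usage: run this script, then `kill <pid>` to trigger cleanup before exit
```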

Each module's Service has doStop and doClose methods that handle its normal shutdown. The overall node shutdown logic lives in Node#close. The close method first calls each module's doStop, then traverses the modules again to execute doClose. The main implementation is as follows:

if (lifecycle.started()) {
    stop(); // call each module's doStop method
}
List<Closeable> toClose = new ArrayList<>();
// add the services that need to be closed to toClose, taking nodeService as an example
toClose.add(nodeService);
......
// call each module's doClose method
IOUtils.close(toClose);

There is a certain ordering among module shutdowns. Taking doStop as an example, the doStop methods of the modules are called in the order shown in the following table.

service                    description
ResourceWatcherService General Resource Monitoring Service
HttpServerTransport HTTP transmission service, providing REST interface service
SnapshotsService Snapshot service
SnapshotShardsService Responsible for starting and stopping shard-level snapshots
IndicesClusterStateService After receiving the cluster status information, process the index related operations
Discovery Cluster topology management
RoutingService Handling reroute (migrating shards between nodes)
ClusterService Cluster management service. Mainly handle cluster tasks and publish cluster status
NodeConnectionsService Node connection pipeline service
MonitorService Provide process level, system level, file system and JVM monitoring services
GatewayService Responsible for cluster metadata persistence and recovery
SearchService Processing search request
TransportService Underlying transport service
plugins All current plugins
IndicesService Responsible for index operations such as creating and deleting indexes

Taken together, the closing sequence is roughly as follows:

  • Close the snapshot service and HttpServer; no longer respond to user REST requests.
  • Close cluster topology management; no longer respond to ping requests.
  • Close the network module and take the node offline.
  • Execute the shutdown process of each plugin.
  • Close IndicesService.

IndicesService is closed last because it has the most resources to release and takes the longest.

5. Shutdown during shard read and write

The following analyzes what happens when a node is shut down while read and write operations are in progress.

5.1 Shutdown during a write

When a thread writes data, it acquires a lock on the Engine. The doStop method of IndicesService iterates over all the indices on this node and executes removeIndex on each. When the Engine's flushAndClose is executed (first flush, then close the Engine), it also acquires the Engine's write lock. Since the write operation holds a lock, flushAndClose waits until the write completes; the data-writing process is therefore not interrupted. But because the network module has been closed, the client connection is dropped: the client must handle the failure, even though the write on the ES side still runs to completion.

5.2 Shutdown during a read

When a thread reads data, it holds a read lock on the Engine. The write lock taken in flushAndClose waits for the read to complete. However, because the connection has been closed, the result cannot be sent back to the client, so the read fails from the client's perspective.

The original post illustrates the engine flushAndClose process with a figure (not reproduced here).

During the node shutdown process, the doStop of IndicesService sets a timeout for the Engine. If flushAndClose is still waiting, CountDownLatch.await gives up after a default timeout of one day and continues with the rest of the process.

6. The master node is shut down

When the master node is shut down, there is no special handling as one might imagine: the node executes the normal shutdown process. When the TransportService module is shut down, the cluster re-elects a new master. Therefore, during a rolling restart, there is a period of time in which the cluster has no master.

7. Summary

  1. Generally speaking, the node startup process performs initialization and checks. After each submodule starts, it asynchronously loads local data, elects a master, joins the cluster, and so on; these topics are introduced separately in the following chapters.
  2. When a node is shut down, in-flight writes have a chance to complete, although the client may not be notified of the result. Tasks not yet executed in the thread pools also get a chance to run within a certain timeout.

The time for cluster health to change from Red to Green is mainly spent restoring consistency between primary and replica shards. We can also choose to allow clients to write while cluster health is Yellow, at the cost of some data safety.

Summary of system configuration:

/etc/security/limits.conf
# file descriptor configuration
* soft nofile 131072
* hard nofile 131072

# maximum number of threads check
* soft nproc 131072
* hard nproc 131072
/etc/security/limits.d/90-nproc.conf
* soft nproc 1024

# maximum virtual memory check
* soft as unlimited
* hard as unlimited
# maximum file size check
* soft fsize unlimited
* hard fsize unlimited

# maximum number of virtual memory areas check
/etc/sysctl.conf
vm.max_map_count=262144

Origin blog.csdn.net/dwjf321/article/details/104649928