12. A Panorama of the Design and Implementation of Graceful Shutdown in Java Middleware


This series of Netty source code analysis articles is based on version 4.1.56.Final.

Summary of this article

In the previous article, the author described in detail how Netty handles connection closure and how it responds to the various scenarios that arise when a TCP connection is closed.

With the connection closed, it is time for Netty itself to take its curtain call. In this article, the author details the design and implementation of graceful shutdown across the middleware of the Java technology stack.

The author starts from the version releases and service offline scenarios that are common in daily development work, which lead to the need for graceful startup and shutdown of services. Starting from that need, we will explore step by step how graceful shutdown is designed in various middleware.

Readers familiar with the author's writing style know that the author never settles for a brief introduction: once a topic is raised, the past and present of the whole technical system must be explained clearly.

With that goal in mind, the author starts with the cornerstone that underpins graceful shutdown: the signal mechanism in the kernel.

From the kernel layer we move up to the JVM layer and explore the mechanism behind graceful shutdown there.

From the JVM layer we then work our way up to Spring and on to Dubbo. Along the way, the author will also walk you through troubleshooting a bug in Dubbo's graceful shutdown and describe the fix in detail.

Finally, graceful shutdown at the Dubbo layer leads us to the design and implementation of our protagonist, Netty's graceful shutdown:

[Figure: Overall flow of Reactor graceful shutdown]

Let's get started.

[Figure: Summary of this article]

1. Graceful start and stop of the Java process

In daily development work, the iteration and optimization of business requirements accompany the entire development cycle. After working overtime to finish the development, fighting through all kinds of difficulties in test verification, and surviving endless back-and-forth with the product manager, we finally arrive at the most exciting moment: the program is about to be deployed and go live.

         

   

[Figure: Mood swings when going online]

Deploying and launching a program inevitably involves shutting down and restarting online services, and there are many details to get right. The online service may be carrying production traffic and may be in the middle of important business processes.

For example, a user is purchasing a product and has already paid. If the release happens at exactly that moment and we simply shut down and restart the service, the user may have paid while the order has not been created or the product has not appeared in the user's purchase list. This causes real losses to the user, which is a very serious consequence.

To keep the business undamaged while a program goes live, graceful shutdown and graceful startup of online services are very, very important.

       

[Figure: It's important to be graceful]

1.1 Graceful start

A Java program generally gets faster the longer it runs, so in production a Java program is often much faster after running for a while than it is right after startup.

This is because the JVM keeps collecting runtime profiling data while the program runs, so that frequently executed code can be compiled to machine code just in time. The program then executes that machine code directly, at a speed comparable to C or C++ programs.

In addition, the classes used during execution are loaded and cached in the JVM, so using them again does not trigger on-demand loading that would hurt performance.

We can regard these points as a performance bonus that the JVM gives us, and the bonus disappears when the application is restarted. If the newly started program takes on the previous traffic volume right away, it goes straight into high load without that bonus, which may cause large-scale request timeouts and affect the business.

So starting a program gracefully is very important. The core idea of graceful startup is to keep the program from taking too much traffic right after it starts: let it run under low load for a while until it reaches its best running state, and then gradually let it take on more traffic.

Let's take a look at two technical solutions commonly used for graceful startup:

1.1.1 Start warm-up

Start-up warm-up means that a newly started application does not take on all of the previous traffic at once. Instead, traffic is shifted to it gradually within a time window. The purpose is to give the JVM time to collect runtime profiling data and compile frequently executed code to machine code just in time.

This technique appears in the implementation of many RPC frameworks. The service caller obtains the addresses of all service providers from the registration center and selects one of them through a load balancing algorithm to send the request to.

To give a newly started service provider time to warm up, we need to control the traffic at its source, the service caller. When initiating RPC calls, the caller should route as few requests as possible to newly started provider instances.

So how can the service caller tell which provider instances have just started?

After a service provider starts successfully, it registers its service information with the registration center. We can include the provider's actual start time in that information, so that when the registration center notifies the callers that a new provider instance has come online, it also tells them the start time.

The caller can then gradually increase the load balancing weight of the newly started provider instance according to its start time. This solves the provider's cold start problem: the caller ramps up requests to the provider within a time window, giving the new instance time to warm up and go online smoothly.
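The weight ramp described above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name WarmupWeight, the method effectiveWeight, and the linear ramp are hypothetical choices made here, not the implementation of any particular RPC framework.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: scales a provider's load balancing weight by its uptime. */
final class WarmupWeight {

    private WarmupWeight() {
    }

    /**
     * @param uptimeMs   time since the provider started, in milliseconds
     * @param warmupMs   length of the warm-up window, in milliseconds
     * @param fullWeight weight the provider gets once it is fully warmed up
     * @return the weight the caller should use for this provider right now
     */
    static int effectiveWeight(long uptimeMs, long warmupMs, int fullWeight) {
        if (fullWeight <= 0) {
            return 0;                               // provider is configured to take no traffic
        }
        if (warmupMs <= 0 || uptimeMs >= warmupMs) {
            return fullWeight;                      // warm-up finished (or disabled)
        }
        if (uptimeMs <= 0) {
            return 1;                               // just started: minimal share of traffic
        }
        // Linear ramp (an assumption): weight grows in proportion to uptime / warm-up window.
        long scaled = uptimeMs * fullWeight / warmupMs;
        return (int) Math.max(1, Math.min(scaled, fullWeight));
    }

    public static void main(String[] args) {
        long warmupMs = TimeUnit.MINUTES.toMillis(10);
        for (long minute = 0; minute <= 10; minute += 2) {
            long uptimeMs = TimeUnit.MINUTES.toMillis(minute);
            System.out.println(minute + " min -> weight "
                    + effectiveWeight(uptimeMs, warmupMs, 100));
        }
    }
}
```

The caller would compute uptimeMs as the current time minus the start time it received from the registration center, and feed the resulting weight into whatever weighted load balancing algorithm it already uses.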

1.1.2 Delayed exposure

Start-up warm-up achieves graceful startup from the service caller's side, by lowering the load balancing weight of a provider instance that has just started.

Delayed exposure approaches the problem from the service provider's side by postponing the moment at which the service is exposed. During the delay, the provider can preload the resources it depends on, such as cached data and the beans in the Spring container. Only after all these resources are loaded and in place do we expose the provider instance. This effectively reduces the probability of request processing errors in the early stage of startup.

For example, we can configure the delayed exposure time of the service in the dubbo application:

<!-- Delay service exposure by 5 seconds -->
<dubbo:service delay="5000" />
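The idea behind the delay can also be sketched in plain Java. This is a minimal sketch under assumptions, not how Dubbo implements the delay attribute: the class DelayedExporter and its methods are hypothetical names, and the Runnable passed in stands for whatever step actually exposes the service (for example, registering with the registration center).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: expose the service only after a delay, so resources can be preloaded first. */
final class DelayedExporter {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delayed-export");
                t.setDaemon(true);              // must not keep the JVM alive by itself
                return t;
            });
    private final CountDownLatch exposed = new CountDownLatch(1);

    /** Schedules the export step to run once, delayMs milliseconds from now. */
    void exportAfter(long delayMs, Runnable export) {
        scheduler.schedule(() -> {
            try {
                export.run();                   // e.g. register with the registration center
                exposed.countDown();            // mark as exposed only if the export succeeded
            } finally {
                scheduler.shutdown();           // the scheduler is needed only once
            }
        }, delayMs, TimeUnit.MILLISECONDS);
    }

    /** Returns true once the export step has completed successfully. */
    boolean isExposed() {
        return exposed.getCount() == 0;
    }

    /** Waits up to timeoutMs for the export to complete; returns false on timeout or interrupt. */
    boolean awaitExposed(long timeoutMs) {
        try {
            return exposed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

An application would call exportAfter(5000, ...) early in startup and use the following five seconds to load caches and initialize beans, so that the first request arrives only after those resources are in place.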

1.2 Graceful shutdown

Graceful shutdown involves far more issues and scenarios than graceful startup, because a service running normally in production is carrying production traffic and processing business at the same time.

Shutting down such a service gracefully while keeping the business undamaged is quite challenging. A good shutdown process lets our business go online and offline smoothly and avoids unnecessary extra operations work after a release.

Let's discuss the angles from which graceful shutdown should be approached:

1.2.1 Cut off traffic

The first step must be to cut off all traffic the program is currently taking, telling the service callers: I am about to shut down, please stop sending me requests. So how do we cut off the traffic?

In an RPC scenario, the service caller learns of providers coming online and going offline from the registration center through service discovery. Before a service provider shuts down, it must first unregister itself from the registration center. The registration center then notifies the service callers that the provider instance has gone offline and should be removed from their local cache lists. From then on, the callers' RPC calls are no longer routed to the offline provider instance.

There is a problem here, though. Registration centers are usually AP systems, which guarantee only eventual consistency, not real-time consistency. The caller may therefore learn that the provider is offline with some delay, and during that delay it is quite likely to send RPC requests to the provider that is going offline.

Because the provider has already entered its shutdown sequence, many of its objects may have been destroyed by then. A request that arrives at this point cannot be processed and may even cause an inexplicable exception, which has some impact on the business.

Since the problem comes from the possibly delayed notification from the registration center, a natural idea is to let the provider that is about to go offline notify its callers directly.

Combining active notification by the provider with passive notification by the registration center should cover almost every case.

This solution works in most scenarios, but one extreme case remains: during the short interval before the provider's offline notification reaches the caller, the caller may still send an RPC request to the provider. Handling this case requires cooperation between the provider and the caller.

When the provider is about to close, it first puts itself into the closing state, in which it accepts no more requests. If a request from the extreme case above arrives in this state, the provider throws a CloseException (an exception agreed on in advance between the provider and the caller). On receiving the CloseException, the caller removes that provider node, picks another node from the remaining ones through load balancing, and retries. Letting the request fail fast in this way keeps the business undamaged.

In the author's view, the combination of these three measures makes a fairly complete traffic cut-off scheme.
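The cooperation between provider and caller described above can be sketched as follows. Everything except the name CloseException, which comes from the text, is a hypothetical illustration: the Provider and Caller classes, the round-robin selection, and the in-process method call standing in for a real RPC are all assumptions. The sketch is single-threaded for brevity; a real caller would need thread-safe bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Exception agreed on in advance: the provider is shutting down and took no action. */
class CloseException extends RuntimeException {
    CloseException(String message) {
        super(message);
    }
}

/** Hypothetical provider that rejects requests once it has entered the closing state. */
class Provider {
    private final String name;
    private final AtomicBoolean closing = new AtomicBoolean(false);

    Provider(String name) {
        this.name = name;
    }

    /** First step of shutdown: stop accepting requests. */
    void startClosing() {
        closing.set(true);
    }

    String handle(String request) {
        if (closing.get()) {
            throw new CloseException(name + " is closing");   // fail fast, nothing was processed
        }
        return name + " handled " + request;
    }
}

/** Hypothetical caller: removes closing providers from its local list and retries elsewhere. */
class Caller {
    private final List<Provider> providers;
    private int next;                                          // round-robin cursor

    Caller(List<Provider> initial) {
        this.providers = new ArrayList<>(initial);
    }

    /** Invoked on a notification from the registration center or from the provider itself. */
    void onProviderOffline(Provider provider) {
        providers.remove(provider);
    }

    int providerCount() {
        return providers.size();
    }

    String invoke(String request) {
        while (!providers.isEmpty()) {
            Provider target = providers.get(Math.floorMod(next++, providers.size()));
            try {
                return target.handle(request);
            } catch (CloseException e) {
                providers.remove(target);                      // late notification: drop and retry
            }
        }
        throw new IllegalStateException("no provider available");
    }
}
```

Retrying is safe here only because a CloseException guarantees that the closing provider did not start processing the request. That guarantee is the reason for using a dedicated, pre-agreed exception instead of a generic failure.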

1.2.2 Try to ensure that the business is not damaged

After all traffic has been cut off, the service that is about to close may still be processing some business requests. We have to wait until all of them have been processed and their results returned to the clients before shutting the service down.

Of course, to keep the shutdown process under control, we need a shutdown timeout. If the remaining requests are not finished when the timeout expires, the service is forcibly closed.

Because the shutdown must remain controllable, we can only do our best to keep the business undamaged; we cannot guarantee it 100%. After the program goes live, we should therefore monitor for abnormal business data and repair it promptly.
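For requests that run on a thread pool, waiting with a timeout and then forcing the shutdown can be sketched with the standard java.util.concurrent API. The class name DrainingShutdown and the method shutdownGracefully are hypothetical, and the sketch assumes that in-flight business requests are tasks submitted to an ExecutorService.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: drain in-flight work, but never wait longer than the timeout. */
final class DrainingShutdown {

    private DrainingShutdown() {
    }

    /**
     * Stops accepting new tasks, waits up to timeoutMs for running ones, then forces shutdown.
     *
     * @return true if all tasks finished in time, false if a forced shutdown was needed
     */
    static boolean shutdownGracefully(ExecutorService workers, long timeoutMs) {
        workers.shutdown();                         // reject new tasks, keep running submitted ones
        try {
            if (workers.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;                        // everything finished within the timeout
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // preserve the interrupt and force shutdown
        }
        List<Runnable> neverStarted = workers.shutdownNow();   // interrupts running tasks
        System.err.println("forced shutdown, " + neverStarted.size() + " queued task(s) dropped");
        return false;
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        workers.execute(() -> System.out.println("processing an in-flight request"));
        boolean clean = shutdownGracefully(workers, 10_000);
        System.out.println(clean ? "all requests finished" : "timed out, forced shutdown");
    }
}
```

The false return value marks exactly the case the text warns about: some requests may not have completed, so the business data touched by them should be checked after the release.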


Through the graceful shutdown scheme introduced above, we know that when we are going to gracefully shut down an application, we need to do the following two things:

  1. First, cut off all production traffic carried by the application instance that is about to shut down, so that no new traffic reaches it.

  2. After all production traffic has been cut off, make sure the business requests the instance is still processing are completed and their results returned to the clients, so that the business is not damaged. To keep the shutdown process controllable, a shutdown timeout is needed as well.

These two tasks must be done when the application is about to close. So the question is: how do we know that the application is about to close? In other words, how can the application detect the shutdown of its own process and trigger the two graceful shutdown operations above?

Since the requirement exists, the operating system kernel naturally provides a mechanism for it. We can learn that the process is being shut down by catching the signal the operating system sends to the process, and trigger the graceful shutdown operations in the corresponding signal callback.

Next, let's take a look at the signal mechanism provided by the operating system kernel:

2. Kernel signal mechanism

Signals are a mechanism provided by the operating system kernel for communication between processes. The kernel can also use signals to notify a process of events that occur in the system, including the event that the process is being closed.

A signal is not represented by a complex data structure in the kernel. Each signal is simply identified by a number that has a corresponding symbolic name. Linux provides dozens of signals, each with a different meaning, and they are distinguished by their values.

A signal can be sent to a process at any time. The process can register a handler for a signal in advance; when the signal arrives, the registered handler is executed, and if no handler was registered, the kernel's default action for that signal is taken. This is like an emergency manual for the operating system: decide in advance what to do in each situation, and follow the manual when the situation actually occurs.

A signal sent by the kernel means that the system has encountered a certain situation, and the steps for dealing with it are encapsulated in the callback function of the corresponding signal.

The purpose of introducing the signal mechanism is to:

  • Let the application process know that a specific event has occurred (such as the shutdown event of the process).

  • Force the process to execute the signal processing function we set in advance (such as encapsulating graceful shutdown logic).

Generally speaking, once a program is started it keeps running until it hits an OOM or we need to release a new version, at which point the operations script calls the kill command to close it. Literally, the kill command kills a process, but in essence it sends a signal to the process, and that signal shuts the process down.

Let's use the kill -l command to see what signals the kill command can send to the process:

# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

The author here extracts a few common signals to briefly explain:

  • SIGINT: signal number 2. When we run a process in the foreground of a terminal, we can close it by pressing Ctrl+C, which sends the SIGINT signal to the process.

  • SIGQUIT: signal number 3. Pressing Ctrl+\ on a foreground process sends it the SIGQUIT signal. Unlike SIGINT, a process terminated by SIGQUIT saves its running state to a core dump file on exit, which can be examined later.

  • SIGKILL: signal number 9. Ending a process with kill -9 pid is very dangerous and should be firmly avoided, because SIGKILL can be neither caught nor ignored by the process: the kernel simply executes its default action and terminates the process immediately. Graceful shutdown depends on catching the signal so that the shutdown logic can run in the corresponding signal handler. Since SIGKILL cannot be caught, graceful shutdown is impossible. Check now whether your company's production operations scripts end processes with kill -9 pid, and avoid it if they do: it is an extremely ruthless way to close a process.

  • SIGSTOP: signal number 19. Like SIGKILL, this signal can be neither caught nor ignored by the application. It suspends the process rather than terminating it, so it cannot be used for graceful shutdown either. Note that pressing Ctrl+Z on a foreground process sends SIGTSTP (signal number 20), a related stop signal that, unlike SIGSTOP, can be caught.

  • SIGTERM: signal number 15. We usually use the kill command to close a process running in the background, and the default signal sent by kill is SIGTERM. It is the foundation of the graceful shutdown discussed in this article. We send SIGTERM to a background process with kill pid or kill -15 pid to shut it down gracefully. If your production operations scripts use kill -9 pid to end processes, replace it with kill pid immediately.

The signals listed above are the commonly used ones. You can also run man 7 signal to check the meaning of each signal:

Signal     Value     Action   Comment
──────────────────────────────────────────────────────────────────────
SIGHUP        1       Term    Hangup detected on controlling terminal
                              or death of controlling process
SIGINT        2       Term    Interrupt from keyboard
SIGQUIT       3       Core    Quit from keyboard
SIGILL        4       Core    Illegal Instruction
SIGABRT       6       Core    Abort signal from abort(3)
SIGFPE        8       Core    Floating point exception
SIGKILL       9       Term    Kill signal
SIGSEGV      11       Core    Invalid memory reference
SIGPIPE      13       Term    Broken pipe: write to pipe with no
                              readers
SIGALRM      14       Term    Timer signal from alarm(2)
SIGTERM      15       Term    Termination signal
SIGUSR1   30,10,16    Term    User-defined signal 1
SIGUSR2   31,12,17    Term    User-defined signal 2
……

An application process can handle a signal in one of three ways:

  • Default action defined by the kernel: The kernel specifies a default action for each signal. For example, Term in the Action column of the list above means terminating the process; this is the default action of SIGINT and SIGTERM introduced earlier. Core means the process is terminated and its running state is saved to a core dump file so that we can analyze the problem afterwards; this is the default action of SIGQUIT.

  • Catching the signal: The application can use a system call provided by the kernel to catch the signal and put the graceful shutdown steps into the corresponding signal handler. When SIGTERM is sent to the process, the process catches it and runs our custom handler, in which the graceful shutdown logic is executed.

  • Ignoring the signal: When we do not want to handle a signal, we can ignore it and do nothing. However, SIGKILL and SIGSTOP introduced earlier can be neither caught nor ignored: the kernel always executes their default actions directly.

When we do not want a signal to trigger the kernel's default action, we need to catch the signal in the process and register a callback function that runs our custom handling logic.

In the graceful shutdown scenario of this article, we do not want the process to execute the default action of SIGTERM and terminate immediately when the signal arrives. We therefore catch SIGTERM in the process and put the graceful shutdown steps into the corresponding signal handler.

2.1 How to capture the signal

After introducing the classification of kernel signals and the three ways in which processes process signals, let's see how to capture kernel signals and customize our processing logic in the corresponding signal callback function.

The kernel provides the sigaction system call for us to capture signals and bind them to corresponding signal processing functions.

int sigaction(int signum, const struct sigaction *act,
                     struct sigaction *oldact);
  • int signum: the signal we want to catch in the process. In this article we need to catch SIGTERM to implement graceful shutdown, so signum = 15.

  • struct sigaction *act: a sigaction structure that encapsulates our custom signal handling logic.

  • struct sigaction *oldact: if not NULL, the action previously associated with the signal is saved here. It is unrelated to the main line of this article.

The sigaction structure is used to encapsulate the processing function corresponding to the signal and the information for more fine-grained control of signal processing.

struct sigaction {
  __sighandler_t sa_handler;
  unsigned long sa_flags;
        .......
  sigset_t sa_mask; 
};
  • __sighandler_t sa_handler: essentially a function pointer that stores the handler we registered for the signal. The graceful shutdown logic is encapsulated here.

  • unsigned long sa_flags: a set of options that control signal handling behavior at a finer granularity. Common options are:

    • SA_ONESHOT: the registered handler takes effect only once. After it has run once, the signal is reset to its default behavior.

    • SA_NOMASK: the handler may be interrupted by the same signal while it is running. By default, the signal being handled is blocked until its handler returns. With SA_NOMASK, if the same signal arrives again during handling, the running handler is interrupted and the handler is entered again for the new signal. If identical signals can arrive in quick succession, the handler must therefore take measures such as synchronization and idempotence.

    • SA_INTERRUPT: when a process is in the middle of a time-consuming system call and receives a signal, the system call is interrupted and the process runs the corresponding handler. With SA_INTERRUPT set, the system call is not resumed after the handler returns. Instead it fails with -EINTR, telling the caller that the call was interrupted by a signal and leaving it to the caller to decide what to do.

    • SA_RESTART: when a system call is interrupted by a signal and the corresponding handler has run, the system call is restarted automatically if SA_RESTART is set.

  • sigset_t sa_mask: specifies which signals should be blocked while the handler is running. If the process receives a blocked signal in the meantime, the running handler is not interrupted.

Blocking does not mean the signal is lost. It is held pending, so that when the same signal arrives repeatedly, its handler runs one invocation after another instead of being re-entered.

Ultimately, the sigaction function invokes the rt_sigaction system call. rt_sigaction copies the user-space struct sigaction introduced above into a kernel-space k_sigaction and then calls do_sigaction.

do_sigaction then stores the signal the user wants to catch, together with its handler, in the process descriptor, the task_struct structure.

In the kernel, task_struct has a field sighand of type struct sighand_struct, and struct sighand_struct contains an array action[] of k_sigaction entries. Each k_sigaction entry holds the handler for one signal, and the array index corresponds to the signal being caught.

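#include <signal.h>
#include <stddef.h>

static void sig_handler(int signum) {
    if (signum == SIGTERM) {
        /* ... run the graceful shutdown logic ... */
    }
}

int main(void) {
    struct sigaction sa_usr;             // define the sigaction structure
    sigemptyset(&sa_usr.sa_mask);        // block no additional signals while the handler runs
    sa_usr.sa_flags = 0;
    sa_usr.sa_handler = sig_handler;     // set the signal handler

    sigaction(SIGTERM, &sa_usr, NULL);   // catch SIGTERM and register the handler

    /* ... rest of the program ... */
    return 0;
}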

With the simple sample code above, we register a custom handler for SIGTERM in the process. When we run kill -15 pid, the process catches SIGTERM and executes the graceful shutdown steps.

3. ShutdownHook in JVM

The signal mechanism described in section "2. Kernel signal mechanism" is the lowest-level support that the operating system kernel provides for graceful process shutdown. With this support from the kernel, the subject of this article, graceful shutdown of a Java process, is easy to implement.

To shut down a Java process gracefully, we only need to wrap the graceful shutdown operations in a Thread when the process starts and register that Thread as a ShutdownHook with the JVM. When the JVM process receives the kill -15 signal, the ShutdownHook we registered is run, and the graceful shutdown steps we defined are executed.

        Runtime.getRuntime().addShutdownHook(new Thread(){
            @Override
            public void run() {
                // ... perform the graceful shutdown steps ...
            }
        });

3.1 Several situations that cause the JVM to exit

  1. The last non-daemon thread in the JVM process exits.

  2. Calling java.lang.System#exit(int status) in program code causes the JVM process to exit and triggers the ShutdownHooks. A non-zero status indicates an abnormal shutdown, for example after an OOM or another runtime exception in the process.

public static void main(String[] args) {
        try {

           // ... application startup logic in main ...

        } catch (RuntimeException e) {
            logger.error(e.getMessage(), e);
            // the JVM process exits actively, which triggers the ShutdownHooks
            System.exit(1);
        }
}
  3. When the JVM process receives one of the shutdown signals introduced in section "2. Kernel signal mechanism", it shuts down. SIGKILL and SIGSTOP can be neither caught nor ignored, so they act on the JVM process directly and no ShutdownHook gets a chance to run. That is why we normally send SIGTERM: the JVM catches it and runs the ShutdownHooks we defined to complete the graceful shutdown.

  4. An error during the execution of a native method, such as accessing memory that does not exist, also forces the JVM to close, and the ShutdownHooks will not run.

3.2 Precautions for using ShutdownHook

  1. A ShutdownHook is essentially a Thread that has been initialized but not yet started. The ShutdownHooks registered through Runtime.getRuntime().addShutdownHook are started and run concurrently when the JVM shuts down, and their execution order is not guaranteed.

So when writing ShutdownHook logic, we should ensure thread safety and avoid deadlocks as far as possible. It is best to register only one ShutdownHook per JVM process.

  2. If finalization-on-exit has been enabled through java.lang.Runtime#runFinalizersOnExit(boolean value), then after all ShutdownHooks have run, the JVM invokes all finalizers that have not yet been called before it shuts down. Finalization-on-exit is off by default. Note that this method is deprecated and has been removed from recent JDK versions.

Note: While the JVM is shutting down and performing the operations above, daemon threads keep running. If the shutdown was initiated by calling java.lang.System#exit(int status), non-daemon threads also keep running during the shutdown.

  3. Once the JVM process has begun to shut down, the process generally cannot be interrupted unless the operating system forcibly terminates it or the user calls java.lang.Runtime#halt(int status) to force the shutdown.

   public void halt(int status) {
        SecurityManager sm = System.getSecurityManager();
        if (sm != null) {
            sm.checkExit(status);
        }
        Shutdown.halt(status);
    }

The java.lang.Runtime#halt(int status) method forcibly shuts down the running JVM process, so the ShutdownHooks we registered will not be run. If the JVM is already executing ShutdownHooks when this method is called, the process is terminated immediately without waiting for them to finish.

  4. Once the JVM shutdown has begun, new ShutdownHooks can no longer be registered and previously registered ones can no longer be removed; either attempt throws an IllegalStateException.

  5. The code in a ShutdownHook should complete its graceful shutdown logic as quickly as possible, because a user who calls System#exit expects the JVM to shut down promptly while the business is left undamaged. A ShutdownHook is therefore not the place for long-running tasks or for operations that interact with the user.

If the JVM is shutting down because the physical machine is shutting down, the operating system only gives the JVM a limited amount of time and forcibly terminates it once that time is up.
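One common way to keep a ShutdownHook short is to put an upper bound on how long it waits for in-flight work. The following is a minimal sketch under that assumption; the class BoundedShutdown and the method shutdownGracefully are hypothetical names, and the timeout value is up to the application.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for use inside a ShutdownHook.
final class BoundedShutdown {
    // Stops accepting new tasks, waits at most timeoutMillis for running
    // tasks, then interrupts whatever is left.
    // Returns true if all tasks finished within the timeout.
    static boolean shutdownGracefully(ExecutorService pool, long timeoutMillis) {
        pool.shutdown();
        try {
            if (pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
            pool.shutdownNow();
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A hook could then be registered as Runtime.getRuntime().addShutdownHook(new Thread(() -> BoundedShutdown.shutdownGracefully(pool, 3000))), so that the shutdown never waits longer than the chosen three seconds.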

  6. A ShutdownHook may also throw an exception. To the JVM a ShutdownHook is just a Thread, so an uncaught exception in a ShutdownHook is handled exactly as in any other thread: by calling the ThreadGroup#uncaughtException method. The default implementation of this method prints the exception's stack trace to System#err and terminates the ShutdownHook thread that threw it.

Note: This only stops the failing ShutdownHook. It does not affect the execution of the other ShutdownHook threads and does not cause the JVM to exit.
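The isolation between hooks can be illustrated without shutting down a JVM. The sketch below is a simulation, not JDK code: HookIsolationDemo is a hypothetical class that starts a few ordinary threads and then joins them, which is the same start-all and join-all pattern the JDK applies to application hooks. One of the threads throws, and the others still complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simulation of several hooks, one of which fails.
final class HookIsolationDemo {
    // Returns the names of the simulated hooks that ran to completion.
    static List<String> runLikeShutdownHooks() {
        List<String> completed = new CopyOnWriteArrayList<>();
        List<Thread> hooks = new ArrayList<>();
        hooks.add(new Thread(() -> completed.add("flush-logs"), "flush-logs"));
        hooks.add(new Thread(() -> {
            // The default uncaught-exception handling prints this stack
            // trace to System.err and ends only this thread.
            throw new IllegalStateException("simulated failure in a hook");
        }, "broken-hook"));
        hooks.add(new Thread(() -> completed.add("close-pool"), "close-pool"));

        for (Thread hook : hooks) {
            hook.start();
        }
        for (Thread hook : hooks) {
            try {
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return completed;
    }
}
```

The returned list contains "flush-logs" and "close-pool" in an unspecified order, while the failing thread only leaves a stack trace on System.err.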

  7. The last and very important point: SIGKILL and SIGSTOP cannot be caught by the JVM process, so when SIGKILL forcibly terminates it, no ShutdownHook is executed. The other situation in which the JVM is forced to close is an error while a native method is executing, such as an attempt to access memory that does not exist; the ShutdownHook will not run in that case either.

3.3 ShutdownHook execution principle

We add a shutdown hook to the JVM through Runtime.getRuntime().addShutdownHook. When the JVM receives the SIGTERM signal, it calls the ShutdownHooks we registered.

The ShutdownHook introduced in this section plays a role similar to the signal handler function introduced in section "2. Kernel Signal Mechanism".

A question naturally arises here. In the section on the kernel signal mechanism, we registered both the signal to be caught and its handler function with the kernel through the sigaction system call.

int sigaction(int signum, const struct sigaction *act,
                     struct sigaction *oldact);

In the JVM, however, we only registered a shutdown hook through Runtime.getRuntime().addShutdownHook; we never registered the signals that the JVM process needs to catch. So how does the JVM catch the shutdown signal?

        Runtime.getRuntime().addShutdownHook(new Thread(){
            @Override
            public void run() {
                // ... perform the graceful shutdown steps ...
            }
        });

In fact, the JDK already takes care of capturing operating system signals for us. At the user level we do not need to care how a signal is captured; we only need to write the logic that handles it.

Let's take a look at how the JDK registers the signals to be captured with the kernel.

After the first thread of the JVM is initialized, the System#initializeSystemClass function is called to initialize some system classes in the JDK. This includes registering the signals that the JVM process needs to capture together with their handler functions.

public final class System {

    private static void initializeSystemClass() {

            // ... omitted ...

            // Setup Java signal handlers for HUP, TERM, and INT (where available).
            Terminator.setup();

            // ... omitted ...

    }

}

As we can see, the JDK registers the kernel signals that the JVM needs to capture in the Terminator class.

class Terminator {
    // signal handler function
    private static SignalHandler handler = null;

    static void setup() {
        if (handler != null) return;
        SignalHandler sh = new SignalHandler() {
            public void handle(Signal sig) {
                Shutdown.exit(sig.getNumber() + 0200);
            }
        };
        handler = sh;

        try {
            Signal.handle(new Signal("HUP"), sh);
        } catch (IllegalArgumentException e) {
        }
        try {
            Signal.handle(new Signal("INT"), sh);
        } catch (IllegalArgumentException e) {
        }
        try {
            Signal.handle(new Signal("TERM"), sh);
        } catch (IllegalArgumentException e) {
        }
    }

}
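The expression sig.getNumber() + 0200 in the handler deserves a note: 0200 is an octal literal, which is 128 in decimal. The small sketch below only reproduces that arithmetic; the class SignalExitStatus and its method are hypothetical names, and the signal numbers are the ones quoted in this article.

```java
// Hypothetical helper that reproduces the status arithmetic of the handler.
final class SignalExitStatus {
    // 0200 is an octal literal, i.e. 128 in decimal.
    static final int SIGNAL_BASE = 0200;

    // Status value that the handler passes to Shutdown.exit for a signal.
    static int exitStatusFor(int signalNumber) {
        return signalNumber + SIGNAL_BASE;
    }
}
```

For SIGHUP(1), SIGINT(2) and SIGTERM(15) this yields 129, 130 and 143, which is why a JVM stopped with kill -15 usually reports exit status 143, assuming the shutdown sequence completes normally.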

The JDK provides the sun.misc.Signal#handle(Signal signal, SignalHandler signalHandler) function to capture kernel signals in the JVM process. Underneath, it relies on the sigaction system call introduced in section 2.

int sigaction(int signum, const struct sigaction *act,
                     struct sigaction *oldact);

The parameters of sun.misc.Signal#handle correspond one-to-one to the parameters of the sigaction system call:

  • Signal signal: the kernel signal to be captured. From here we can see that the JVM mainly captures three signals: SIGHUP(1), SIGINT(2), SIGTERM(15).

If the JVM receives a signal other than these three, the kernel's default action for that signal is carried out. Where that default action terminates the process, the process is closed directly and no ShutdownHook is triggered.

  • SignalHandler handler: the signal handler function. We can see that it calls the Shutdown#exit function directly.

    SignalHandler sh = new SignalHandler() {
            public void handle(Signal sig) {
                Shutdown.exit(sig.getNumber() + 0200);
            }
        };

It should be easy for us to guess that the call of ShutdownHook should be triggered in the Shutdown#exit function.

class Shutdown {

    static void exit(int status) {

          // ... omitted ...

          synchronized (Shutdown.class) {
              // start the JVM shutdown sequence and run the ShutdownHooks
              sequence();
              // forcibly shut down the JVM
              halt(status);
          }

    }

    private static void sequence() {
        synchronized (lock) {
            if (state != HOOKS) return;
        }
        // trigger the ShutdownHooks
        runHooks();
        boolean rfoe;
        synchronized (lock) {
            state = FINALIZERS;
            rfoe = runFinalizersOnExit;
        }
        // if runFinalizersOnExit is true,
        // run all finalizers that have not been invoked yet
        if (rfoe) runAllFinalizers();
    }
}

The logic in the Shutdown#sequence function is exactly the JVM shutdown behavior described in section "3.2 Precautions for using ShutdownHook": it triggers all ShutdownHooks, which then run concurrently. Note that the running order is not guaranteed.

After all ShutdownHooks have finished running, if finalization-on-exit was enabled through java.lang.Runtime#runFinalizersOnExit(boolean value), the JVM invokes all finalizers that have not yet been called before it shuts down. Finalization-on-exit is disabled by default.

3.4 Execution of ShutdownHook

        

                                shutdownhook running.png

As shown in the figure above, the Shutdown class of the JDK contains a Runnable[] hooks array with a capacity of 10. ShutdownHooks in the JDK are classified by type, and each slot of the hooks array stores one specific type of ShutdownHook.

The hooks we usually register in application code through Runtime.getRuntime().addShutdownHook are ShutdownHooks of the Application hooks type, which are stored in the slot with index 1 of the hooks array.

When Shutdown#sequence calls runHooks() to run all types of ShutdownHooks in the JVM, runHooks() traverses the Runnables in the hooks array in order and runs the ShutdownHooks wrapped in each Runnable.

When the traversal reaches the second slot (index 1) of the hooks array, the Application hooks type is run; that is, the ShutdownHooks we registered through Runtime.getRuntime().addShutdownHook start running at this point.

    // The system shutdown hooks are registered with a predefined slot.
    // The list of shutdown hooks is as follows:
    // (0) Console restore hook
    // (1) Application hooks
    // (2) DeleteOnExit hook
    private static final int MAX_SYSTEM_HOOKS = 10;
    private static final Runnable[] hooks = new Runnable[MAX_SYSTEM_HOOKS];

    /* Run all registered shutdown hooks
     */
    private static void runHooks() {
        for (int i=0; i < MAX_SYSTEM_HOOKS; i++) {
            try {
                Runnable hook;
                synchronized (lock) {
                    // acquire the lock to make sure the hook registered during
                    // shutdown is visible here.
                    currentRunningHook = i;
                    hook = hooks[i];
                }
                if (hook != null) hook.run();
            } catch(Throwable t) {
                if (t instanceof ThreadDeath) {
                    ThreadDeath td = (ThreadDeath)t;
                    throw td;
                }
            }
        }
    }
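The slot-based design can be reduced to a very small model. The sketch below is a simplified illustration and not the JDK implementation: SlotRunnerSketch is a hypothetical class, it uses no locking, and it swallows RuntimeException where the JDK code above catches Throwable. It shows the two properties that matter here: slots run in index order regardless of registration order, and a failing slot does not stop later slots.

```java
// Hypothetical, simplified model of a fixed array of hook slots.
final class SlotRunnerSketch {
    static final int MAX_SLOTS = 10;
    private final Runnable[] slots = new Runnable[MAX_SLOTS];

    // Each slot can be taken only once.
    void register(int slot, Runnable hook) {
        if (slots[slot] != null) {
            throw new IllegalStateException("slot " + slot + " already registered");
        }
        slots[slot] = hook;
    }

    // Runs the occupied slots in index order; a failing slot is ignored
    // so that later slots still run.
    void runAll() {
        for (Runnable hook : slots) {
            if (hook == null) {
                continue;
            }
            try {
                hook.run();
            } catch (RuntimeException ignored) {
                // simplified: the JDK code catches Throwable here
            }
        }
    }
}
```

Registering a runnable in slot 2 before one in slot 0 still runs slot 0 first, which mirrors why application hooks (index 1) always run after the console restore hook and before the DeleteOnExit hook.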

Next, let's look at how the JDK registers our custom ShutdownHook into the hooks array of the Shutdown class through Runtime.getRuntime().addShutdownHook.

3.5 Registration of ShutdownHook

public class Runtime {

    public void addShutdownHook(Thread hook) {
        SecurityManager sm = System.getSecurityManager();
        if (sm != null) {
            sm.checkPermission(new RuntimePermission("shutdownHooks"));
        }
        // note: what gets registered here are hooks of the Application type
        ApplicationShutdownHooks.add(hook);
    }

}

From the JDK source code we can see that Runtime#addShutdownHook stores our custom ShutdownHook in the ApplicationShutdownHooks class. As the name suggests, this class holds the user-implemented ShutdownHooks of the Application hooks type that we mentioned in section "3.4 Execution of ShutdownHook".

class ApplicationShutdownHooks {
    // holds the user-defined hooks of the Application type
    private static IdentityHashMap<Thread, Thread> hooks;

    static synchronized void add(Thread hook) {
        if(hooks == null)
            throw new IllegalStateException("Shutdown in progress");

        if (hook.isAlive())
            throw new IllegalArgumentException("Hook already running");

        if (hooks.containsKey(hook))
            throw new IllegalArgumentException("Hook previously registered");

        hooks.put(hook, hook);
    }

    static void runHooks() {
        Collection<Thread> threads;
        synchronized(ApplicationShutdownHooks.class) {
            threads = hooks.keySet();
            hooks = null;
        }
        // start the shutdown hooks one after another
        for (Thread hook : threads) {
            hook.start();
        }
        // the hooks now run concurrently; wait until all of them have finished
        for (Thread hook : threads) {
            try {
                hook.join();
            } catch (InterruptedException x) { }
        }
    }
}

The ApplicationShutdownHooks class has its own collection, IdentityHashMap<Thread, Thread> hooks, which is dedicated to storing the user-defined ShutdownHooks of the Application hooks type. They are added to this collection through the ApplicationShutdownHooks#add method.

The runHooks method then starts the ShutdownHook threads one by one so that they execute concurrently. Note that this runHooks method belongs to the ApplicationShutdownHooks class.
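The checks in ApplicationShutdownHooks#add can be observed through the public Runtime API. The sketch below is only a demonstration with hypothetical names (HookRegistrationRules and its two methods); it registers throwaway hooks and removes them again, so it leaves no hook behind.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo of two registration rules enforced by the JDK.
final class HookRegistrationRules {
    // Registers the same hook twice. Returns the message of the
    // IllegalArgumentException, or null if none was thrown.
    static String registerTwice() {
        Thread hook = new Thread(() -> { }, "demo-hook");
        Runtime runtime = Runtime.getRuntime();
        runtime.addShutdownHook(hook);
        try {
            runtime.addShutdownHook(hook);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        } finally {
            runtime.removeShutdownHook(hook);
        }
    }

    // Tries to register a thread that has already been started. Returns
    // the exception message, or null if the registration was accepted.
    static String registerRunningThread() {
        CountDownLatch release = new CountDownLatch(1);
        Thread running = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // nothing to clean up in this demo
            }
        }, "already-running");
        running.start();
        try {
            Runtime.getRuntime().addShutdownHook(running);
            Runtime.getRuntime().removeShutdownHook(running);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        } finally {
            release.countDown();
        }
    }
}
```

Judging from the source shown above, the two messages should be "Hook previously registered" and "Hook already running" respectively.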

In the static initializer static{.....} of the ApplicationShutdownHooks class, the runHooks method is wrapped in a Runnable and added to the hooks array of the Shutdown class. Note that the index passed to the Shutdown#add method is 1.

class ApplicationShutdownHooks {
    /* The set of registered hooks */
    private static IdentityHashMap<Thread, Thread> hooks;

    static {
        try {
            Shutdown.add(1 /* shutdown hook invocation order */,
                false /* not registered if shutdown in progress */,
                new Runnable() {
                    public void run() {
                        runHooks();
                    }
                }
            );
            hooks = new IdentityHashMap<>();
        } catch (IllegalStateException e) {
            // application shutdown hooks cannot be added if
            // shutdown is in progress.
            hooks = null;
        }
    }
}

                                        Shutdownhook execution.png

The logic of the Shutdown#add method is very simple:

class Shutdown {

    private static final int MAX_SYSTEM_HOOKS = 10;
    private static final Runnable[] hooks = new Runnable[MAX_SYSTEM_HOOKS];

    static void add(int slot, boolean registerShutdownInProgress, Runnable hook) {
        synchronized (lock) {
            if (hooks[slot] != null)
                throw new InternalError("Shutdown hook at slot " + slot + " already registered");

            if (!registerShutdownInProgress) {
                if (state > RUNNING)
                    throw new IllegalStateException("Shutdown in progress");
            } else {
                if (state > HOOKS || (state == HOOKS && slot <= currentRunningHook))
                    throw new IllegalStateException("Shutdown in progress");
            }

            hooks[slot] = hook;
        }
    }
}

  • The parameter Runnable hook is the Runnable that wraps the runHooks method in the static initializer static{....} of ApplicationShutdownHooks.

  • The parameter int slot indicates which slot of the hooks array the Runnable is placed in. Here we are registering ShutdownHooks of the Application hooks type, so the index is 1.

  • The parameter registerShutdownInProgress indicates whether ShutdownHooks may still be added after the JVM shutdown process has started. The value passed here is false, which disallows it: a registration attempt during shutdown throws an IllegalStateException. The author emphasized this point in section "3.2 Precautions for using ShutdownHook".

The above is a comprehensive introduction of how JVM captures operating system kernel signals, how to register ShutdownHook, and when to trigger the execution of ShutdownHook.

                                Shutdownhook complete trigger timing.png

After reading this, everyone should fully understand why the kill -9 pid command should not be used to stop a process. Now go and check the operations scripts of your company's production environment!


As the saying goes: talk is cheap, show me the code! After so much theory about graceful shutdown, you must be wondering how such a scheme is actually implemented.

Next, the author will explain in detail how graceful shutdown is achieved, from the perspective of the source code of several well-known frameworks.

4. Spring's graceful shutdown mechanism

In the previous two sections we started with the underlying kernel signal mechanism that supports graceful shutdown and then moved on to the ShutdownHook principle that realizes graceful shutdown of the JVM process. After this series of introductions, we now have a reasonable understanding of the mechanisms behind graceful shutdown at the kernel layer and at the JVM layer.

So how do we implement a graceful shutdown scheme on top of these mechanisms in a real Java application? In this section, let's get the answer from the Spring source code!

Before introducing the source code of Spring's graceful shutdown mechanism, the author will first review the callback mechanisms that Spring provides when its application context is closed. We can write the graceful shutdown logic of a Java application in these callbacks.

4.1 Publish the ContextClosedEvent event

When the Spring context starts to close, it first publishes the ContextClosedEvent event. Note that at this point the beans in the Spring container have not yet started to be destroyed, so we can perform graceful shutdown operations in the event callback.

@Component
public class ShutdownListener implements ApplicationListener<ContextClosedEvent> {
       @Override
       public void onApplicationEvent(ContextClosedEvent event) {
                  // ... graceful shutdown logic ...
       }
}

4.2 Callback before Bean destruction in Spring container

Before Spring starts to destroy the beans managed in the container, it calls back the postProcessBeforeDestruction method of every bean that implements the DestructionAwareBeanPostProcessor interface.

@Component
public class DestroyBeanPostProcessor implements DestructionAwareBeanPostProcessor {

    @Override
    public void postProcessBeforeDestruction(Object bean, String beanName) throws BeansException {

             // ... called back before a bean in the Spring container is destroyed ...
    }
}

4.3 Call back the methods annotated with @PreDestroy

@Component
public class Shutdown {
    @PreDestroy
    public void preDestroy() {
        // ... release resources ...
    }
}

4.4 Call back the destroy method in the DisposableBean interface

@Component
public class Shutdown implements DisposableBean{

    @Override
    public void destroy() throws Exception {
         // ... release resources ...
    }

}

4.5 Call back custom destruction method

<bean id="Shutdown" class="com.test.netty.Shutdown"  destroy-method="doDestroy"/>
public class Shutdown {

    public void doDestroy() {
        // ... custom destruction logic ...
    }
}

4.6 Implementation of Spring's elegant shutdown mechanism

A Spring application is ultimately a JVM process, so the Spring framework has to rely on the JVM ShutdownHook mechanism introduced in section 3 of this article to implement graceful shutdown.

When Spring starts, a ShutdownHook needs to be registered with the JVM. When we then execute the kill -15 pid command, Spring triggers the five callbacks described above from within that ShutdownHook.

Let's take a look at the registration logic of ShutdownHook in Spring:

4.6.1 Registration of ShutdownHook in Spring

public abstract class AbstractApplicationContext extends DefaultResourceLoader
  implements ConfigurableApplicationContext, DisposableBean {

 @Override
 public void registerShutdownHook() {
  if (this.shutdownHook == null) {
   // No shutdown hook registered yet.
   this.shutdownHook = new Thread() {
    @Override
    public void run() {
     synchronized (startupShutdownMonitor) {
      doClose();
     }
    }
   };
   Runtime.getRuntime().addShutdownHook(this.shutdownHook);
  }
 }
}

When Spring starts, the AbstractApplicationContext#registerShutdownHook method has to be called to register Spring's ShutdownHook with the JVM. From this source code we can see that Spring wraps the doClose() method in the ShutdownHook thread, and doClose() contains the logic of Spring's graceful shutdown.

It must be emphasized that in a pure Spring environment the framework does not call registerShutdownHook for us. We have to call registerShutdownHook manually to register the ShutdownHook with the JVM.

public class SpringShutdownHook {

    public static void main(String[] args) throws IOException {
        GenericApplicationContext context = new GenericApplicationContext();
        // ...
        // register the ShutdownHook
        context.registerShutdownHook();
        // ...
    }
}

In a SpringBoot environment, SpringBoot calls this method for us at startup to register the ShutdownHook, so no manual registration is needed.

public class SpringApplication {

 public ConfigurableApplicationContext run(String... args) {

                  // ... omitted ...

                  ConfigurableApplicationContext context = null;
                  context = createApplicationContext();
                  refreshContext(context);

                  // ... omitted ...
 }

 private void refreshContext(ConfigurableApplicationContext context) {
  refresh(context);
  if (this.registerShutdownHook) {
   try {
    context.registerShutdownHook();
   }
   catch (AccessControlException ex) {
    // Not allowed in some environments.
   }
  }
 }

}

4.6.2 Graceful shutdown logic in Spring

 protected void doClose() {
  // update the context state
  if (this.active.get() && this.closed.compareAndSet(false, true)) {
   if (logger.isInfoEnabled()) {
    logger.info("Closing " + this);
   }
   // unregister from JMX management
   LiveBeansView.unregisterApplicationContext(this);

   try {
    // publish the ContextClosedEvent event
    publishEvent(new ContextClosedEvent(this));
   }
   catch (Throwable ex) {
    logger.warn("Exception thrown from ApplicationListener handling ContextClosedEvent", ex);
   }

   // call back the stop methods of the Lifecycle beans
   if (this.lifecycleProcessor != null) {
    try {
     this.lifecycleProcessor.onClose();
    }
    catch (Throwable ex) {
     logger.warn("Exception thrown from LifecycleProcessor on context close", ex);
    }
   }

   // destroy the beans, which triggers the callbacks introduced earlier
   destroyBeans();

   // Close the state of this context itself.
   closeBeanFactory();

   // Let subclasses do some final clean-up if they wish...
   onClose();

   // Switch to inactive.
   this.active.set(false);
  }
 }

Here we can see that the five callbacks introduced at the beginning of this section are finally triggered in the AbstractApplicationContext#doClose method:

  1. Publish the ContextClosedEvent event. Note that this is a synchronous event: the Spring ShutdownHook thread that publishes the event also executes the event listeners itself, synchronously. Only after the event has been handled does it go on to call destroyBeans() to destroy the beans in the IoC container.

So operations related to graceful shutdown can safely be performed in a ContextClosedEvent listener, because the beans in the Spring container have not been destroyed yet.

  2. The remaining four callbacks are triggered in sequence in the destroyBeans() method.
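The ordering described in these two steps can be modeled without any framework. The sketch below is a simplified, hypothetical model and not Spring code: MiniContext, onClosed and addBean are invented names, and the order of destruction between different beans is not modeled. It only shows that the close listeners run synchronously on the closing thread before any destruction callback, and that closing twice has no further effect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, framework-free model of the close order described above.
final class MiniContext {
    private final List<Runnable> closedListeners = new ArrayList<>();
    private final List<Runnable[]> beans = new ArrayList<>();
    private boolean closed;

    // Stands in for a listener of the context-closed event.
    void onClosed(Runnable listener) {
        closedListeners.add(listener);
    }

    // A bean contributes its destruction callbacks in the order listed in
    // this article: post-processor callback, @PreDestroy method,
    // DisposableBean#destroy, custom destroy-method.
    void addBean(Runnable... destructionCallbacks) {
        beans.add(destructionCallbacks);
    }

    synchronized void close() {
        if (closed) {
            return; // closing is a one-time operation
        }
        closed = true;
        // Step 1: listeners run synchronously on the closing thread.
        for (Runnable listener : closedListeners) {
            listener.run();
        }
        // Step 2: only afterwards are the beans destroyed.
        for (Runnable[] callbacks : beans) {
            for (Runnable callback : callbacks) {
                callback.run();
            }
        }
    }
}
```

Because step 1 finishes before step 2 starts, code in a close listener can still use every bean, which is the property Dubbo relies on in the next section.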

Finally, combined with the content introduced in the previous section, the entire graceful shutdown process of Spring is summarized as shown in the following figure:

                                        Spring graceful shutdown mechanism.png

5. Graceful shutdown of Dubbo

Part of the graceful shutdown source code in this section is based on Apache Dubbo 2.7.7. The graceful shutdown in this version has a bug; let's track it down together!

In the previous sections we went from the underlying support provided by the kernel to the JVM's ShutdownHook, and from the JVM to the graceful shutdown mechanism of the Spring framework.

With this background, we can now look at the graceful shutdown implementation in Dubbo. Since almost all Java applications now use Spring as their development framework, Dubbo is generally integrated into Spring, and its graceful shutdown is closely tied to Spring's.

5.1 Graceful shutdown of Dubbo in Spring environment

From section "4. Spring's graceful shutdown mechanism" we know that during Spring's graceful shutdown the ShutdownHook thread first publishes the ContextClosedEvent event, which is a synchronous event: after publishing it, the ShutdownHook thread executes the event listeners itself. Once the listeners have handled the ContextClosedEvent, destroyBeans() is executed and the remaining four callbacks are triggered in turn to destroy the beans in the IoC container.

        

                                Spring graceful shutdown process.png

Since the key beans that Dubbo depends on have not yet been destroyed when the ContextClosedEvent is handled, Dubbo defines a DubboBootstrapApplicationListener that listens for the ContextClosedEvent and calls dubboBootstrap.stop() in its onContextClosedEvent method to start Dubbo's graceful shutdown process.

public class DubboBootstrapApplicationListener extends OneTimeExecutionApplicationContextEventListener
        implements Ordered {

    @Override
    public void onApplicationContextEvent(ApplicationContextEvent event) {
        // Spring events are synchronous: publishEvent and the event handling run on the same thread
        if (event instanceof ContextRefreshedEvent) {
            onContextRefreshedEvent((ContextRefreshedEvent) event);
        } else if (event instanceof ContextClosedEvent) {
            onContextClosedEvent((ContextClosedEvent) event);
        }
    }

    private void onContextClosedEvent(ContextClosedEvent event) {
        // in its ShutdownHook, Spring first publishes ContextClosedEvent and only then destroys the Spring beans,
        // so the Spring beans Dubbo depends on are still alive when Dubbo starts its graceful shutdown
        dubboBootstrap.stop();
    }

}

When the service provider ServiceBean and the service consumer ReferenceBean are initialized, DubboBootstrapApplicationListener is registered in the Spring container and starts listening for ContextClosedEvent and ContextRefreshedEvent.

public class ServiceClassPostProcessor implements BeanDefinitionRegistryPostProcessor, EnvironmentAware,
        ResourceLoaderAware, BeanClassLoaderAware {

    @Override
    public void postProcessBeanDefinitionRegistry(BeanDefinitionRegistry registry) throws BeansException {

        // @since 2.7.5: register the listener for Spring startup and shutdown events;
        // its event callbacks call DubboBootstrap's start and stop to start and stop the Dubbo application
        registerBeans(registry, DubboBootstrapApplicationListener.class);

        // ... omitted ...
    }
}

5.2 Introduction to Dubbo graceful shutdown process

Since the theme of this article is the overall process of graceful shutdown, only the main flow of Dubbo's graceful shutdown is outlined here; the details will be covered in a follow-up Dubbo source code analysis series. To keep this article from diverging too much, we stay on the main line of the process.

public class DubboBootstrap extends GenericEventListener {

    public DubboBootstrap stop() throws IllegalStateException {
        destroy();
        return this;
    }

}

The core logic here comes down to the two graceful shutdown themes introduced in section "1.2 Graceful Closure":

  • Cut off the existing production traffic to application instances that are shutting down.

  • Guarantee that the business is not damaged.

Here you only need to understand the main process of Dubbo's graceful shutdown. I will introduce the relevant details in a special article later.

    public void destroy() {
        if (destroyLock.tryLock()) {
            try {
                DubboShutdownHook.destroyAll();

                if (started.compareAndSet(true, false)
                        && destroyed.compareAndSet(false, true)) {

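                    // unregister the service instance
                    unregisterServiceInstance();
                    // unexport the metadata service
                    unexportMetadataService();
                    // stop exporting services
                    unexportServices();
                    // unsubscribe from the referenced services
                    unreferServices();
                    // destroy the registries
                    destroyRegistries();
                    // close the services
                    DubboShutdownHook.destroyProtocols();
                    // destroy the registry client instances
                    destroyServiceDiscoveries();
                    // clear the application configuration and the related application models
                    clear();
                    // shut down the thread pools
                    shutdown();
                    // release resources
                    release();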
                    release();
                }
            } finally {
                destroyLock.unlock();
            }
        }
    }

As shown above, Dubbo's graceful shutdown relies on the publication of Spring's ContextClosedEvent, and the publication of ContextClosedEvent in turn relies on the registration of Spring's ShutdownHook.

                                        Dubbo spring environment graceful shutdown.png
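The guard at the top of the destroy() method shown earlier, a tryLock followed by two compare-and-set checks, is what keeps the cleanup from running twice when destroy() is reached from more than one path. The sketch below isolates that pattern. It is a minimal illustration and not Dubbo code: OneShotDestroy is a hypothetical class, and a counter stands in for the real cleanup steps.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class isolating the "destroy at most once" guard.
final class OneShotDestroy {
    private final ReentrantLock destroyLock = new ReentrantLock();
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final AtomicBoolean destroyed = new AtomicBoolean(false);
    private final AtomicInteger cleanupRuns = new AtomicInteger();

    void start() {
        started.set(true);
    }

    void destroy() {
        // A concurrent caller skips the cleanup instead of waiting for the lock.
        if (destroyLock.tryLock()) {
            try {
                // Only a started, not yet destroyed instance is cleaned up.
                if (started.compareAndSet(true, false)
                        && destroyed.compareAndSet(false, true)) {
                    cleanupRuns.incrementAndGet(); // stands in for the cleanup steps
                }
            } finally {
                destroyLock.unlock();
            }
        }
    }

    int cleanupRuns() {
        return cleanupRuns.get();
    }
}
```

Calling destroy() before start() does nothing, and calling it any number of times after start() runs the cleanup exactly once.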

From section "4.6.1 Registration of ShutdownHook in Spring" we know that in a SpringBoot environment, SpringBoot calls the ApplicationContext#registerShutdownHook method for us at startup to register the ShutdownHook, so no manual registration is needed.

In a pure Spring environment, the framework does not call registerShutdownHook for us; we have to call it manually to register the ShutdownHook with the JVM.

Therefore, to make graceful shutdown work in both the SpringBoot environment and the pure Spring environment, Dubbo introduces the SpringExtensionFactory class: as long as Dubbo runs in a Spring environment, it calls registerShutdownHook to register Spring's ShutdownHook with the JVM.

public class SpringExtensionFactory implements ExtensionFactory {
    private static final Logger logger = LoggerFactory.getLogger(SpringExtensionFactory.class);

    private static final Set<ApplicationContext> CONTEXTS = new ConcurrentHashSet<ApplicationContext>();

    public static void addApplicationContext(ApplicationContext context) {
        CONTEXTS.add(context);
        if (context instanceof ConfigurableApplicationContext) {
            // register the ShutdownHook once Spring has started successfully (for non-SpringBoot environments)
            ((ConfigurableApplicationContext) context).registerShutdownHook();
        }
    }

}

When the service provider ServiceBean and the service consumer ReferenceBean are initialized, they call the SpringExtensionFactory#addApplicationContext method to register the ShutdownHook.

public class ServiceBean<T> extends ServiceConfig<T> implements InitializingBean, DisposableBean,
        ApplicationContextAware, BeanNameAware, ApplicationEventPublisherAware {

   @Override
    public void setApplicationContext(ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
        SpringExtensionFactory.addApplicationContext(applicationContext);
    }

}

public class ReferenceBean<T> extends ReferenceConfig<T> implements FactoryBean,
        ApplicationContextAware, InitializingBean, DisposableBean {

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
        SpringExtensionFactory.addApplicationContext(applicationContext);
    }

}

That is the whole process of Dubbo's graceful shutdown when it is integrated with Spring. Next, let's look at Dubbo's graceful shutdown in a non-Spring environment.

5.3 Graceful shutdown of Dubbo in a non-Spring environment

In the previous section we saw that in a Spring environment Dubbo relies on Spring's ShutdownHook and triggers its graceful shutdown process by listening for the ContextClosedEvent.

In a non-Spring environment, Dubbo has to define its own ShutdownHook. It therefore introduces DubboShutdownHook, which runs the graceful shutdown process directly in Dubbo's own ShutdownHook.

public class DubboBootstrap extends GenericEventListener {

    private DubboBootstrap() {
        configManager = ApplicationModel.getConfigManager();
        environment = ApplicationModel.getEnvironment();

        DubboShutdownHook.getDubboShutdownHook().register();
        ShutdownHookCallbacks.INSTANCE.addCallback(new ShutdownHookCallback() {
            @Override
            public void callback() throws Throwable {
                DubboBootstrap.this.destroy();
            }
        });
    }

}
public class DubboShutdownHook extends Thread {

   public void register() {
        if (registered.compareAndSet(false, true)) {
            DubboShutdownHook dubboShutdownHook = getDubboShutdownHook();
            Runtime.getRuntime().addShutdownHook(dubboShutdownHook);
            dispatch(new DubboShutdownHookRegisteredEvent(dubboShutdownHook));
        }
    }

    @Override
    public void run() {
        if (logger.isInfoEnabled()) {
            logger.info("Run shutdown hook now.");
        }

        callback();
        doDestroy();
    }

   private void callback() {
        callbacks.callback();
    }

}

From the source code we can see that when a Dubbo application receives the signal sent by kill -15 pid, the JVM catches SIGTERM(15) and starts the DubboShutdownHook thread. Its callback() then invokes the DubboBootstrap#destroy method introduced in the previous section, which encapsulates Dubbo's entire graceful shutdown logic.

                                                Dubbo graceful shutdown process in non-Spring environment.png

public class DubboBootstrap extends GenericEventListener {

    public void destroy() {
        if (destroyLock.tryLock()) {
            try {
                DubboShutdownHook.destroyAll();

                if (started.compareAndSet(true, false)
                        && destroyed.compareAndSet(false, true)) {

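                    // ........ unregister ........

                    // ........ unexport the metadata service ........

                    // ........ stop exporting services ........

                    // ........ unsubscribe from services ........

                    // ........ deregister from the registries ........

                    // ........ close the services ........

                    // ........ destroy the registry client instances ........

                    // ........ clear the application config classes and the related application model ........

                    // ........ shut down the thread pools ........

                    // ........ release resources ........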
                 
                }
            } finally {
                destroyLock.unlock();
            }
        }
    }

}
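To make the hook-plus-callback mechanism of this section concrete, here is a minimal JDK-only sketch. It is not Dubbo's code: the class CallbackShutdownHook and its methods are hypothetical stand-ins for DubboShutdownHook and ShutdownHookCallbacks, and the sketch only assumes that the destroy logic is handed to the hook as a callback.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for DubboShutdownHook + ShutdownHookCallbacks (not Dubbo's code)
final class CallbackShutdownHook extends Thread {

    private final AtomicBoolean registered = new AtomicBoolean(false);
    private final List<Runnable> callbacks = new CopyOnWriteArrayList<>();

    CallbackShutdownHook() {
        super("callback-shutdown-hook");
    }

    void addCallback(Runnable callback) {
        callbacks.add(callback);
    }

    // The CAS guard hands the hook to the JVM at most once
    boolean register() {
        if (registered.compareAndSet(false, true)) {
            Runtime.getRuntime().addShutdownHook(this);
            return true;
        }
        return false;
    }

    // Started by the JVM when it shuts down (for example on SIGTERM); runs every callback in order
    @Override
    public void run() {
        for (Runnable callback : callbacks) {
            callback.run();
        }
    }

    public static void main(String[] args) {
        CallbackShutdownHook hook = new CallbackShutdownHook();
        hook.addCallback(() -> System.out.println("destroy() invoked by the shutdown hook"));
        hook.register();
        System.out.println("main returns, the JVM starts the hook");
    }
}
```

Running main and letting the process exit (or sending it kill -15) starts the hook thread, which in turn runs the registered callback. This is the same division of labour as between DubboShutdownHook#run and DubboBootstrap#destroy.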

5.4 Aha! Bug!

Sections "5.1 Dubbo's Graceful Shutdown in Spring Environment" and "5.3 Graceful shutdown of Dubbo in a non-Spring environment" introduced the graceful shutdown schemes for the two environments. Each scheme works fine when it runs on its own in its respective scenario.

But when the two schemes are combined and run together, there is a big problem~~~

Recall what the author specifically emphasized in the section "3.2 Precautions for Using ShutdownHook":

  • A ShutdownHook is essentially a Thread that has been initialized but not yet started. The ShutdownHooks registered through Runtime.getRuntime().addShutdownHook are started and executed concurrently when the JVM process shuts down, but their execution order is not guaranteed.

So when writing the logic in a ShutdownHook, we should ensure thread safety and avoid deadlocks as far as possible. It is best to register only one ShutdownHook per JVM process.

                                        Dubbo's graceful shutdown bug in the Spring environment.png

So now two ShutdownHook threads are registered in the JVM: Spring's ShutdownHook and Dubbo's ShutdownHook. What problem does this cause?

From the earlier sections we know that both the ContextClosedEvent published in Spring's ShutdownHook and the callback executed in Dubbo's ShutdownHook end up calling the DubboBootstrap#destroy method, which executes the real graceful shutdown logic.

public class DubboBootstrap extends GenericEventListener {

    private final Lock destroyLock = new ReentrantLock();

    public void destroy() {
        if (destroyLock.tryLock()) {
            try {
                DubboShutdownHook.destroyAll();

                if (started.compareAndSet(true, false)
                        && destroyed.compareAndSet(false, true)) {
                    
                        // ....... graceful shutdown of the Dubbo application .......
                 
                }
            } finally {
                destroyLock.unlock();
            }
        }
    }

}

Imagine this scenario: Spring's ShutdownHook thread and Dubbo's ShutdownHook thread run at the same time, reach the DubboBootstrap#destroy method together, and compete for destroyLock.

  • Dubbo's ShutdownHook thread obtains the destroyLock and enters the body of the destroy() method to execute the graceful shutdown logic.

  • Spring's ShutdownHook thread does not get the destroyLock and exits the destroy() method.

                                                Dubbo graceful shutdown bug.png

After Spring's ShutdownHook thread returns from the destroy() method, it goes on to execute destroyBeans() to destroy the beans in the IoC container. This inevitably includes some key business beans, such as the database connection pool and Dubbo's core beans.

Meanwhile, Dubbo's ShutdownHook thread is executing the graceful shutdown logic. As mentioned in the section "1.2 Graceful Shutdown", a graceful shutdown must not damage the business, so the in-flight requests still have to be processed and their results returned to the clients. But by now some of the key beans that the business depends on, such as the database connection pool, have already been destroyed, so database operations throw CannotGetJdbcConnectionException. The graceful shutdown fails and the business is affected.
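The core of the problem is that tryLock() does not make the losing thread wait. The following JDK-only sketch models just that aspect. It is a simplified illustration, not Dubbo's or Spring's code: DestroyRaceModel and its method names are made up, "Spring" and "Dubbo" are only labels for the two threads, and latches are used to force the interleaving described above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the race; the names are illustrative, not Dubbo's or Spring's code
final class DestroyRaceModel {

    private final Lock destroyLock = new ReentrantLock();

    // Same shape as DubboBootstrap#destroy: the loser of tryLock() returns at once
    boolean destroy(Runnable gracefulShutdown) {
        if (destroyLock.tryLock()) {
            try {
                gracefulShutdown.run();
                return true;
            } finally {
                destroyLock.unlock();
            }
        }
        return false; // does NOT wait for the winner to finish
    }

    // Returns true if the "Spring" caller got out of destroy() while the
    // "Dubbo" hook was still in the middle of its graceful shutdown
    static boolean springReturnsWhileDubboStillRunning() {
        DestroyRaceModel bootstrap = new DestroyRaceModel();
        CountDownLatch dubboInside = new CountDownLatch(1);
        CountDownLatch releaseDubbo = new CountDownLatch(1);

        Thread dubboHook = new Thread(() -> bootstrap.destroy(() -> {
            dubboInside.countDown();
            awaitQuietly(releaseDubbo); // pretend in-flight requests are still being served
        }), "dubbo-hook");
        dubboHook.start();
        awaitQuietly(dubboInside);

        // The "Spring" hook arrives second: tryLock() fails and destroy() returns immediately
        boolean springRanShutdown = bootstrap.destroy(() -> { });
        // At this point Spring would already move on to destroyBeans()
        boolean dubboStillRunning = dubboHook.isAlive();

        releaseDubbo.countDown();
        return !springRanShutdown && dubboStillRunning;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In this model springReturnsWhileDubboStillRunning() reports that the second caller left destroy() while the first was still busy, which is exactly the window in which the beans get destroyed underneath the ongoing graceful shutdown.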

5.5 Bug fixes

The bug was finally fixed in Apache Dubbo 2.7.15.

For details, see the issue: https://github.com/apache/dubbo/issues/7093

From the analysis in the previous section, we know that this bug is caused by Spring's ShutdownHook thread and Dubbo's ShutdownHook thread executing concurrently.

So in a Spring environment, we simply unregister Dubbo's ShutdownHook.

public class SpringExtensionFactory implements ExtensionFactory {
    private static final Logger logger = LoggerFactory.getLogger(SpringExtensionFactory.class);

    private static final Set<ApplicationContext> CONTEXTS = new ConcurrentHashSet<ApplicationContext>();

    public static void addApplicationContext(ApplicationContext context) {
        CONTEXTS.add(context);
        if (context instanceof ConfigurableApplicationContext) {
            // Register Spring's ShutdownHook
            ((ConfigurableApplicationContext) context).registerShutdownHook();
            // In a Spring environment, cancel Dubbo's ShutdownHook
            DubboShutdownHook.getDubboShutdownHook().unregister();
        }
    }
}

In a non-Spring environment, we still retain Dubbo's ShutdownHook.

public class DubboBootstrap {

    private DubboBootstrap() {
        configManager = ApplicationModel.getConfigManager();
        environment = ApplicationModel.getEnvironment();

        DubboShutdownHook.getDubboShutdownHook().register();
        ShutdownHookCallbacks.INSTANCE.addCallback(DubboBootstrap.this::destroy);
    }

}
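The listings above do not show what unregister() does. One plausible way to build it on the JDK is Runtime#removeShutdownHook, as in the sketch below. This is an assumption for illustration, not a copy of Dubbo's implementation; RemovableShutdownHook is a hypothetical class.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative hook with register/unregister; not Dubbo's actual implementation
final class RemovableShutdownHook extends Thread {

    private final AtomicBoolean registered = new AtomicBoolean(false);

    RemovableShutdownHook(Runnable action) {
        super(action, "removable-shutdown-hook");
    }

    // Hands the hook to the JVM at most once
    boolean register() {
        if (registered.compareAndSet(false, true)) {
            Runtime.getRuntime().addShutdownHook(this);
            return true;
        }
        return false;
    }

    // Takes the hook back from the JVM, so that it no longer runs at shutdown
    boolean unregister() {
        if (registered.compareAndSet(true, false)) {
            // removeShutdownHook returns true only if the hook was previously registered
            return Runtime.getRuntime().removeShutdownHook(this);
        }
        return false;
    }
}
```

Note that Runtime#removeShutdownHook throws IllegalStateException once the JVM is already shutting down, so in this sketch the unregistration has to happen during startup, as it does in addApplicationContext.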

The above is the main flow of Dubbo's graceful shutdown, together with the cause of the graceful shutdown bug and its fix.

During Dubbo's graceful shutdown, DubboShutdownHook.destroyProtocols() is eventually called to shut down the underlying services.

public class DubboBootstrap extends GenericEventListener {

    private final Lock destroyLock = new ReentrantLock();

    public void destroy() {
        if (destroyLock.tryLock()) {
            try {
                DubboShutdownHook.destroyAll();

                if (started.compareAndSet(true, false)
                        && destroyed.compareAndSet(false, true)) {
                    
                        // ....... graceful shutdown of the Dubbo application .......
                    // Close the services
                    DubboShutdownHook.destroyProtocols();

                        // ....... graceful shutdown of the Dubbo application .......

                }
            } finally {
                destroyLock.unlock();
            }
        }
    }

}

While the Dubbo services are being destroyed, the underlying Netty server is closed by calling server.close.

public class DubboProtocol extends AbstractProtocol {

   @Override
    public void destroy() {
        for (String key : new ArrayList<>(serverMap.keySet())) {
            ProtocolServer protocolServer = serverMap.remove(key);
            RemotingServer server = protocolServer.getRemotingServer();
            server.close(ConfigurationUtils.getServerShutdownTimeout());
            // ........... omitted ...........
        }

        // ........... omitted ...........
    }

}

This eventually triggers Netty's graceful shutdown.

public class NettyServer extends AbstractServer implements RemotingServer {

    @Override
    protected void doClose() throws Throwable {
        // .......... close the underlying Channel ......
        try {
            if (bootstrap != null) {
                // Shut down Netty's master and slave Reactor thread groups
                bossGroup.shutdownGracefully();
                workerGroup.shutdownGracefully();
            }
        } catch (Throwable e) {
            logger.warn(e.getMessage(), e);
        }
        // ......... clear the cached Channel data .......
    }

}

6. Netty's graceful shutdown

The previous section on Dubbo's graceful shutdown led us naturally to the moment at which Netty's graceful shutdown is triggered. In this section, the author explains in detail how Netty takes its graceful curtain call~~

Earlier articles in this series followed the operation of Netty's core framework shown in the figure below and walked through the source code for the creation, startup and operation of the master-slave ReactorGroup, for accepting network connections, for receiving and sending network data, and for handling the related IO events in the pipeline.

                                                        Reactor in Netty.png

This section is the time for Netty's graceful curtain call. During the curtain call, Netty will gracefully shut down its master-slave ReactorGroup and the Reactor in the corresponding ReactorGroup. Let's take a look at the process of this graceful shutdown~~~

6.1 The graceful curtain call of ReactorGroup

public abstract class AbstractEventExecutorGroup implements EventExecutorGroup {

    static final long DEFAULT_SHUTDOWN_QUIET_PERIOD = 2;
    static final long DEFAULT_SHUTDOWN_TIMEOUT = 15;

   @Override
    public Future<?> shutdownGracefully() {
        return shutdownGracefully(DEFAULT_SHUTDOWN_QUIET_PERIOD, DEFAULT_SHUTDOWN_TIMEOUT, TimeUnit.SECONDS);
    }

}

In the whole process of Netty's graceful shutdown, two very important control parameters are involved:

  • gracefulShutdownQuietPeriod: the quiet period of the graceful shutdown, 2s by default. This parameter mainly guarantees that Netty's shutdown is graceful. After the shutdown process starts, Netty cannot shut down while the Reactor still has asynchronous tasks to execute; all of them must be executed first. Once they have all been executed, Netty still has to make sure the business is not damaged, and this is where the quiet period comes in. If no new task is submitted to the Reactor during the quiet period, the shutdown can proceed. If tasks are still submitted during the quiet period, the Reactor cannot shut down yet: it must first execute those tasks, and only then can it safely shut down.

  • gracefulShutdownTimeout: the timeout of the graceful shutdown, 15s by default. This parameter mainly guarantees that Netty's shutdown is controllable. A production-grade graceful shutdown scheme must not only keep the business undamaged; more importantly, it must keep the shutdown process controllable. It cannot stay graceful indefinitely so that the shutdown never completes. That is why Netty introduces this parameter: once the graceful shutdown times out, Netty shuts down regardless of whether there are still asynchronous tasks to execute.

These two parameters are central to the shutdown process. We will analyze them in detail when we go through Netty's shutdown later; for now, a general understanding of the concept is enough.

After introducing these two important core parameters, let's look at the shutdown process of ReactorGroup:

As we know, to guarantee the throughput of the whole system and to let each Reactor process the IO events on every Channel in a thread-safe and ordered manner, Netty distributes the massive number of connections it carries across different Reactors.

ReactorGroup contains multiple Reactors, each Channel can only be registered to one fixed Reactor, and this fixed Reactor is responsible for handling the entire life cycle events on the Channel.

Multiple Channels are registered on a Reactor, which is responsible for handling IO events and asynchronous tasks of all Channels registered on it.

The structure of ReactorGroup is shown in the figure below:

The shutdown process of ReactorGroup is actually the shutdown of all Reactors contained in ReactorGroup. When all Reactors in ReactorGroup are closed, ReactorGroup is truly closed.

public abstract class MultithreadEventExecutorGroup extends AbstractEventExecutorGroup {

    // The Reactors in the Reactor thread group
    private final EventExecutor[] children;

    // Termination future
    private final Promise<?> terminationFuture = new DefaultPromise(GlobalEventExecutor.INSTANCE);

    @Override
    public Future<?> shutdownGracefully(long quietPeriod, long timeout, TimeUnit unit) {
        for (EventExecutor l: children) {
            l.shutdownGracefully(quietPeriod, timeout, unit);
        }
        return terminationFuture();
    }

    @Override
    public Future<?> terminationFuture() {
        return terminationFuture;
    }

}

  • EventExecutor[] children: the array that holds all Reactors contained in the current ReactorGroup; its element type is EventExecutor.

  • Promise<?> terminationFuture: the shutdown Future of the ReactorGroup. Through terminationFuture, user threads can learn when the ReactorGroup has finished shutting down, and they can also register listeners on it. When the ReactorGroup completes its shutdown, the listeners registered by the user are called back. You can use it flexibly according to your business scenario.

When the ReactorGroup is shut down, it triggers the shutdown of every Reactor it contains, one by one, and returns terminationFuture to the user thread.

When all Reactors in the ReactorGroup have shut down, terminationFuture is set to success, so that the user thread can tell that the ReactorGroup has finished shutting down.

This point is also emphasized by the author in the fourth subsection "4. Register the terminated callback function to all Reactors in the Reactor thread group" in the article " Reactor Implementation in Netty (Creation)" .

In the last step of ReactorGroup creation, the terminationListener for Reactor shutdown is defined. In the terminationListener of the Reactor, it will judge whether all the Reactors in the current ReactorGroup are closed. If they are all closed, the terminationFuture of the ReactorGroup will be set to success.

    // Counts the Reactors that have shut down; the ReactorGroup counts as shut down only after all of them have
    private final AtomicInteger terminatedChildren = new AtomicInteger();
    // Termination future of the ReactorGroup
    private final Promise<?> terminationFuture = new DefaultPromise(GlobalEventExecutor.INSTANCE);

    protected MultithreadEventExecutorGroup(int nThreads, Executor executor,
                                            EventExecutorChooserFactory chooserFactory, Object... args) {

        // ........ create the Reactors one by one ............

        final FutureListener<Object> terminationListener = new FutureListener<Object>() {
            @Override
            public void operationComplete(Future<Object> future) throws Exception {
                if (terminatedChildren.incrementAndGet() == children.length) {
                    // The ReactorGroup counts as successfully shut down only after all Reactors have shut down
                    terminationFuture.setSuccess(null);
                }
            }
        };

        for (EventExecutor e: children) {
            // Register the terminationListener with each Reactor
            e.terminationFuture().addListener(terminationListener);
        }
    }
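The counting idea behind terminationListener can be tried out without Netty. The sketch below is a JDK-only analogy in which CompletableFuture stands in for Netty's Promise; GroupTerminationModel and its fields are illustrative names, not Netty classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// JDK-only analogy: CompletableFuture stands in for Netty's Promise; names are illustrative
final class GroupTerminationModel {

    // One termination future per child "Reactor"
    final List<CompletableFuture<Void>> children = new ArrayList<>();
    // Termination future of the whole group
    final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();
    private final AtomicInteger terminatedChildren = new AtomicInteger();

    GroupTerminationModel(int nChildren) {
        for (int i = 0; i < nChildren; i++) {
            children.add(new CompletableFuture<>());
        }
        for (CompletableFuture<Void> child : children) {
            // Same listener for every child: whichever child finishes last completes the group
            child.whenComplete((result, error) -> {
                if (terminatedChildren.incrementAndGet() == children.size()) {
                    terminationFuture.complete(null);
                }
            });
        }
    }
}
```

With three children, completing two of them leaves terminationFuture pending; completing the third one completes it. Because the counter is atomic, the result does not depend on which thread or in which order the children finish.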

From the shutdown process of ReactorGroup above, we can see that the shutdown logic of ReactorGroup just triggers the shutdown of all Reactors it contains one by one. The core of Netty's entire elegant shutdown is actually the shutdown logic of a single Reactor. After all, Reactor is the core engine that really drives Netty to run.

6.2 Reactor's graceful curtain call

                                        Reactor's elegant curtain call process.png

The status of the Reactor is particularly important. From the article "A Discussion on the Operating Architecture of the Netty Core Engine Reactor", we know that the Reactor works continuously in a for (;;) {....} infinite loop. For example, polling the IO ready event on the Channel, processing the IO ready event, and executing asynchronous tasks are completed in this infinite loop.

At the end of each loop iteration, the Reactor first checks its own state. If the state has changed to ST_SHUTTING_DOWN, the Reactor starts the graceful shutdown process.

So before introducing the shutdown process of Reactor, the author will give you a brief overview of the various states in Reactor.

  • ST_NOT_STARTED = 1: The initial state of the Reactor. When the Reactor was first created, the state was ST_NOT_STARTED.

  • ST_STARTED = 2: The startup state of Reactor. The start of Reactor is triggered when the first asynchronous task is submitted to Reactor. The status changes to ST_STARTED after startup.

Relevant details can be reviewed in the article "Detailed illustration of the whole process of starting Netty Reactor" .

  • ST_SHUTTING_DOWN = 3: the Reactor is about to shut down. When the Reactor's shutdownGracefully method is called, its state changes to ST_SHUTTING_DOWN. In this state, users can still submit tasks to the Reactor.

  • ST_SHUTDOWN = 4: the Reactor's stopped state, which indicates that the graceful shutdown process has ended. In this state users can no longer submit tasks to the Reactor, and the Reactor executes the remaining asynchronous tasks one last time.

  • ST_TERMINATED = 5: the true final state of the Reactor, which means that the Reactor has shut down completely. In this state the Reactor sets its own terminationFuture to success and then calls back the terminationListener mentioned at the end of the previous section.

After we understand the various states of Reactor, it's time to officially start introducing the shutdown process of Reactor:

public abstract class SingleThreadEventExecutor extends AbstractScheduledEventExecutor implements OrderedEventExecutor {

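    // State of the Reactor; initially "not started"
    private volatile int state = ST_NOT_STARTED;

    // Initial state of the Reactor: not started
    private static final int ST_NOT_STARTED = 1;
    // State after the Reactor has started
    private static final int ST_STARTED = 2;
    // Graceful shutdown is about to begin; users can still submit tasks and the Reactor can still execute them
    private static final int ST_SHUTTING_DOWN = 3;
    // Reactor stopped: the graceful shutdown has ended, users can no longer submit tasks, and the Reactor executes the remaining tasks one last time
    private static final int ST_SHUTDOWN = 4;
    // All tasks in the Reactor have been executed and no new tasks are accepted: the true terminal state
    private static final int ST_TERMINATED = 5;

    // Quiet period of the graceful shutdown
    private volatile long gracefulShutdownQuietPeriod;
    // Timeout of the graceful shutdown
    private volatile long gracefulShutdownTimeout;

    // Termination Future of the Reactor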
    private final Promise<?> terminationFuture = new DefaultPromise<Void>(GlobalEventExecutor.INSTANCE);

    @Override
    public Future<?> shutdownGracefully(long quietPeriod, long timeout, TimeUnit unit) {

        // ...... parameter validation omitted .......

        // At this point the Reactor's state is ST_STARTED
        if (isShuttingDown()) {
            return terminationFuture();
        }

        boolean inEventLoop = inEventLoop();
        boolean wakeup;
        int oldState;
        for (;;) {
            if (isShuttingDown()) {
                return terminationFuture();
            }
            int newState;
            // The Reactor needs to be woken up to run the shutdown process
            wakeup = true;
            oldState = state;
            if (inEventLoop) {
                newState = ST_SHUTTING_DOWN;
            } else {
                switch (oldState) {
                    case ST_NOT_STARTED:
                    case ST_STARTED:
                        newState = ST_SHUTTING_DOWN;
                        break;
                    default:
                        // The Reactor is shutting down or has already shut down
                        newState = oldState;
                        wakeup = false;
                }
            }
            if (STATE_UPDATER.compareAndSet(this, oldState, newState)) {
                break;
            }
        }
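        // Quiet period of the graceful shutdown: during it users can still submit tasks to the Reactor and have them executed; as long as the Reactor has tasks, it cannot shut down
        // Every 100ms the Reactor checks whether tasks have been submitted; it shuts down only if no new task arrives during the quiet period, which keeps the shutdown graceful
        gracefulShutdownQuietPeriod = unit.toNanos(quietPeriod);
        // Maximum duration of the graceful shutdown; once it is exceeded, the Reactor shuts down whether or not tasks remain
        // This keeps the shutdown controllable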
        gracefulShutdownTimeout = unit.toNanos(timeout);

        // The Reactor thread must be running here; if it has already stopped, skip the rest of the shutdown and return terminationFuture directly
        if (ensureThreadStarted(oldState)) {
            return terminationFuture;
        }

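        // Wake up the Reactor that is listening for IO events on the Selector so that it starts the shutdown process
        if (wakeup) {
            // Make sure the Reactor thread does not linger on the Selector after executing its tasks
            taskQueue.offer(WAKEUP_TASK);
            if (!addTaskWakesUp) {
                // If the Reactor is blocked on the Selector right now, this wakes it up promptly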
                wakeup(inEventLoop);
            }
        }

        return terminationFuture();
    }

    @Override
    public Future<?> terminationFuture() {
        return terminationFuture;
    }

}

First, before starting the shutdown process, you need to call isShuttingDown() to determine whether the current Reactor has started the shutdown process or completed the shutdown. If it has already started to close, it will directly return Reactor's terminationFuture here.

    @Override
    public boolean isShuttingDown() {
        return state >= ST_SHUTTING_DOWN;
    }

The remaining logic is to keep trying to change Reactor's current ST_STARTED state to ST_SHUTTING_DOWN shutting down state through CAS in a for loop.

If it is judged by inEventLoop() that the current execution thread is a Reactor thread, it means that the current state of the Reactor is only ST_STARTED running state, then you can directly set newState to ST_SHUTTING_DOWN. Because it will only run here when Reactor is in ST_STARTED state. Otherwise, it directly returns terminationFuture in front.

If the current execution thread is a user thread and not a Reactor thread, then the state of the Reactor may be closing or already closed, and the user thread is repeatedly initiating the shutdown process of the Reactor. So the handling of these exception scenarios will be done in the switch(oldState){....} statement.

            switch (oldState) {
                    case ST_NOT_STARTED:
                    case ST_STARTED:
                        newState = ST_SHUTTING_DOWN;
                        break;
                    default:
                        // The Reactor is shutting down or has already shut down
                        newState = oldState;
                        // The Reactor is already in the shutdown process, so there is no need to wake it up again
                        wakeup = false;
                }

If the current Reactor has not initiated the shutdown process, such as the state is ST_NOT_STARTED or ST_STARTED, then you can safely set newState to ST_SHUTTING_DOWN.

If the current Reactor is already in the shutdown process or has completed the shutdown, for example, the status is ST_SHUTTING_DOWN, ST_SHUTDOWN or ST_TERMINATED. Then there is no need to wake up the Reactor and repeat the shutdown process with wakeup = false. The state of the Reactor remains unchanged from the current state.

After the state of the Reactor is determined, the current state of the Reactor is continuously modified through the CAS in the for loop. At this point oldState = ST_STARTED, newState = ST_SHUTTING_DOWN.

          if (STATE_UPDATER.compareAndSet(this, oldState, newState)) {
                break;
            }
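The CAS retry loop can be isolated in a few lines. The sketch below models only the state transition and leaves out everything else in shutdownGracefully (the inEventLoop branch, the timing parameters and the wakeup). ReactorStateModel is an illustrative class, not Netty's SingleThreadEventExecutor.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Minimal model of the state transition only; not Netty's SingleThreadEventExecutor
final class ReactorStateModel {

    static final int ST_NOT_STARTED = 1;
    static final int ST_STARTED = 2;
    static final int ST_SHUTTING_DOWN = 3;
    static final int ST_SHUTDOWN = 4;
    static final int ST_TERMINATED = 5;

    private static final AtomicIntegerFieldUpdater<ReactorStateModel> STATE_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(ReactorStateModel.class, "state");

    private volatile int state = ST_NOT_STARTED;

    int state() {
        return state;
    }

    // Returns true if this call moved the state to ST_SHUTTING_DOWN,
    // i.e. the caller is the one that should wake the Reactor up
    boolean beginShutdown() {
        for (;;) {
            int oldState = state;
            if (oldState >= ST_SHUTTING_DOWN) {
                return false; // already shutting down or shut down: nothing to do
            }
            if (STATE_UPDATER.compareAndSet(this, oldState, ST_SHUTTING_DOWN)) {
                return true;
            }
            // Lost a race with another thread: re-read the state and retry
        }
    }
}
```

However many threads call beginShutdown() concurrently, only one compareAndSet can succeed in moving the state forward, so exactly one caller gets true. Every later call sees a state of at least ST_SHUTTING_DOWN and returns false.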

Then the two core parameters that control Netty's graceful shutdown, introduced at the beginning of the section "6.1 The graceful curtain call of ReactorGroup", are set on the Reactor:

  • gracefulShutdownQuietPeriod: the quiet period of the graceful shutdown, 2s by default. The quiet period starts once the Reactor has no more asynchronous tasks to execute. During it, Netty checks every 100ms whether new tasks have been submitted. If no new task is submitted during the quiet period, the Reactor shuts down, which keeps the shutdown graceful.

  • gracefulShutdownTimeout: the timeout of the graceful shutdown, 15s by default. The graceful shutdown must not take longer than this. Once the timeout is exceeded, the Reactor shuts down whether or not tasks remain, which keeps the shutdown controllable.
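How the two parameters interact can be modelled with plain JDK classes. The sketch below is a simplified model written for this explanation, not Netty's actual shutdown code: QuietPeriodModel, awaitQuiet and runPendingTasks are hypothetical names, and only the 100ms check interval is taken from the description above.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Simplified model of the quiet period and the timeout; not Netty's implementation
final class QuietPeriodModel {

    final Queue<Runnable> taskQueue = new ConcurrentLinkedQueue<>();

    // Runs the tasks that are queued right now; tasks they submit wait for the next round
    private int runPendingTasks() {
        int pending = taskQueue.size();
        int ran = 0;
        for (int i = 0; i < pending; i++) {
            Runnable task = taskQueue.poll();
            if (task == null) {
                break;
            }
            task.run();
            ran++;
        }
        return ran;
    }

    // Blocks until it is time to stop. Returns true if the quiet period elapsed
    // without new tasks, false if the timeout (or an interrupt) forced the stop.
    boolean awaitQuiet(long quietPeriod, long timeout, TimeUnit unit) {
        long quietNanos = unit.toNanos(quietPeriod);
        long timeoutNanos = unit.toNanos(timeout);
        long start = System.nanoTime();
        long lastExecution = start;
        for (;;) {
            if (runPendingTasks() > 0) {
                lastExecution = System.nanoTime(); // a task ran: the quiet period restarts
            }
            long now = System.nanoTime();
            if (now - start > timeoutNanos) {
                return false; // controllable: stop even though tasks keep arriving
            }
            if (now - lastExecution > quietNanos) {
                return true; // graceful: nothing was submitted for a whole quiet period
            }
            try {
                Thread.sleep(100); // check for newly submitted tasks every 100ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With an empty queue, awaitQuiet(200, 5000, TimeUnit.MILLISECONDS) returns true after a little more than the quiet period. If a task keeps resubmitting itself, the quiet period never elapses and the call returns false once the timeout has passed.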

At this point the Reactor is ready to run the shutdown process. Before that, we need to make sure that the Reactor thread is running. If the Reactor thread has not been started yet, it has to be started first so that it can perform the shutdown.

        // The Reactor thread must be running here; if it has already stopped,
        // skip the rest of the shutdown and return terminationFuture directly
        if (ensureThreadStarted(oldState)) {
            return terminationFuture;
        }

    private boolean ensureThreadStarted(int oldState) {
        if (oldState == ST_NOT_STARTED) {
            try {
                doStartThread();
            } catch (Throwable cause) {
                STATE_UPDATER.set(this, ST_TERMINATED);
                terminationFuture.tryFailure(cause);

                if (!(cause instanceof Exception)) {
                    // Also rethrow as it may be an OOME for example
                    PlatformDependent.throwException(cause);
                }
                return true;
            }
        }
        return false;
    }

If the Reactor thread has just finished executing asynchronous tasks or is currently blocked on the Selector, we need to make sure it is woken up in time so that it can enter the shutdown process directly. This is the case where wakeup == true.

Here addTaskWakesUp is false by default. It means that not only the addTask method can wake up the Reactor thread, there are other ways to wake up the Reactor thread, such as the SingleThreadEventExecutor#execute method and the SingleThreadEventExecutor#shutdownGracefully method introduced in this section will wake up the Reactor thread.

For the detailed meaning and function of the addTaskWakesUp field, you can review the section "1.2.2 Reactor starts polling for IO readiness events" in the article " A Discussion on the Operating Architecture of Netty Core Engine Reactor" .

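        // Wake up the Reactor that is listening for IO events on the Selector so that it starts the shutdown process
        if (wakeup) {
            // Make sure the Reactor thread does not linger on the Selector after executing its tasks
            taskQueue.offer(WAKEUP_TASK);
            if (!addTaskWakesUp) {
                // If the Reactor is blocked on the Selector right now, this wakes it up promptly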
                wakeup(inEventLoop);
            }
        }

  • Adding WAKEUP_TASK to the Reactor through taskQueue.offer(WAKEUP_TASK) ensures that the Reactor does not linger on the Selector after executing its asynchronous tasks and goes straight to the shutdown operation.

  • If the Reactor thread is blocked on the Selector at this moment, wakeup(inEventLoop) is called to wake it up so that it enters the shutdown process directly.

public final class NioEventLoop extends SingleThreadEventLoop {
    @Override
    protected void wakeup(boolean inEventLoop) {
        if (!inEventLoop && nextWakeupNanos.getAndSet(AWAKE) != AWAKE) {
            selector.wakeup();
        }
    }
}
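The guarded wakeup can be reproduced with the JDK's own Selector. The following sketch is an illustration under assumptions: SelectorWakeupModel, blockOnSelect and the constant values are chosen for the example, and the inEventLoop parameter is left out because the demo always calls wakeup() from another thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicLong;

// JDK NIO sketch of the wakeup guard; class, method and constant names are illustrative
final class SelectorWakeupModel {

    static final long AWAKE = -1L;
    static final long NONE = Long.MAX_VALUE;

    private final Selector selector;
    // AWAKE while the loop thread is running, any other value while it may block in select()
    private final AtomicLong nextWakeupNanos = new AtomicLong(AWAKE);

    SelectorWakeupModel() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loop thread: announce "I may block", then block until an IO event or a wakeup arrives
    void blockOnSelect() {
        nextWakeupNanos.set(NONE);
        try {
            selector.select();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            nextWakeupNanos.lazySet(AWAKE);
        }
    }

    // Other threads: the getAndSet guard lets only the first caller pay for selector.wakeup()
    boolean wakeup() {
        if (nextWakeupNanos.getAndSet(AWAKE) != AWAKE) {
            selector.wakeup();
            return true;
        }
        return false;
    }

    // Demo: returns true if a thread parked in select() is released by a single wakeup() call
    static boolean wakesBlockedThread() {
        SelectorWakeupModel model = new SelectorWakeupModel();
        Thread reactor = new Thread(model::blockOnSelect, "reactor");
        reactor.start();
        while (model.nextWakeupNanos.get() == AWAKE && reactor.isAlive()) {
            Thread.yield(); // wait until the loop thread has announced that it may block
        }
        boolean first = model.wakeup();
        try {
            reactor.join(5000);
            boolean second = model.wakeup(); // the thread is awake again: nothing to do
            return first && !second && !reactor.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            try {
                model.selector.close();
            } catch (IOException ignored) {
                // best effort cleanup in a demo
            }
        }
    }
}
```

The guard matters because selector.wakeup() is comparatively expensive. If several threads request a wakeup while the loop thread is blocked, only the one that flips the flag from "may block" to AWAKE actually calls it.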

6.3 Graceful shutdown of Reactor threads

Let's first take a look at the shutdown process as a whole through a Reactor graceful shutdown overall flow chart:

                                                Graceful shutdown process of the Reactor thread.png

From the article "A Discussion on the Operating Architecture of Reactor, Netty's Core Engine", we know that the Reactor tirelessly processes IO events and executes asynchronous tasks in an endless for loop, as shown in the Reactor run-loop skeleton extracted by the author below:

public final class NioEventLoop extends SingleThreadEventLoop {

    @Override
    protected void run() {
        for (;;) {
            try {
                  // .......1. Listen for IO events on the Channels.......
                  // .......2. Process the IO events on the Channels.......
                  // .......3. Execute asynchronous tasks..........
            } finally {
                try {
                    if (isShuttingDown()) {
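                        // Close all Channels registered on the Reactor, stop processing IO events, and fire the unActive and unRegister events
                        closeAll();
                        // With all Channels deregistered and IO processing stopped, what remains is to execute the remaining asynchronous tasks in the Reactor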
                        if (confirmShutdown()) {
                            return;
                        }
                    }
                } catch (Error e) {
                    throw (Error) e;
                } catch (Throwable t) {
                    handleLoopException(t);
                }
            }
        }
    }

}

In the finally{....} block at the end of each loop iteration, the Reactor calls isShuttingDown() to check whether it has entered the shutting-down state. If it has, the Reactor officially enters its graceful shutdown process.

We mentioned in the "1.2 Graceful Shutdown" section earlier in this article when discussing the graceful shutdown scheme that we should focus on the following two aspects to implement graceful shutdown:

  1. First, you need to cut off the existing traffic borne by the program.

  2. Ensure that the existing remaining tasks can be completed and the business will not be damaged.

The graceful shutdown implemented by Netty here also follows these two points.

  1. At the start of the graceful shutdown process, the closeAll() method is called to close all Channels registered on the Reactor and cut off the existing traffic.

  2. Then the confirmShutdown() method is called to finish the remaining asynchronous tasks. As long as there are asynchronous tasks left to execute, the Reactor cannot shut down, which keeps the business undamaged. A return value of true means the Reactor can shut down; false means it cannot shut down yet.
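The order of these two steps can be observed in a toy loop. The sketch below is not Netty's NioEventLoop: MiniReactor is a made-up class, a BlockingQueue stands in for the Selector and the task queue, the shutdown check runs after each iteration instead of inside a finally block, and confirmShutdown() is reduced to draining the queue once, without the quiet period and timeout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy single-threaded loop showing "cut traffic first, then finish the tasks"
final class MiniReactor implements Runnable {

    private final BlockingQueue<Runnable> taskQueue = new LinkedBlockingQueue<>();
    private volatile boolean shuttingDown;
    // Records what happened, in order, so that the sequence can be inspected
    final List<String> trace = Collections.synchronizedList(new ArrayList<>());

    void execute(String name) {
        taskQueue.add(() -> trace.add(name));
    }

    void shutdownGracefully() {
        shuttingDown = true;
        taskQueue.add(() -> { }); // empty task: wakes the loop if it is waiting for work
    }

    @Override
    public void run() {
        for (;;) {
            runOneIteration();
            if (shuttingDown) {
                closeAll();              // 1. cut off the traffic first
                if (confirmShutdown()) { // 2. then finish what is left
                    return;
                }
            }
        }
    }

    // Stands in for "select IO events, process them, run tasks"
    private void runOneIteration() {
        try {
            Runnable task = taskQueue.poll(100, TimeUnit.MILLISECONDS);
            if (task != null) {
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            shuttingDown = true; // treat an interrupt as a shutdown request
        }
    }

    private void closeAll() {
        trace.add("closeAll");
    }

    // Simplification: drains the queue once and reports "safe to stop"
    private boolean confirmShutdown() {
        for (Runnable task; (task = taskQueue.poll()) != null; ) {
            task.run();
        }
        return true;
    }
}
```

If three tasks are queued and shutdownGracefully() is called before run(), the trace reads task-1, closeAll, task-2, task-3. The tasks that were still queued when the shutdown began are executed after the traffic has been cut off, and none of them is dropped.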

6.3.1 Cut off traffic

    private void closeAll() {
        //这里的目的是清理selector中的一些无效key
        selectAgain();
        //获取Selector上注册的所有Channel
        Set<SelectionKey> keys = selector.keys();
        Collection<AbstractNioChannel> channels = new ArrayList<AbstractNioChannel>(keys.size());
        for (SelectionKey k: keys) {
            // Get the NioSocketChannel attached to the key
            Object a = k.attachment();
            if (a instanceof AbstractNioChannel) {
                channels.add((AbstractNioChannel) a);
            } else {
                // ......omitted......
            }
        }

        for (AbstractNioChannel ch: channels) {
            // Close every Channel registered on the Reactor and trigger the ChannelInactive and ChannelUnregistered events in its pipeline
            ch.unsafe().close(ch.unsafe().voidPromise());
        }
    }

First, selectAgain() performs one last non-blocking select on the Selector in order to clear any invalid keys from it.

Regarding the removal of invalid keys, you can refer back to the section "3.1.3 Removing invalid SelectionKeys from Selector" in the article "A Discussion on the Operating Architecture of Netty Core Engine Reactor" for details.

Next, selector.keys() returns all SelectionKeys registered on the Selector, and the Netty NioSocketChannel is retrieved from the attachment of each key. The correspondence between SelectionKey and NioSocketChannel is shown in the figure below:

                                                Correspondence between channel and SelectionKey.png

Finally, close these NioSocketChannels registered on Reactor one by one.

For the details of how a Channel is closed, see the author's article "Let's see how Netty responds to the normal closing, abnormal closing, and half-closing scenarios of TCP connections".
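The snapshot-then-close pattern of closeAll() can be tried out with plain JDK NIO. The following is only a minimal sketch under stated assumptions, not Netty's code: CloseAllSketch and demo() are hypothetical names, a java.nio.channels.Pipe stands in for a socket connection, and the channel itself is used as the key attachment where Netty attaches its own AbstractNioChannel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

final class CloseAllSketch {

    // Closes every channel found as a key attachment and returns how many were closed.
    static int closeAll(Selector selector) throws IOException {
        // Non-blocking select: lets the selector drop keys that were already cancelled.
        selector.selectNow();

        // Collect the channels first, close them afterwards (the two loops of closeAll()).
        List<SelectableChannel> channels = new ArrayList<>();
        for (SelectionKey key : selector.keys()) {
            Object attachment = key.attachment();
            if (attachment instanceof SelectableChannel) {
                channels.add((SelectableChannel) attachment);
            }
        }
        for (SelectableChannel channel : channels) {
            channel.close(); // closing a channel cancels its keys
        }
        return channels.size();
    }

    // Registers two pipe sources, closes them through closeAll(), and reports the outcome.
    static boolean demo() {
        try (Selector selector = Selector.open()) {
            List<Pipe> pipes = new ArrayList<>();
            for (int i = 0; i < 2; i++) {
                Pipe pipe = Pipe.open();
                pipe.source().configureBlocking(false);
                // Assumption of this sketch: the channel is its own attachment.
                pipe.source().register(selector, SelectionKey.OP_READ, pipe.source());
                pipes.add(pipe);
            }
            int closed = closeAll(selector);
            selector.selectNow(); // cancelled keys are removed during the next select
            boolean allClosed = true;
            for (Pipe pipe : pipes) {
                allClosed &= !pipe.source().isOpen();
                pipe.sink().close();
            }
            return closed == 2 && allClosed && selector.keys().isEmpty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the keys only disappear from selector.keys() after the next select call, which is why demo() calls selectNow() once more before checking the key set.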

6.3.2 Guarantee that the business will not be damaged

The logic in the confirmShutdown() method is the core of the Reactor's graceful shutdown. To ensure that the business is not damaged, Netty's rule is that the Reactor cannot be closed as long as there are asynchronous tasks or ShutdownHooks left to execute. Only after all tasks and ShutdownHooks have been executed does it start to consider shutting down.

    protected boolean confirmShutdown() {
        if (!isShuttingDown()) {
            return false;
        }

        if (!inEventLoop()) {
            throw new IllegalStateException("must be invoked from an event loop");
        }

        // Cancel all scheduled tasks
        cancelScheduledTasks();

        if (gracefulShutdownStartTime == 0) {
            // Record the start time of the graceful shutdown (a relative time)
            gracefulShutdownStartTime = ScheduledFutureTask.nanoTime();
        }

        // As long as there are tasks to execute, the Reactor cannot be closed
        if (runAllTasks() || runShutdownHooks()) {
            if (isShutdown()) {
                // Executor shut down - no new tasks anymore.
                return true;
            }

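            /**
             * gracefulShutdownQuietPeriod is a period during which users can still submit asynchronous tasks;
             * the Reactor guarantees that tasks submitted within it are executed.
             *
             * gracefulShutdownQuietPeriod = 0 means there is no silent period: once the current tasks in the
             * Reactor have been executed, it shuts down without waiting for a silent period.
             * */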
            if (gracefulShutdownQuietPeriod == 0) {
                return true;
            }
            // Keep the Reactor from blocking on the Selector: IO events are no longer handled at this point, so it can concentrate on the shutdown process
            taskQueue.offer(WAKEUP_TASK);
            return false;
        }

        // There are no tasks left to execute in the Reactor, so it is time to consider shutting down
        final long nanoTime = ScheduledFutureTask.nanoTime();

        // After all tasks in the Reactor have been executed, check whether gracefulShutdownTimeout has been exceeded
        // If it has, shut down directly
        if (isShutdown() || nanoTime - gracefulShutdownStartTime > gracefulShutdownTimeout) {
            return true;
        }

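        // Even with no tasks left, the Reactor cannot be closed yet: it has to wait for a silent period and only shuts down if no new task is submitted within it
        // If tasks are still submitted during the silent period, the silent period starts over and a new round of detection begins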
        if (nanoTime - lastExecutionTime <= gracefulShutdownQuietPeriod) {
            taskQueue.offer(WAKEUP_TASK);
            try {
                // Within gracefulShutdownQuietPeriod, check every 100ms whether there are tasks to execute
                Thread.sleep(100);
            } catch (InterruptedException e) {
                // Ignore
            }

            return false;
        }

        // No task needed to be executed during the whole gracefulShutdownQuietPeriod (the silent period is over), so shut down directly without waiting for gracefulShutdownTimeout
        return true;
    }

Before the shutdown process starts, Netty first calls the cancelScheduledTasks() method to cancel all scheduled tasks that are still waiting to be executed in the Reactor.

It then records the start time of the graceful shutdown in gracefulShutdownStartTime, which is used later to determine whether the graceful shutdown process has timed out.

Next, it calls the runAllTasks() method to take out and execute all asynchronous tasks remaining in the Reactor's TaskQueue.

        

                                                Run remaining tasks and hooks.png

Call the runShutdownHooks() method to take out and execute the ShutdownHook registered by the user on the Reactor.

We can register ShutdownHooks with Reactor in the user thread as follows:

        NioEventLoop reactor = (NioEventLoop) ctx.channel().eventLoop();
        reactor.addShutdownHook(new Runnable() {
            @Override
            public void run() {
                // .....shutdown logic....
            }
        });

When the Reactor is shutting down, these ShutdownHooks registered by the user will be taken out and run.

public abstract class SingleThreadEventExecutor extends AbstractScheduledEventExecutor implements OrderedEventExecutor {

   // shutdownHooks can be added to the Reactor; they are called when the Reactor is closed
   private final Set<Runnable> shutdownHooks = new LinkedHashSet<Runnable>();

   private boolean runShutdownHooks() {
        boolean ran = false;
        while (!shutdownHooks.isEmpty()) {
            List<Runnable> copy = new ArrayList<Runnable>(shutdownHooks);
            shutdownHooks.clear();
            for (Runnable task: copy) {
                try {
                    // The Reactor thread runs the hooks synchronously, one by one, in order
                    task.run();
                } catch (Throwable t) {
                    logger.warn("Shutdown hook raised an exception.", t);
                } finally {
                    ran = true;
                }
            }
        }

        if (ran) {
            lastExecutionTime = ScheduledFutureTask.nanoTime();
        }

        return ran;
    }

}

Note that the ShutdownHooks here are a mechanism provided by Netty, not the JVM ShutdownHooks we introduced in the "3. ShutdownHook in the JVM" section.

A ShutdownHook in the JVM is a Thread, and the JVM runs these hooks concurrently and in no particular order before it shuts down. A ShutdownHook in Netty is a Runnable, and the Reactor thread runs these hooks synchronously and in order before the Reactor is closed.
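To make the "synchronous and ordered" behaviour concrete, here is a minimal single-threaded sketch modelled on runShutdownHooks(). It is not Netty's class: OrderedHooksSketch, addHook, runHooks and demo are hypothetical names, and the sketch assumes that hooks are only registered from the thread that runs them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class OrderedHooksSketch {
    // LinkedHashSet keeps insertion order and ignores duplicate registrations.
    private final Set<Runnable> hooks = new LinkedHashSet<>();

    void addHook(Runnable hook) {
        hooks.add(hook);
    }

    // Runs the hooks one by one on the calling thread; returns true if at least one ran.
    boolean runHooks() {
        boolean ran = false;
        while (!hooks.isEmpty()) {
            // Work on a copy so that a hook may register further hooks while running.
            List<Runnable> copy = new ArrayList<>(hooks);
            hooks.clear();
            for (Runnable hook : copy) {
                try {
                    hook.run();
                } catch (RuntimeException e) {
                    // A failing hook must not stop the remaining hooks from running.
                } finally {
                    ran = true;
                }
            }
        }
        return ran;
    }

    // Registers three hooks, one of which registers a fourth while running.
    static List<String> demo() {
        OrderedHooksSketch registry = new OrderedHooksSketch();
        List<String> order = new ArrayList<>();
        registry.addHook(() -> order.add("first"));
        registry.addHook(() -> {
            order.add("second");
            registry.addHook(() -> order.add("late"));
        });
        registry.addHook(() -> order.add("third"));
        registry.runHooks();
        return order;
    }
}
```

In demo() the hooks run in registration order, and the hook added from inside another hook is picked up by the next pass of the outer while loop, so the recorded order is first, second, third, late.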

Note that as long as there are tasks or hooks to execute, Netty keeps executing them until all of them are done.

When the Reactor has no tasks left to execute, it checks whether the time spent on the current shutdown process exceeds gracefulShutdownTimeout, the maximum graceful shutdown timeout we set earlier.

nanoTime - gracefulShutdownStartTime > gracefulShutdownTimeout

If the shutdown process has timed out due to the execution of the previous tasks, then directly close the Reactor and exit the working cycle of the Reactor.

If there is no timeout, then the silent period gracefulShutdownQuietPeriod introduced earlier will be triggered.

During the silent period, the Reactor thread checks every 100ms whether the user has submitted any task. If so, it makes sure that the submitted tasks are executed, after which the silent period starts over and a new round of silent period detection begins.

If there is no task submitted during the entire silent period, the Reactor will be shut down directly without waiting for the gracefulShutdownTimeout to time out, and the Reactor working cycle will exit.

From the above process, we can see that Netty's graceful shutdown has to wait for at least one silent period (unless gracefulShutdownQuietPeriod is 0). Another point is that the graceful shutdown may take longer than gracefulShutdownTimeout, because Netty must ensure that the remaining tasks are completed: the timeout is only checked once there are no tasks left to execute.
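The order of the checks in confirmShutdown() can be restated as a pure function, which makes the interplay between the silent period and the timeout easier to reason about. This is only a sketch under assumptions: ShutdownDecisionSketch and canShutdown are hypothetical names, the clock values are passed in as plain nanosecond parameters, and the side effects of the real method (running tasks, offering WAKEUP_TASK, sleeping 100ms) are left out.

```java
import java.util.concurrent.TimeUnit;

final class ShutdownDecisionSketch {

    // Returns true when the loop may stop; all times are in nanoseconds.
    static boolean canShutdown(boolean ranTasksOrHooks, boolean alreadyShutdown,
                               long now, long startTime, long lastExecutionTime,
                               long quietPeriod, long timeout) {
        if (ranTasksOrHooks) {
            // Work was done in this round: stop only if the state is already
            // SHUTDOWN or no silent period was requested.
            return alreadyShutdown || quietPeriod == 0;
        }
        // Nothing left to run: the timeout is checked only now.
        if (alreadyShutdown || now - startTime > timeout) {
            return true;
        }
        // Still inside the silent period measured from the last executed task: keep waiting.
        return now - lastExecutionTime > quietPeriod;
    }

    // Convenience helper for the examples: converts seconds to nanoseconds.
    static long sec(long seconds) {
        return TimeUnit.SECONDS.toNanos(seconds);
    }
}
```

For instance, with an illustrative silent period of 2 seconds and a timeout of 15 seconds, a round at second 20 that has just executed a task still returns false. This is the case in which the total shutdown time exceeds gracefulShutdownTimeout.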

6.4 Reactor's final shutdown process

The confirmShutdown() method described in the previous section returns true when no task has been submitted during the silent period or when the shutdown process has timed out. The Reactor thread then exits its work loop.

public final class NioEventLoop extends SingleThreadEventLoop {

    @Override
    protected void run() {
        for (;;) {
            try {
                  .......1. Listen for IO events on the Channels.......
                  .......2. Handle IO events on the Channels.......
                  .......3. Execute asynchronous tasks..........
            } finally {
                try {
                    if (isShuttingDown()) {
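                        // Close all Channels registered on the Reactor, stop handling IO events, and trigger the ChannelInactive and ChannelUnregistered events
                        closeAll();
                        // Once all Channels are closed and IO events are no longer handled, what remains is to execute the Reactor's remaining asynchronous tasks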
                        if (confirmShutdown()) {
                            return;
                        }
                    }
                } catch (Error e) {
                    throw (Error) e;
                } catch (Throwable t) {
                    handleLoopException(t);
                }
            }
        }
    }

}

In the section "1.3.3 Reactor Thread Startup" of the article "Detailed Illustration of the Netty Reactor Startup Process", we mentioned that the Reactor thread is started when the first asynchronous task is submitted to the Reactor. The task submission method SingleThreadEventExecutor#execute(java.lang.Runnable, boolean) triggers the call to the doStartThread() method below, and it is there that the Reactor work loop run() method shown above is called.

In the finally{...} statement block of the doStartThread() method, the final closing process of the Reactor will be completed, that is, the subsequent closing process of the Reactor after exiting the for loop in the run method.

Finally, the complete process of Reactor's graceful shutdown is shown in the following figure:

                                        Reactor gracefully closes the whole process.png

public abstract class SingleThreadEventExecutor extends AbstractScheduledEventExecutor implements OrderedEventExecutor {

    private void doStartThread() {
        assert thread == null;
        executor.execute(new Runnable() {
            @Override
            public void run() {

                // ..........omitted.........

                try {
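                    // The Reactor thread starts polling for IO events and executing asynchronous tasks
                    SingleThreadEventExecutor.this.run();
                    // Execution gets here after the user calls shutdownGracefully and the Reactor exits its loop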
                    success = true;
                } catch (Throwable t) {
                    logger.warn("Unexpected exception from an event executor: ", t);
                } finally {
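                    // Getting here means that no tasks were submitted to the Reactor during the silent period, or that the graceful shutdown timeout was reached; start closing the Reactor
                    // If the Reactor is not yet in a shutting-down state, set its state to ST_SHUTTING_DOWN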
                    for (;;) {
                        int oldState = state;
                        if (oldState >= ST_SHUTTING_DOWN || STATE_UPDATER.compareAndSet(
                                SingleThreadEventExecutor.this, oldState, ST_SHUTTING_DOWN)) {
                            break;
                        }
                    }

                    try {
                        for (;;) {
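                            // The Reactor has exited its work loop, but its state is still ST_SHUTTING_DOWN and the task queue is still there,
                            // so users can still submit tasks. This loop ensures that tasks submitted at the last moment get executed.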
                            if (confirmShutdown()) {
                                break;
                            }
                        }

                        for (;;) {
                            // Once the Reactor state is updated to SHUTDOWN, tasks submitted by users are rejected
                            int oldState = state;
                            if (oldState >= ST_SHUTDOWN || STATE_UPDATER.compareAndSet(
                                    SingleThreadEventExecutor.this, oldState, ST_SHUTDOWN)) {
                                break;
                            }
                        }

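                        // The Reactor state is now SHUTDOWN, so it no longer accepts new tasks from users.
                        // However, users may have submitted tasks just before the state changed, while the Reactor was still SHUTTING_DOWN,
                        // so the Reactor may still hold tasks, and these remaining tasks must be executed.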
                        confirmShutdown();
                    } finally {
                        try {
                            // In the SHUTDOWN state, after all remaining tasks have been executed, close the Selector
                            cleanup();
                        } finally {
                            // Clean up the threadLocal caches in the Reactor thread and notify the corresponding future.
                            FastThreadLocal.removeAll();

                            // ST_TERMINATED is the Reactor's real final state
                            STATE_UPDATER.set(SingleThreadEventExecutor.this, ST_TERMINATED);
                            
                            // Let the awaitTermination method return
                            threadLock.countDown();

                            // Count the tasks still left unexecuted in the Reactor's task queue and log a warning
                            int numUserTasks = drainTasks();
                            if (numUserTasks > 0 && logger.isWarnEnabled()) {
                                logger.warn("An event executor terminated with " +
                                        "non-empty task queue (" + numUserTasks + ')');
                            }

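                            /**
                             * Mark the Reactor's terminationFuture as successful. When the Reactor was created, a listener was added to its terminationFuture.
                             * The listener increments terminatedChildren; once all Reactors are closed, the ReactorGroup is closed successfully.
                             * */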
                            terminationFuture.setSuccess(null);
                        }
                    }
                }
            }
        });
    }
}

When execution reaches the finally{...} block of the doStartThread method, it means that either no tasks were submitted to the Reactor during the silent period of graceful shutdown, or the shutdown time has exceeded the maximum graceful shutdown timeout.

This is where the Reactor's shutdown process officially begins. Before it starts, Netty makes sure that the current state of the Reactor is ST_SHUTTING_DOWN, that is, shutting down.

Note that user threads can still submit tasks to the Reactor at this point. Tasks submitted by the user are rejected only once the state of the Reactor has changed to ST_SHUTDOWN or ST_TERMINATED. Since the state is still ST_SHUTTING_DOWN here, submitted tasks are still accepted.

public abstract class SingleThreadEventExecutor extends AbstractScheduledEventExecutor implements OrderedEventExecutor {
  @Override
  public boolean isShutdown() {
        return state >= ST_SHUTDOWN;
  }

  private void execute(Runnable task, boolean immediate) {
        boolean inEventLoop = inEventLoop();
        addTask(task);
        if (!inEventLoop) {
            startThread();
            // When the Reactor state is ST_SHUTDOWN, asynchronous tasks submitted by users are rejected; during graceful shutdown, in the ST_SHUTTING_DOWN state, they are still accepted
            if (isShutdown()) {
                boolean reject = false;
                try {
                    if (removeTask(task)) {
                        reject = true;
                    }
                } catch (UnsupportedOperationException e) {
                }
                if (reject) {
                    reject();
                }
            }
        }

        // .........omitted........
    }
}

Therefore, after the Reactor has exited the run method of its work loop and execution has come all the way here, users may still be submitting tasks to the Reactor. To keep the shutdown graceful, confirmShutdown() is executed repeatedly in a for loop until all tasks have been executed.

Then the state of the Reactor is changed to ST_SHUTDOWN, and users can no longer submit tasks to it. A task submitted at this point is rejected with a RejectedExecutionException.
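The difference between the two states can be condensed into a small sketch. It is deliberately simplified and is not Netty's execute(): TaskGateSketch and its methods are hypothetical names, the numeric state values are illustrative, and the state is checked before the task is queued, whereas the listing above adds the task first and removes it again if the Reactor turns out to be shut down.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.RejectedExecutionException;

final class TaskGateSketch {
    // Illustrative values; only their ordering matters for this sketch.
    static final int ST_SHUTTING_DOWN = 3;
    static final int ST_SHUTDOWN = 4;

    private int state = ST_SHUTTING_DOWN;
    private final Queue<Runnable> taskQueue = new ArrayDeque<>();

    synchronized void setState(int newState) {
        state = newState;
    }

    // Accepts the task while shutting down, rejects it once the state is SHUTDOWN or later.
    synchronized void execute(Runnable task) {
        if (state >= ST_SHUTDOWN) {
            throw new RejectedExecutionException("event executor terminated");
        }
        taskQueue.add(task);
    }

    // Runs everything that is queued and returns the number of executed tasks.
    synchronized int runAllTasks() {
        int count = 0;
        Runnable task;
        while ((task = taskQueue.poll()) != null) {
            task.run();
            count++;
        }
        return count;
    }

    // Returns true if submitting a no-op task to the given gate is rejected.
    static boolean rejects(TaskGateSketch gate) {
        try {
            gate.execute(() -> { });
            return false;
        } catch (RejectedExecutionException expected) {
            return true;
        }
    }
}
```

A gate in the ST_SHUTTING_DOWN state accepts and later runs the task. After setState(ST_SHUTDOWN), the same submission is rejected and nothing is queued.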

You may wonder why Netty calls the confirmShutdown() method once more after the state of the Reactor has changed to ST_SHUTDOWN.

The reason is that a user may have submitted tasks in the narrow window just before the Reactor state changed to ST_SHUTDOWN. A final call to confirmShutdown() is therefore needed to execute the tasks submitted within that window.

The above logical steps are the essence of a truly graceful shutdown, ensuring that all tasks are executed and the business is not damaged.

With the graceful part of the process covered, what follows is the actual closing of the Reactor:

Reactor will close Selector in SHUTDOWN state.

    @Override
    protected void cleanup() {
        try {
            selector.close();
        } catch (IOException e) {
            logger.warn("Failed to close a selector.", e);
        }
    }

Clean up any ThreadLocal cache left over from Reactor threads.

FastThreadLocal.removeAll();

Change the state of the Reactor from ST_SHUTDOWN to ST_TERMINATED. At this point the Reactor is truly closed.

 STATE_UPDATER.set(SingleThreadEventExecutor.this, ST_TERMINATED);

The user thread may call Reactor's awaitTermination method to block and wait for the Reactor to close. When the Reactor is closed, it will call threadLock.countDown() to make the user thread return from the awaitTermination method.

public abstract class SingleThreadEventExecutor extends AbstractScheduledEventExecutor implements OrderedEventExecutor {

    private final CountDownLatch threadLock = new CountDownLatch(1);

    @Override
    public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
        
        // ........omitted.......

        // Wait for the Reactor to close
        threadLock.await(timeout, unit);
        return isTerminated();
    }

    @Override
    public boolean isTerminated() {
        return state == ST_TERMINATED;
    }
}
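The handshake between the terminating Reactor thread and a waiting user thread can be reproduced with the JDK alone. The sketch below only illustrates the CountDownLatch pattern: TerminationLatchSketch, markTerminated and demo are hypothetical names, and a boolean flag stands in for the ST_TERMINATED state.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class TerminationLatchSketch {
    private final CountDownLatch threadLock = new CountDownLatch(1);
    private volatile boolean terminated;

    // Called by the worker thread as its very last step.
    void markTerminated() {
        terminated = true;      // publish the final state first
        threadLock.countDown(); // then release every waiting caller
    }

    // Blocks until termination or until the timeout elapses, whichever comes first.
    boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
        threadLock.await(timeout, unit);
        return terminated;
    }

    // Waits once before and once after the worker has terminated.
    static boolean demo() {
        TerminationLatchSketch executor = new TerminationLatchSketch();
        try {
            // Nobody has terminated the executor yet, so this wait times out.
            boolean before = executor.awaitTermination(10, TimeUnit.MILLISECONDS);
            Thread worker = new Thread(executor::markTerminated);
            worker.start();
            boolean after = executor.awaitTermination(5, TimeUnit.SECONDS);
            worker.join();
            return !before && after;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Setting the flag before counting down mirrors the order in doStartThread(), where the state is set to ST_TERMINATED before threadLock.countDown() is called, so a released caller observes the terminated state.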

When all this is done, the Reactor's terminationFuture is finally set to success. At this time, the listener registered on the terminationFuture of Reactor will be called back.

Recall what we introduced in the article "Reactor Implementation in Netty (Creation)": after all Reactors in the ReactorGroup have been created one by one, a terminationListener is registered on the terminationFuture of every Reactor.

In the terminationListener, check whether all Reactors in the current ReactorGroup are closed. If all Reactors are closed, set the terminationFuture of the ReactorGroup to Success. At this moment, the closing process of ReactorGroup is over, and Netty has officially finished its graceful curtain call~~

public abstract class MultithreadEventExecutorGroup extends AbstractEventExecutorGroup {

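    // The Reactors in the Reactor thread group
    private final EventExecutor[] children;
    // Number of Reactors that have been closed; the group is considered closed only after all Reactors are closed
    private final AtomicInteger terminatedChildren = new AtomicInteger();
    // The termination future of the ReactorGroup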
    private final Promise<?> terminationFuture = new DefaultPromise(GlobalEventExecutor.INSTANCE);

    protected MultithreadEventExecutorGroup(int nThreads, Executor executor,
                                            EventExecutorChooserFactory chooserFactory, Object... args) {
      
        // ........create the Reactors one by one........

        final FutureListener<Object> terminationListener = new FutureListener<Object>() {
            @Override
            public void operationComplete(Future<Object> future) throws Exception {
                if (terminatedChildren.incrementAndGet() == children.length) {
                    // The shutdown is considered successful only after all Reactors are closed
                    terminationFuture.setSuccess(null);
                }
            }
        };

        for (EventExecutor e: children) {
            e.terminationFuture().addListener(terminationListener);
        }

        // ........omitted........
    }

}
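The counting listener can be modelled without Netty. The following sketch makes several assumptions: GroupTerminationSketch and childTerminated are hypothetical names, and java.util.concurrent.CompletableFuture is used as a stand-in for Netty's Promise, with whenComplete playing the role of addListener.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class GroupTerminationSketch {
    private final List<CompletableFuture<Void>> children = new ArrayList<>();
    private final AtomicInteger terminatedChildren = new AtomicInteger();
    private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();

    GroupTerminationSketch(int nChildren) {
        // Create one termination future per child first ...
        for (int i = 0; i < nChildren; i++) {
            children.add(new CompletableFuture<>());
        }
        // ... then register the same counting callback on each of them.
        for (CompletableFuture<Void> child : children) {
            child.whenComplete((value, error) -> {
                // The child that brings the counter to children.size() completes the group.
                if (terminatedChildren.incrementAndGet() == children.size()) {
                    terminationFuture.complete(null);
                }
            });
        }
    }

    // Simulates one child finishing its shutdown.
    void childTerminated(int index) {
        children.get(index).complete(null);
    }

    boolean isTerminated() {
        return terminationFuture.isDone();
    }
}
```

With three children, the group future stays incomplete after the first two calls to childTerminated and completes on the third, regardless of the order in which the children finish.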


So far, the author has introduced the entire graceful shutdown process of Netty in detail. The following figure shows the complete flow chart of the entire graceful shutdown. You can review the source code logic we introduced earlier by referring to the overall flow chart below.

                                                Reactor gracefully closes the total process.png

6.5 Reactor state change flow

At the end of this article, the author will take everyone to review the state change process of Reactor.

                                                Reactor state change.png

  • After the Reactor is created, the state is ST_NOT_STARTED.

  • When the first asynchronous task is submitted, the Reactor starts and its state becomes ST_STARTED.

  • When the shutdownGracefully method is called, the state of the Reactor becomes ST_SHUTTING_DOWN, which indicates that a graceful shutdown is in progress. Users can still submit asynchronous tasks to the Reactor at this point.

  • When all remaining tasks in the Reactor have been executed, the state of the Reactor changes to ST_SHUTDOWN. If the user submits an asynchronous task to the Reactor from this point on, it is rejected with a RejectedExecutionException.

  • When the Selector has been closed and all ThreadLocal caches in the Reactor thread have been cleaned up, the Reactor state becomes ST_TERMINATED.
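The state changes listed above only ever move forward, which is what the CAS loops in doStartThread() rely on. A minimal sketch of that forward-only transition follows. ReactorStateSketch and advanceTo are hypothetical names, the numeric values are illustrative (only their ordering matters), and an AtomicInteger replaces the field updater used in the listing.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ReactorStateSketch {
    // Illustrative values; the sketch only depends on their ordering.
    static final int ST_NOT_STARTED = 1;
    static final int ST_STARTED = 2;
    static final int ST_SHUTTING_DOWN = 3;
    static final int ST_SHUTDOWN = 4;
    static final int ST_TERMINATED = 5;

    private final AtomicInteger state = new AtomicInteger(ST_NOT_STARTED);

    // Moves the state forward to target unless it is already at or beyond it.
    int advanceTo(int target) {
        for (;;) {
            int oldState = state.get();
            // Same shape as the loops in doStartThread(): stop if the state is
            // already far enough, otherwise retry until the CAS succeeds.
            if (oldState >= target || state.compareAndSet(oldState, target)) {
                return state.get();
            }
        }
    }
}
```

Asking for ST_SHUTTING_DOWN when the state is already ST_SHUTDOWN leaves the state untouched, so a late or concurrent caller cannot move the Reactor backwards in its life cycle.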

Summary

With this, the author has finished explaining the past and present of graceful shutdown. There is a lot of information here and it takes time to digest. Hats off to everyone who read this far in one sitting.

In this article, we started from the need to start and stop a process gracefully and worked our way through the implementation of graceful shutdown. We first introduced its underlying cornerstone, the kernel's signal mechanism, and from the kernel we moved on to the principle and execution process of the JVM's ShutdownHook. Finally, using three well-known open source frameworks as case studies, we went from Spring's graceful shutdown to Dubbo's, and from Dubbo's graceful shutdown to the detailed implementation of Netty's, bringing the discussion full circle.

That is all for this article. Thank you for your effort; I believe a careful read will have been well worth it. See you in the next article~~~

                


Origin blog.csdn.net/blueheartstone/article/details/128127515