Graceful Restart of golang Services: A Summary


  • Background
  • GR in golang frameworks
  • Why defunct processes appear under supervisor
  • Using master / worker mode

Background

While a business is growing fast, the only thing that matters is validating the model quickly, so the brief downtime caused by restarting the program during a release is ignored in the early days. Once the validated model matures and traffic grows heavily, the downtime caused by each release has a much larger impact. Our entire service runs in the cloud, and request traffic flows from the layer-4 load balancer -> the layer-7 load balancer -> the machine.

There are roughly three ways to achieve graceful restart (GR). The first is to schedule traffic at the ingress, typically with an API gateway plus CD: during a release the machine is automatically taken out of rotation, the program finishes the requests it has already received, and only then is the release carried out. The benefit is that the program itself does not need to care about how to restart gracefully.

The second is for the program to implement GR itself: during the restart the listening socket FD (file descriptor) keeps accepting incoming requests while the old process hands over to the new one. This has to be done inside the program, which is not simple in some technology stacks. Some languages give no control at the operating-system level, and that causes a lot of trouble.

The third option is to move fully to Docker and hand everything over to k8s for unified management. We are currently onboarding a small number of applications this way.

GR in golang frameworks

Unlike Java, .NET and other languages that run on a virtual machine, golang natively supports system-level calls, so a graceful restart is easy to implement. The principle is to fork a child process on Linux that runs the new code, and then hand the listening socket FD over to it. The principle is not difficult, but a complete implementation of your own still involves many details. Fortunately, there are mature open source libraries that do this for us.

graceful https://github.com/tylerb/graceful
endless https://github.com/fvbock/endless

Both of the above are highly ranked web host frameworks on GitHub that support GR. They differ slightly in the signal they react to: endless restarts on SIGHUP, while graceful uses SIGUSR2. graceful is a relatively pure web host, whereas endless also supports some routing features.

Let's look at how endless handles signals. (If you are interested in srv.fork(), you can read its source.)

func (srv *endlessServer) handleSignals() {
    var sig os.Signal

    signal.Notify(
        srv.sigChan,
        hookableSignals...,
    )

    pid := syscall.Getpid()
    for {
        sig = <-srv.sigChan
        srv.signalHooks(PRE_SIGNAL, sig)
        switch sig {
        case syscall.SIGHUP:
            log.Println(pid, "Received SIGHUP. forking.")
            err := srv.fork()
            if err != nil {
                log.Println("Fork err:", err)
            }
        case syscall.SIGUSR1:
            log.Println(pid, "Received SIGUSR1.")
        case syscall.SIGUSR2:
            log.Println(pid, "Received SIGUSR2.")
            srv.hammerTime(0 * time.Second)
        case syscall.SIGINT:
            log.Println(pid, "Received SIGINT.")
            srv.shutdown()
        case syscall.SIGTERM:
            log.Println(pid, "Received SIGTERM.")
            srv.shutdown()
        case syscall.SIGTSTP:
            log.Println(pid, "Received SIGTSTP.")
        default:
            log.Printf("Received %v: nothing i care about...\n", sig)
        }
        srv.signalHooks(POST_SIGNAL, sig)
    }
}
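
The principle behind the fork step, a new process taking over the same listening socket, can be sketched as follows. This is not the endless implementation, only an illustration of the idea. It is written in Python to match the pidproxy script used later, and `CHILD_CODE` and `serve_once_from_child` are hypothetical names.

```python
import socket
import subprocess
import sys

# Code run by the "new" process: it rebuilds a socket object from the
# inherited file descriptor and serves a single connection.
CHILD_CODE = """
import socket, sys
fd = int(sys.argv[1])
listener = socket.socket(fileno=fd)
conn, _ = listener.accept()
conn.sendall(b"hello from child")
conn.close()
"""

def serve_once_from_child():
    # The "old" process owns the listening socket at first.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(8)
    port = listener.getsockname()[1]
    fd = listener.fileno()

    # Start the new process and let it inherit the listening FD.
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(fd)],
        pass_fds=(fd,),
    )
    # The old process drops its copy; the socket stays open in the child,
    # so connections keep queuing in the backlog during the handover.
    listener.close()

    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        data = client.recv(1024)
    child.wait(timeout=5)
    return data
```

Because the socket is never closed by every holder at once, a client connecting during the handover is queued rather than refused.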

Why defunct processes appear under supervisor

When supervisor manages the process, a proxy layer has to be added in between. supervisor can only manage processes that it started itself: it knows the process ID (PID) of what it started, can detect whether that process is still alive, pulls it back up automatically after a crash, and receives the signal when the process exits.

With a GR framework, however, the process originally started by supervisor forks a child during a release and then exits normally. On the next release that child forks again and exits, but by now it is an ownerless process, and it ends up defunct (a zombie process). The reason is that its exit cannot be completed: there is no master process to receive its exit signal, so the small data structure that an exited process leaves behind cannot be destroyed.
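
To make the parent's role concrete, here is a minimal sketch in which a parent forks a child and then collects its exit status with waitpid. The function name `spawn_and_reap` is hypothetical, and the sketch assumes a POSIX system where `os.fork` is available.

```python
import os

def spawn_and_reap(exit_code=0):
    """Fork a child that exits at once, then collect its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: from the moment the child exits until waitpid() returns,
    # the kernel keeps a small process-table entry for it (a zombie).
    # waitpid() is what lets the kernel release that entry.
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid == pid, os.WEXITSTATUS(status)
```

If the waitpid call is never made by anyone, the exited child stays in the process table as defunct.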

Using master / worker mode

supervisor itself ships a pidproxy program, and we use pidproxy as a proxy in the supervisor command configuration. Because the process ID changes with every fork during a release, the program has to store its PID in a file each time it starts. Large distributed software such as MySQL and ZooKeeper generally keeps such a file for the same purpose: to make the process ID available.

This is in effect a master / worker model. The master process is handed to supervisor: supervisor starts the master process, which is the pidproxy program, and pidproxy then starts our target program. However many times the target program forks, the pidproxy master process is not affected.

pidproxy depends on the PID file, so we must make sure the program writes its current process ID into the PID file every time it starts. Only then can pidproxy work.
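
The PID file contract can be sketched as below. It is shown in Python to match pidproxy; the Go service would do the equivalent at startup by writing the value of os.Getpid() to the file. The helper names `write_pidfile`, `read_pidfile` and `is_alive` are hypothetical, not part of supervisor.

```python
import os
import tempfile

def write_pidfile(path):
    """Atomically record the current process ID in `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("%d\n" % os.getpid())
    # Renaming within one directory is atomic on POSIX, so a reader
    # never sees a half-written file.
    os.replace(tmp_path, path)

def read_pidfile(path):
    """Return the PID stored in `path` as an int."""
    with open(path) as f:
        return int(f.read().strip())

def is_alive(pid):
    """Check whether a process exists, using signal 0 (nothing is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Writing to a temporary file and renaming it avoids the window in which a proxy could read an empty or partial PID file during a restart.
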
The default pidproxy file shipped with supervisor cannot be used as is; it needs some modification. The original is:

https://github.com/Supervisor/supervisor/blob/master/supervisor/pidproxy.py

#!/usr/bin/env python

""" An executable which proxies for a subprocess; upon a signal, it sends that
signal to the process identified by a pidfile. """

import os
import sys
import signal
import time

class PidProxy:
    pid = None
    def __init__(self, args):
        self.setsignals()
        try:
            self.pidfile, cmdargs = args[1], args[2:]
            self.command = os.path.abspath(cmdargs[0])
            self.cmdargs = cmdargs
        except (ValueError, IndexError):
            self.usage()
            sys.exit(1)

    def go(self):
        self.pid = os.spawnv(os.P_NOWAIT, self.command, self.cmdargs)
        while 1:
            time.sleep(5)
            try:
                pid = os.waitpid(-1, os.WNOHANG)[0]
            except OSError:
                pid = None
            if pid:
                break

    def usage(self):
        print("pidproxy.py <pidfile name> <command> [<cmdarg1> ...]")

    def setsignals(self):
        signal.signal(signal.SIGTERM, self.passtochild)
        signal.signal(signal.SIGHUP, self.passtochild)
        signal.signal(signal.SIGINT, self.passtochild)
        signal.signal(signal.SIGUSR1, self.passtochild)
        signal.signal(signal.SIGUSR2, self.passtochild)
        signal.signal(signal.SIGQUIT, self.passtochild)
        signal.signal(signal.SIGCHLD, self.reap)

    def reap(self, sig, frame):
        # do nothing, we reap our child synchronously
        pass

    def passtochild(self, sig, frame):
        try:
            with open(self.pidfile, 'r') as f:
                pid = int(f.read().strip())
        except:
            print("Can't read child pidfile %s!" % self.pidfile)
            return
        os.kill(pid, sig)
        if sig in [signal.SIGTERM, signal.SIGINT, signal.SIGQUIT]:
            sys.exit(0)

def main():
    pp = PidProxy(sys.argv)
    pp.go()

if __name__ == '__main__':
    main()

Let's focus on this method:

    def go(self):
        self.pid = os.spawnv(os.P_NOWAIT, self.command, self.cmdargs)
        while 1:
            time.sleep(5)
            try:
                pid = os.waitpid(-1, os.WNOHANG)[0]
            except OSError:
                pid = None
            if pid:
                break

The go method is the guardian method: it starts the process, obtains its ID, and then calls waitpid. But when our program forks and the original main process exits, os.waitpid picks up that exit and pidproxy quits as well, even though this is just the normal switchover logic.

There are two possible fixes. The first is to make the go method a pure daemon loop, remove its exit logic, and handle exit in the signal handler instead:

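    def passtochild(self, sig, frame):
        # getPid() is a helper not shown here; it reads the PID file and is
        # assumed to return 0 when no PID can be read.
        pid = self.getPid()
        os.kill(pid, sig)
        # Give the old worker time to fork its successor and exit.
        time.sleep(5)
        try:
            pid = os.waitpid(self.pid, os.WNOHANG)[0]
        except OSError:
            print("wait pid null pid %s" % self.pid)
        print("pid shutdown.%s" % pid)
        # Track the new worker, whose PID is now in the PID file.
        self.pid = self.getPid()

        if self.pid == 0:
            sys.exit(0)

        if sig in [signal.SIGTERM, signal.SIGINT, signal.SIGQUIT]:
            print("exit:%s" % sig)
            sys.exit(0)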

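The other way is to modify the original go method:

    def go(self):
        self.pid = os.spawnv(os.P_NOWAIT, self.command, self.cmdargs)
        while 1:
            time.sleep(5)
            # Reap any exited direct child so it does not stay defunct.
            try:
                pid = os.waitpid(-1, os.WNOHANG)[0]
            except OSError:
                pid = None
            # The PID file names the current worker, even after a fork.
            try:
                with open(self.pidfile, 'r') as f:
                    pid = int(f.read().strip())
            except (IOError, ValueError):
                print("Can't read child pidfile %s!" % self.pidfile)
            if not pid:
                # No usable PID this round (kill(0, ...) would hit the group).
                continue
            # Signal 0 only checks existence; exit once the worker is gone.
            try:
                os.kill(pid, 0)
            except OSError:
                sys.exit(0)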

Of course, other methods or ideas work as well; I am only raising the problem here. If you want to see where the real problem lies, debug the pidproxy script locally, which is quite interesting. Once you know where the problem is, how you fix it is entirely up to you.

Author: Wang Qingpei (Fun headlines Tech Leader)

Origin www.cnblogs.com/wangiqngpei557/p/11704747.html