How to deal with dockerd resource leaks: seeing through the symptoms to the root of the problem


1. Phenomenon

An online k8s cluster raised an alarm: the host's fd utilization exceeded 80%. Logging in to check, dockerd was using 26 GB of memory.

2. Troubleshooting ideas

Since we have run into several dockerd resource leaks before, the first step is to check whether this one is caused by a known issue.

3. What is on the other end of the fds?

Execute ss -anp | grep dockerd; the result is shown in the figure below. This problem is different from the ones we hit before: the eighth column shows 0 and the peer of the socket cannot be found, which does not match the earlier cases.

[Figure: output of ss -anp | grep dockerd; the peer column shows 0]
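As a quick sanity check on how many fds dockerd is actually holding (the alarm above was about fd utilization), you can count the entries under /proc/<pid>/fd. A minimal Go sketch, not part of the original troubleshooting, with the pid passed on the command line:

package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <dockerd-pid>")
		os.Exit(1)
	}
	// Every fd the process holds appears as a symlink under /proc/<pid>/fd.
	entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %s holds %d fds\n", os.Args[1], len(entries))
}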

4. Why is the memory leaking?

In order to use pprof to analyze where the memory is leaking, first enable debug mode for dockerd. This requires modifying the systemd service file and adding the following two lines:

ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process

At the same time, add "debug": true to the /etc/docker/daemon.json file. After the modification, run systemctl daemon-reload to reload the docker service configuration, then run systemctl reload docker to reload the docker configuration and turn on debug mode.

By default, dockerd serves its API over a unix domain socket (uds). To make debugging easier, we can use socat to forward the socket to a TCP port, as follows: sudo socat -d -d TCP-LISTEN:8080,fork,bind=0.0.0.0 UNIX:/var/run/docker.sock. With this in place, the docker API (including the pprof debug endpoints) can be reached from outside by accessing port 8080 of the host. Everything is now ready.
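If socat happens to be unavailable, the same forwarding can be sketched in a few lines of Go; the :8080 listen address and the /var/run/docker.sock path simply mirror the socat command above:

package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Accept TCP connections on :8080 and proxy each one to the docker
	// unix socket, which is what the socat command above does.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			uds, err := net.Dial("unix", "/var/run/docker.sock")
			if err != nil {
				log.Println("dial docker.sock:", err)
				return
			}
			defer uds.Close()
			// Shovel bytes in both directions until either side closes.
			go io.Copy(uds, c)
			io.Copy(c, uds)
		}(conn)
	}
}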

Run go tool pprof http://ip:8080/debug/pprof/heap locally to view the memory usage, as shown in the figure below

[Figure: go tool pprof heap profile of dockerd]

You can see that most of the memory sits in golang's own bufio.NewWriterSize and bufio.NewReaderSize. Every http call goes through these allocations, so nothing obviously wrong can be identified from the heap profile alone.
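One plausible reading, consistent with the fd leak, is that every API connection that is held open keeps its read/write buffers reachable. A rough, standalone illustration of that effect (not dockerd code; the 32 KiB buffer size and the count of 10,000 are arbitrary):

package main

import (
	"bufio"
	"fmt"
	"runtime"
	"strings"
)

func main() {
	// Pretend these are API connections that were accepted but never completed:
	// each one keeps its own bufio buffer alive, so a heap profile attributes
	// the memory to bufio.NewReaderSize (and NewWriterSize for the writers).
	var held []*bufio.Reader
	for i := 0; i < 10000; i++ {
		held = append(held, bufio.NewReaderSize(strings.NewReader(""), 32*1024))
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("holding %d readers, heap in use: %d MiB\n", len(held), m.HeapInuse>>20)
}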

5. Are goroutines leaking too?

Leak location

The memory profile alone still does not pinpoint the problem, but that is fine; next look at the goroutines. Visit http://ip:8080/debug/pprof/goroutine?debug=1 directly in the browser, as shown below.

[Figure: goroutine dump from /debug/pprof/goroutine?debug=1]
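If you would rather capture the dump from a script than from the browser (for example to compare goroutine counts between two points in time), here is a minimal Go sketch that saves the same endpoint to a file; the 127.0.0.1:8080 address assumes the socat forward set up earlier:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Fetch the goroutine profile exposed by dockerd's debug endpoint
	// (reachable via the socat forward) and save it with a timestamp,
	// so successive dumps can be diffed to see which stacks keep growing.
	resp, err := http.Get("http://127.0.0.1:8080/debug/pprof/goroutine?debug=1")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	name := fmt.Sprintf("goroutine-%s.txt", time.Now().Format("20060102-150405"))
	f, err := os.Create(name)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", name)
}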

There are 1,572,822 goroutines in total, with two big contributors accounting for 786,212 each. Seeing this, you can basically go straight to the source code following the file names and line numbers in the dump. We are using docker version 18.09.2 here, so switch the source code to the corresponding version; reading it shows why these two kinds of goroutine leak. The related processing flow between dockerd and containerd is as follows.

[Figure: processing flow between dockerd and containerd for docker kill]

Matching this against the picture above, the goroutines leak at the final wait of the docker kill flow, where Kill blocks waiting for a chan to be closed; Wait itself starts another goroutine, so every docker kill leaks two goroutines. The corresponding code is as follows:

// Kill forcefully terminates a container.
func (daemon *Daemon) Kill(container *containerpkg.Container) error {
   if !container.IsRunning() {
      return errNotRunning(container.ID)
   }

   // 1. Send SIGKILL
   if err := daemon.killPossiblyDeadProcess(container, int(syscall.SIGKILL)); err != nil {
      // While normally we might "return err" here we're not going to
      // because if we can't stop the container by this point then
      // it's probably because it's already stopped. Meaning, between
      // the time of the IsRunning() call above and now it stopped.
      // Also, since the err return will be environment specific we can't
      // look for any particular (common) error that would indicate
      // that the process is already dead vs something else going wrong.
      // So, instead we'll give it up to 2 more seconds to complete and if
      // by that time the container is still running, then the error
      // we got is probably valid and so we return it to the caller.
      if isErrNoSuchProcess(err) {
         return nil
      }

      ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
      defer cancel()

      if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
         return err
      }
   }

   // 2. Wait for the process to die, in last resort, try to kill the process directly
   if err := killProcessDirectly(container); err != nil {
      if isErrNoSuchProcess(err) {
         return nil
      }
      return err
   }

   // Wait for exit with no timeout.
   // Ignore returned status.
   <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)

   return nil
}

// Wait waits until the container is in a certain state indicated by the given
// condition. A context must be used for cancelling the request, controlling
// timeouts, and avoiding goroutine leaks. Wait must be called without holding
// the state lock. Returns a channel from which the caller will receive the
// result. If the container exited on its own, the result's Err() method will
// be nil and its ExitCode() method will return the container's exit code,
// otherwise, the results Err() method will return an error indicating why the
// wait operation failed.
func (s *State) Wait(ctx context.Context, condition WaitCondition) <-chan StateStatus {
   s.Lock()
   defer s.Unlock()

   if condition == WaitConditionNotRunning && !s.Running {
      // Buffer so we can put it in the channel now.
      resultC := make(chan StateStatus, 1)

      // Send the current status.
      resultC <- StateStatus{
         exitCode: s.ExitCode(),
         err:      s.Err(),
      }

      return resultC
   }

   // If we are waiting only for removal, the waitStop channel should
   // remain nil and block forever.
   var waitStop chan struct{}
   if condition < WaitConditionRemoved {
      waitStop = s.waitStop
   }

   // Always wait for removal, just in case the container gets removed
   // while it is still in a "created" state, in which case it is never
   // actually stopped.
   waitRemove := s.waitRemove

   resultC := make(chan StateStatus)
   go func() {
      select {
      case <-ctx.Done():
         // Context timeout or cancellation.
         resultC <- StateStatus{
            exitCode: -1,
            err:      ctx.Err(),
         }
         return
      case <-waitStop:
      case <-waitRemove:
      }

      s.Lock()
      result := StateStatus{
         exitCode: s.ExitCode(),
         err:      s.Err(),
      }
      s.Unlock()

      resultC <- result
   }()

   return resultC
}

Comparing this with the goroutine dump, the two leaked goroutines are sitting at the last container.Wait in Kill and at the select inside Wait. Precisely because the select in Wait never returns, resultC never receives data and the caller can never read anything from the chan returned by container.Wait, so every docker stop call leaves two goroutines blocked.

Why does it leak?

Why does the select never return? The select is waiting on three chans, and it returns as soon as any one of them receives data or is closed. A minimal sketch of this leak pattern follows the two items below.

ctx.Done(): never fires, because the last call to Wait passed in context.Background(). This is in fact how dockerd handles the request: since the client wants the container deleted, dockerd waits until the container is actually deleted, and only returns then. As long as the container is not deleted, a goroutine keeps waiting.
waitStop and waitRemove: never fire, because dockerd never receives the task exit signal from containerd. As the picture above shows, these chans are only closed after the task exit event is received.
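To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern (not dockerd code): a Wait-like function whose goroutine selects on a context and a stop chan, called with context.Background() while the stop chan is never closed.

package main

import (
	"context"
	"fmt"
	"runtime"
)

// wait mimics the shape of State.Wait: it returns a channel and spawns a
// goroutine that only delivers a result once the context is done or stop closes.
func wait(ctx context.Context, stop <-chan struct{}) <-chan error {
	resultC := make(chan error)
	go func() {
		select {
		case <-ctx.Done():
			resultC <- ctx.Err()
		case <-stop:
			resultC <- nil
		}
	}()
	return resultC
}

func main() {
	stop := make(chan struct{}) // never closed: the "task exit" event never arrives

	for i := 0; i < 1000; i++ {
		// Like the final step of Kill: wait with context.Background(), i.e. no
		// timeout. Neither case of the select can ever fire, so the spawned
		// goroutine blocks forever; in dockerd the caller additionally blocks
		// reading from the returned chan, which makes it two goroutines per kill.
		_ = wait(context.Background(), stop)
	}

	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}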
Why was the task exit event not received?

The picture is gradually getting clearer, but we still need to find out why the task exit event was not received. There are two possibilities:

Sent but not received: the first thing that comes to mind is a problem Tencent ran into before, also on docker 18.x, where the processEvent goroutine exited abnormally and dockerd could no longer receive the signals sent by containerd; see here [1].
Not sent at all.
First check whether the event could have been received at all, i.e. look again at the goroutine dump. As shown in the figure below, both the goroutine that processes events (processEventStream) and the goroutine that receives events (Subscribe) are still alive, so the first possibility can be ruled out.

[Figure: goroutine dump showing the processEventStream and Subscribe goroutines]

Then look at the second possibility: no task exit event was sent at all. From the analysis above we know the goroutines leak and that the leak is triggered by docker stop, so kubelet must have been issuing requests to delete the container and retrying continuously; otherwise the leak would not keep growing. What remains is to find out which container is being deleted over and over, and why it cannot be deleted. At this point you may already suspect that there is most likely a process stuck in the D (uninterruptible sleep) state inside the container, so even though the KILL signal is sent, the container process cannot exit normally. The next step is to verify this conjecture. First find out which container has the problem by looking at the kubelet log and the docker log, as follows.

[Figures: kubelet and dockerd logs]

Sure enough, more than one container cannot be deleted. This confirms that containers are being deleted continuously but the deletion never succeeds. Next, check whether there are D-state processes, as follows.

[Figures: D-state processes inside the container]

There are indeed D-state processes inside the container. You can also check on the host with ps aux | awk '$8=="D"'; there are a lot of D-state processes.
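As an alternative to the ps one-liner, a small Go sketch (a hypothetical helper, not from the original troubleshooting) that scans /proc and lists processes whose state is D:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Each numeric directory under /proc is a pid; the field right after the
	// comm field in /proc/<pid>/stat is the process state ("D" means
	// uninterruptible sleep).
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		pid := e.Name()
		if !e.IsDir() || pid[0] < '0' || pid[0] > '9' {
			continue
		}
		data, err := os.ReadFile(filepath.Join("/proc", pid, "stat"))
		if err != nil {
			continue // the process may have exited in the meantime
		}
		// The comm field is wrapped in parentheses and may contain spaces,
		// so locate the closing ')' and read the state right after it.
		s := string(data)
		i := strings.LastIndex(s, ")")
		if i < 0 || i+2 >= len(s) {
			continue
		}
		fields := strings.Fields(s[i+2:])
		if len(fields) > 0 && fields[0] == "D" {
			fmt.Printf("pid %s is in D state: %s\n", pid, s[strings.Index(s, "(")+1:i])
		}
	}
}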

6. Summary

To guarantee eventual consistency, when kubelet finds containers on the host that should no longer exist, it keeps trying to delete them. Each deletion attempt calls the docker stop API, establishing a uds connection to dockerd. When deleting a container, dockerd starts a goroutine that calls containerd over rpc and waits for the deletion to complete before returning; during the wait, another goroutine is created to fetch the result. However, when containerd calls runc to actually perform the deletion, it cannot succeed because of the D-state processes in the container, so the task exit signal for the container is never emitted and the two related goroutines in dockerd never exit. This whole cycle repeats continuously, so fds, memory, and goroutines leak step by step until the system gradually becomes unusable.

Looking back, there is actually nothing wrong with kubelet's own behavior: to guarantee consistency, kubelet must keep deleting containers that should not exist until they are completely gone, and it sets a timeout on every docker API call. The logic in dockerd is more debatable and could at least be improved. Since the client request has already timed out, and dockerd will perform the container remove on its own once it receives the task exit event, even if no docker stop request is in flight at that moment, one could consider dropping the final Wait call that is passed context.Background() and returning as soon as the Wait bounded by the current timeout returns, so that no resources are leaked.
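A sketch of what such a change could look like at the tail of Kill (only one possible shape of the improvement discussed above, not an actual upstream patch; the 2-second timeout is borrowed from the earlier step as an example, and the request's own deadline could be used instead):

// Instead of waiting forever:
//    <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)
// bound the final wait and return once the timeout expires. dockerd will still
// remove the container later, when the task exit event eventually arrives.
waitCtx, waitCancel := context.WithTimeout(context.Background(), 2*time.Second)
defer waitCancel()
<-container.Wait(waitCtx, containerpkg.WaitConditionNotRunning)
return nil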


Origin blog.csdn.net/liuxingjiaoyu/article/details/112259524