Investigating GC pauses that often exceed 100ms in a Go service running in a container (reprinted from another author's article)

 

GC pauses often exceed 100ms

Symptoms

A colleague reported that the company had recently started trialing k8s, and a Go process deployed in Docker was having problems: interface latency was high and requests were even timing out. The logic is simple: it just calls a KV store, which generally responds in under 5ms, and traffic is very light, under 40 QPS. The container is allocated a quota of 0.5 cores, and in daily operation it uses less than 0.1 cores of CPU.

Reproducing the problem

I picked one container and took it out of the traffic rotation, then looked at request latency with ab at 50 concurrent connections: the network round trip is about 60ms, but the average processing time is more than 200ms and the 99th percentile is 679ms.

 

While load-testing with ab, I checked the CPU and memory figures and nothing looked wrong. Docker allocates 0.5 cores, and the process was not even using that much.

 

 

Looking at the monitoring, GC STW (stop the world) pauses above 10ms were common, many fell in the 50-100ms range, and quite a few exceeded 100ms. Doesn't Go claim that GC pauses are basically under 1ms since 1.8?

 

GC information and trace

Looking at the process's runtime information, memory usage is very low, but gc-pause is huge and GOMAXPROCS is 76, which is the core count of the host machine.

After export GODEBUG=gctrace=1 and restarting the process, you can see that the GC pauses are indeed severe:

  gc 111 @97.209s 1%: 82+7.6+0.036 ms clock, 6297+0.66/6.0/0+2.7 ms cpu, 9->12->6 MB, 11 MB goal, 76 P
  gc 112 @97.798s 1%: 0.040+93+0.14 ms clock, 3.0+0.55/7.1/0+10 ms cpu, 10->11->5 MB, 12 MB goal, 76 P
  gc 113 @99.294s 1%: 0.041+298+100 ms clock, 3.1+0.34/181/0+7605 ms cpu, 10->13->6 MB, 11 MB goal, 76 P
  gc 114 @100.892s 1%: 99+200+99 ms clock, 7597+0/5.6/0+7553 ms cpu, 11->13->6 MB, 13 MB goal, 76 P
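To read these lines, take gc 113 as an example (field meanings as documented for GODEBUG=gctrace=1 in the Go runtime package):

  gc 113                         the 113th GC since the program started
  @99.294s                       seconds since program start
  1%                             percentage of total CPU time spent in GC so far
  0.041+298+100 ms clock         wall-clock time of STW sweep termination, concurrent mark/scan, and STW mark termination
  3.1+0.34/181/0+7605 ms cpu     CPU time spent in those same phases
  10->13->6 MB                   heap size at GC start, heap size at GC end, and live heap afterwards
  11 MB goal                     heap size target for this cycle
  76 P                           number of processors (GOMAXPROCS) used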

On a machine with the Go SDK available, I captured a trace of the service and downloaded the trace file locally to look at it:

  curl -o trace.out 'http://ip:port/debug/pprof/trace?seconds=20'
  sz ./trace.out
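Once downloaded, the trace can be opened locally with the standard tooling, which starts a browser-based viewer:

  go tool trace trace.out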

The figure below shows a GC whose wall time is 172ms, and the two STW phases of that GC, sweep termination and mark termination, each take more than 80ms, together filling almost the entire GC. That is clearly not normal.

 

 

 

Cause and Solution

Cause

This service runs in a container. Containers share the kernel with the host machine, so the number of CPU cores a process inside the container sees is the host machine's core count. A Go application by default sets the number of Ps (GOMAXPROCS) to the number of CPU cores. As the earlier figure shows, GOMAXPROCS is 76, and every P that is in use has an M (OS thread) bound to it, so the thread count is also large: 171. Yet the CPU quota actually allocated to the container is small, only 0.5 cores, while the number of threads is huge.

Guess: under the Linux CFS (Completely Fair Scheduler), all threads (LWPs) inside the container are placed in one scheduling group for efficiency. Each running task, unless it yields voluntarily because it blocks, is guaranteed to run for at least the minimum granularity; /proc/sys/kernel/sched_min_granularity_ns shows this to be 4ms.
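For reference, that granularity can be read directly from the host. The 4ms figure above corresponds to output like the following; the exact number depends on the kernel configuration:

  cat /proc/sys/kernel/sched_min_granularity_ns
  4000000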

Because the Go process in the container did not set GOMAXPROCS appropriately, there were too many runnable threads, and scheduling delays could occur. In particular, the thread that has just initiated the GC STW can be switched out by the scheduler right after it stops the other threads, and then not get scheduled again for a long time. Since the stopped world cannot resume until that thread runs again, the STW time becomes very long: a step that normally takes around 0.1ms grows to the 100ms level because of the scheduling delay.

Solution

The fix: too many runnable Ps mean too many runnable threads crowding out the thread that initiated the STW, while the CPU quota is small. What we need to do is make the number of Ps match the CPU quota. We can choose to:

  1. Increase the container's CPU quota.

  2. At the container layer, make the number of CPU cores the process sees equal to the quota.

  3. Set GOMAXPROCS correctly according to the quota.

Method 1: this makes little difference; changing the quota from 0.5 to 1 does not change the nature of the problem (after trying it, the problem remained).

Method 2: I am not very familiar with k8s, so I will add this part after doing some research.

Method 3: the easiest way to set GOMAXPROCS is to add an environment variable in the startup script:

GOMAXPROCS=2 ./svr_bin. This works, but it has a drawback: if the service is later deployed to a container with a bigger quota, the script cannot follow the change automatically.

Uber's automaxprocs library

Uber has a library, go.uber.org/automaxprocs, that sets GOMAXPROCS correctly when the Go process starts inside a container, and the code change is tiny. Reference the library in go.mod:

  go.uber.org/automaxprocs v1.2.0

And import it in main.go:

  import (
      _ "go.uber.org/automaxprocs"
  )
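Put together, a minimal example looks roughly like this; the blank import is all that is required, and the HTTP handler here is just a stand-in for the real service:

package main

import (
    "fmt"
    "log"
    "net/http"

    // Blank import: the library's init() adjusts GOMAXPROCS to the
    // container's CPU quota when the process starts.
    _ "go.uber.org/automaxprocs"
)

func main() {
    http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "pong")
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}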

Effect

Log output from automaxprocs

With the automaxprocs library, you will see logs like the following:

  1. On a virtual machine or physical machine (8 cores):

    2019/11/07 17:29:47 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined

  2. In a container with a quota of more than one core:

    2019/11/08 19:30:50 maxprocs: Updating GOMAXPROCS=8: determined from CPU quota

  3. In a container with a quota of less than one core:

    2019/11/08 19:19:30 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS

  4. In Docker with no quota set:

    2019/11/07 19:38:34 maxprocs: Leaving GOMAXPROCS=79: CPU quota undefined

    In this case it is recommended to set GOMAXPROCS explicitly in the startup script.
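To double-check the value that ends up in effect, runtime.GOMAXPROCS can be queried without changing it by passing 0; a small sketch, independent of the service code:

package main

import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs"
)

func main() {
    // Passing 0 (or any non-positive value) returns the current setting
    // without modifying it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}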

Request latency

After the change, running ab again: with a 60ms network round trip, 99% of requests are under 200ms, compared with 600ms before. At the same CPU consumption, QPS nearly doubled.

 

 

Runtime and GC trace information

Because the allocation is 0.5 cores, GOMAXPROCS is determined to be 1. gc-pause is now very low, on the order of tens of microseconds, and the number of threads drops from more than 170 to 11.
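The thread count itself can be checked from the host; with <pid> standing in for the service's process id, either of these reports the number of OS threads:

  grep Threads /proc/<pid>/status
  ls /proc/<pid>/task | wc -l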

 

 

 

  gc 97 @54.102s 1%: 0.017+3.3+0.003 ms clock, 0.017+0.51/0.80/0.75+0.003 ms cpu, 9->9->4 MB, 10 MB goal, 1 P
  gc 98 @54.294s 1%: 0.020+5.9+0.003 ms clock, 0.020+0.51/1.6/0+0.003 ms cpu, 8->9->4 MB, 9 MB goal, 1 P
  gc 99 @54.406s 1%: 0.011+4.4+0.003 ms clock, 0.011+0.62/1.2/0.17+0.003 ms cpu, 9->9->4 MB, 10 MB goal, 1 P
  gc 100 @54.597s 1%: 0.009+5.6+0.002 ms clock, 0.009+0.69/1.4/0+0.002 ms cpu, 9->9->5 MB, 10 MB goal, 1 P
  gc 101 @54.715s 1%: 0.026+2.7+0.004 ms clock, 0.026+0.42/0.35/1.4+0.004 ms cpu, 9->9->4 MB, 10 MB goal, 1 P

Context switching

Comparing perf stat results for 8000 requests at a concurrency of 50: with the default P count of 76 (the host's core count) there were about 130,000 context switches, and pidstat showed the system consuming 9% of a core; with P set according to the quota there were only about 20,000 context switches and 3% of a core consumed.
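The exact commands are not given in the original; a typical way to collect comparable numbers while the ab run is in progress would be something like the following, again with <pid> as a placeholder:

  # count context switches of the process over one minute
  perf stat -e context-switches -p <pid> -- sleep 60

  # per-second CPU usage of the process
  pidstat -p <pid> 1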

 

 

 

How automaxprocs works

How does this library set GOMAXPROCS according to the quota? The code takes a few turns, but after reading it the principle is not complicated: Docker uses cgroups to limit a container's CPU usage, and the container's CPU quota can be obtained from its configured cpu.cfs_quota_us / cpu.cfs_period_us. So the key is to find those two values for the container.

Get the cgroup mount information

cat /proc/self/mountinfo

  ....
  1070 1060 0:17 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,mode=755
  1074 1070 0:21 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
  1075 1070 0:22 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
  1076 1070 0:23 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
  1077 1070 0:24 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
  1078 1070 0:25 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuacct,cpu
  1079 1070 0:26 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
  1081 1070 0:27 / /sys/fs/cgroup/net_cls rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls
  ....

So the cpu,cpuacct controller is mounted under the /sys/fs/cgroup/cpu,cpuacct directory.

Get the container's cgroup subdirectory

cat /proc/self/cgroup

  10:net_cls:/kubepods/burstable/pod62f81b5d-xxxx/xxxx92521d65bff8
  9:cpuset:/kubepods/burstable/pod62f81b5d-xxxx/xxxx92521d65bff8
  8:cpuacct,cpu:/kubepods/burstable/pod62f81b5d-xxxx/xxxx92521d65bff8
  7:hugetlb:/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8
  6:blkio:/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8
  5:devices:/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8
  4:memory:/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8
  ....

The container's cpuacct,cpu cgroup is under the /kubepods/burstable/pod62f81b5d-xxxx/xxxx92521d65bff8 subdirectory.

Calculate the quota

  cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8/cpu.cfs_quota_us
  50000

  cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod62f81b5d-5ce0-xxxx/xxxx92521d65bff8/cpu.cfs_period_us
  100000

Dividing the two gives 0.5. If the result is less than 1, GOMAXPROCS is set to 1; if it is greater than 1, it is set to the computed number.

Core function

The core function of the automaxprocs library is shown below: cg holds all the configuration parsed from the cgroup paths; it reads cpu.cfs_quota_us and cpu.cfs_period_us respectively and then computes the quotient.
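As a rough illustration of the idea only (not the library's actual code), the computation boils down to something like this, assuming the cgroup file locations found in the previous steps:

package main

import (
    "fmt"
    "math"
    "os"
    "runtime"
    "strconv"
    "strings"
)

// readCgroupInt reads a single integer value from a cgroup file such as
// cpu.cfs_quota_us (helper name is made up for this sketch).
func readCgroupInt(path string) (int64, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
    // Directory located via /proc/self/mountinfo and /proc/self/cgroup as
    // shown above; simplified here to the mount point.
    dir := "/sys/fs/cgroup/cpu,cpuacct"

    quota, err1 := readCgroupInt(dir + "/cpu.cfs_quota_us")
    period, err2 := readCgroupInt(dir + "/cpu.cfs_period_us")
    if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
        // No CFS quota configured: leave GOMAXPROCS as it is.
        return
    }

    // quota/period is the number of cores the container may use;
    // round down and never go below 1.
    procs := int(math.Floor(float64(quota) / float64(period)))
    if procs < 1 {
        procs = 1
    }
    runtime.GOMAXPROCS(procs)
    fmt.Println("GOMAXPROCS set to", procs)
}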

 

 

 

Official issue

A quick Google search shows that this problem has also been raised as an official issue:

runtime: long GC STW pauses (≥80ms) #19378 https://github.com/golang/go/issues/19378

Summary

  1. The number of CPU cores a containerized process sees is the host machine's core count, usually a large value (>32). This leads the Go process to set P to a high number and to spin up a large number of threads.

  2. A container's quota is generally small, 0.5-4 cores. The Linux scheduler treats the container as one scheduling group and is fair to the threads inside it, guaranteeing each runnable thread a certain slice of running time. With many threads and a small quota, even though the request volume is tiny there is a lot of context switching, which can delay the scheduling of the thread that initiated the STW and push the STW time up to the 100ms level, severely affecting requests.

  3. By using the automaxprocs library to set GOMAXPROCS (the number of Ps) according to the CPU quota allocated to the container, the number of threads is reduced and GC pauses become stable at under 1ms. At the same CPU consumption, QPS roughly doubles and the average response time drops from 200ms to 100ms; thread context switching drops to about 1/6 of the original.

  4. The article also briefly analyzes how the library works: find the container's cgroup directory and compute cpu.cfs_quota_us / cpu.cfs_period_us under cpuacct,cpu, which is the number of cores allocated to the container.

  5. Of course, if the process inside the container could see a core count equal to the allocated quota, that would also solve the problem; I am not very familiar with that approach.


Source: www.cnblogs.com/jackey2015/p/11827474.html