Kubernetes Kubelet 线程泄漏

xx Bank, 2021-11-02 operation record
Orphaned pods
Inspecting pods that use large numbers of threads on a node
Finding the corresponding pod
Fix
livenessProbe/readinessProbe health check best practices

Orphaned pods


Checking the log /var/log/messages shows that an orphaned pod exists:

Nov 2 14:45:56 bots-hrx-ksw7 kubelet: E1102 14:45:56.853519 6990 kubelet_volumes.go:154] orphaned pod "bfd406e1-b35a-41fe-9bbd-fb6783747383" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Nov 2 14:45:58 bots-hrx-ksw7 kubelet: E1102 14:45:58.846728 6990 kubelet_volumes.go:154] orphaned pod "bfd406e1-b35a-41fe-9bbd-fb6783747383" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Nov 2 14:46:00 bots-hrx-ksw7 kubelet: E1102 14:46:00.853502 6990 kubelet_volumes.go:154] orphaned pod "bfd406e1-b35a-41fe-9bbd-fb6783747383" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Nov 2 14:46:02 bots-hrx-ksw7 kubelet: E1102 14:46:02.858094 6990 kubelet_volumes.go:154] orphaned pod "bfd406e1-b35a-41fe-9bbd-fb6783747383" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
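The same check can be scripted. A small sketch, assuming the log path and message format shown above (on journald-only nodes, read `journalctl -u kubelet` instead):

```shell
# Extract the distinct orphaned-pod UIDs from the kubelet log.
# The UID is the 36-character quoted string in the "orphaned pod" message.
orphaned_uids() {
    grep 'orphaned pod' "$1" 2>/dev/null | grep -o '"[0-9a-f-]\{36\}"' | tr -d '"' | sort -u
}

orphaned_uids /var/log/messages
```

Each UID printed corresponds to a directory under /var/lib/kubelet/pods/ that the kubelet can no longer clean up on its own.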

Digging further into the historical logs shows that this pod is a leftover of ks-controller-manager:

Oct 6 08:00:19 bots-hrx-ksw7 kubelet: I1006 08:00:19.623538 6990 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "host-time" (UniqueName: "kubernetes.io/host-path/bfd406e1-b35a-41fe-9bbd-fb6783747383-host-time") pod "ks-controller-manager-756d467d5c-5x8dl" (UID: "bfd406e1-b35a-41fe-9bbd-fb6783747383")
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: I1006 08:00:19.623602 6990 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "webhook-secret" (UniqueName: "kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-webhook-secret") pod "ks-controller-manager-756d467d5c-5x8dl" (UID: "bfd406e1-b35a-41fe-9bbd-fb6783747383")
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: I1006 08:00:19.623623 6990 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kubesphere-config" (UniqueName: "kubernetes.io/configmap/bfd406e1-b35a-41fe-9bbd-fb6783747383-kubesphere-config") pod "ks-controller-manager-756d467d5c-5x8dl" (UID: "bfd406e1-b35a-41fe-9bbd-fb6783747383")
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: I1006 08:00:19.623643 6990 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kubesphere-token-49nzh" (UniqueName: "kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-kubesphere-token-49nzh") pod "ks-controller-manager-756d467d5c-5x8dl" (UID: "bfd406e1-b35a-41fe-9bbd-fb6783747383")
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/webhook-secret --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/webhook-secret
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: E1006 08:00:19.728563 6990 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-webhook-secret podName:bfd406e1-b35a-41fe-9bbd-fb6783747383 nodeName:}" failed. No retries permitted until 2021-10-06 08:00:20.226577433 +0800 CST m=+2154075.895598614 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"webhook-secret\" (UniqueName: \"kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-webhook-secret\") pod \"ks-controller-manager-756d467d5c-5x8dl\" (UID: \"bfd406e1-b35a-41fe-9bbd-fb6783747383\") : mount failed: fork/exec /usr/bin/systemd-run: resource temporarily unavailable\nMounting command: systemd-run\nMounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/webhook-secret --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/webhook-secret\nOutput: "
Oct 6 08:00:19 bots-hrx-ksw7 kubelet: E1006 08:00:19.728615 6990 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-kubesphere-token-49nzh podName:bfd406e1-b35a-41fe-9bbd-fb6783747383 nodeName:}" failed. No retries permitted until 2021-10-06 08:00:20.228590839 +0800 CST m=+2154075.897612007 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"kubesphere-token-49nzh\" (UniqueName: \"kubernetes.io/secret/bfd406e1-b35a-41fe-9bbd-fb6783747383-kubesphere-token-49nzh\") pod \"ks-controller-manager-756d467d5c-5x8dl\" (UID: \"bfd406e1-b35a-41fe-9bbd-fb6783747383\") : mount failed: fork/exec /usr/bin/systemd-run: resource temporarily unavailable\nMounting command: systemd-run\nMounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh\nOutput: "
Oct 6 08:00:20 bots-hrx-ksw7 kubelet: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/volumes/kubernetes.io~secret/kubesphere-token-49nzh
Fix
Note the repeated `mount failed: fork/exec /usr/bin/systemd-run: resource temporarily unavailable` errors above: the node could not spawn new processes, which ties this back to the thread/PID exhaustion diagnosed in the next section. The stale pod directory itself can simply be removed:
rm -rf /var/lib/kubelet/pods/bfd406e1-b35a-41fe-9bbd-fb6783747383/
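Before deleting, it is worth confirming that no volume is still mounted under the pod directory, since `rm -rf` on a still-mounted tmpfs would empty the live mount in place. A hedged sketch, using the UID from the log above:

```shell
# Remove an orphaned pod directory only when nothing is mounted under it.
POD_UID=bfd406e1-b35a-41fe-9bbd-fb6783747383
POD_DIR="/var/lib/kubelet/pods/${POD_UID}"

if grep -q "$POD_UID" /proc/mounts; then
    echo "volumes still mounted under $POD_DIR; umount them first"
else
    rm -rf "$POD_DIR/"
fi
```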

Inspecting pods that use large numbers of threads on a node


Run the following on ksw7:

[root@bots-hrx-ksw7 ~]# printf "NUM\tPID\tCOMMAND\n" && ps -eLf | awk 
'{$1=null;$3=null;$4=null;$5=null;$6=null;$7=null;$8=null;$9=null;print}' | sort | uniq -c | sort -rn | head -10
NUM PID COMMAND
 31628 15128 /usr/lib/jvm/java-1.8-openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.
util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Duser.timezone=Asia/Shanghai -Dhippo.
log.home=/usr/local/tomcat/logs -XX:HeapDumpPath=/wls/heapdump/ -Xms4096M -Xmx4096M -XX:MetaspaceSize=256m -XX:
MaxMetaspaceSize=256m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Djdk.tls.ephemeralDHKeySize=2048 -
Djava.protocol.handler.pkgs=org.apache.catalina.webresources -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap 
-XX:MaxRAMFraction=2 -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 -Dorg.apache.catalina.security.SecurityListener.
UMASK=0027 -Xms4096M -Xmx4096M -Xmn384m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m -XX:+UseConcMarkSweepGC -
XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:
+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:-OmitStackTraceInFastThrow -Dfile.encoding=UTF-8 -XX:+PrintGC -XX:
+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/tomcat/logs/data-dsp-databus-758d79c7db-scxhd/gc.log -Dspring.profiles.
active=standard -Duser.timezone=Asia/Shanghai -Dfile.encoding=UTF-8 -javaagent:/usr/local/tomcat/lib/jmx_prometheus_javaagent-0.13.0.
jar=1234:/usr/local/tomcat/conf/prometheus.yaml -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat
/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.
apache.catalina.startup.Bootstrap start
 291 24645 /docker-java-home/jre/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.
logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Duser.timezone=Asia/Shanghai -Dhippo.log.
home=/usr/local/tomcat/logs -XX:HeapDumpPath=/wls/heapdump/ -Xms2048M -Xmx2048M -XX:MetaspaceSize=512m -XX:
MaxMetaspaceSize=512m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Dfile.encoding=UTF-8 -Duser.
timezone=GMT+08 -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.
catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin
/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.
catalina.startup.Bootstrap start
 215 12781 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.file=/usr/local/tomcat
/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Duser.
timezone=Asia/Shanghai -DAsyncLoggerConfig.RingBufferSize=1024 -Dlog4j2.AsyncQueueFullPolicy=Discard -Dlog4j2.
configurationFactory=com.pingan.hippo.config.log4j2.HippoXMLConfigurationFactory -Dlog4j2.DiscardThreshold=ERROR -Dskywalking.
agent.service_name=hmp-batch-service -Dskywalking.collector.backend_service=hippo-oap.bots-hrx-app.svc.cluster.local:11800 -javaagent:
/wls/agent/agent/skywalking-agent.jar -Xloggc:/usr/local/tomcat/logs/hmp-batch-service-7cf4bc6868-vr82j/gc.log -XX:
+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGC -XX:
+PrintGCDateStamps -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=8 -
Xms4096M -Xmx4096M -XX:MaxMetaspaceSize=386m -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.
catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat
/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.
tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start
 44 62942 /opt/jdk-12/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:
+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -
Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.
netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.
jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-14584481903416358876 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -
XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.
locale.providers=COMPAT -XX:UseAVX=2 -Des.cgroups.hierarchy.override=/ -Djava.net.preferIPv4Stack=true -Xms512m -Xmx512m -Des.
path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=oss -Des.distribution.
type=docker -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
 36 37594 /opt/jdk-12/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:
+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -
Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.
netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.
jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-18371654322352460532 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -
XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.
locale.providers=COMPAT -XX:UseAVX=2 -Des.cgroups.hierarchy.override=/ -Djava.net.preferIPv4Stack=true -Xms1536m -Xmx1536m -Des.
path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=oss -Des.distribution.
type=docker -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
 33 14025 nginx: worker process
 33 14024 nginx: worker process
 33 14023 nginx: worker process
 33 14022 nginx: worker process
 33 14021 nginx: worker process
The process with pid 15128 is using 31,628 threads, more than 100 times the second-place process.
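With this many threads, `fork` starts returning EAGAIN once the kernel-wide thread count approaches the kernel limits, which matches the `fork/exec /usr/bin/systemd-run: resource temporarily unavailable` errors in the kubelet log. A quick comparison on the node (reading the limits straight from /proc):

```shell
# Compare the node's current thread count against the kernel limits.
# Thread creation fails with EAGAIN when these limits are hit.
echo "threads now : $(ps -eLf | tail -n +2 | wc -l)"
echo "pid_max     : $(cat /proc/sys/kernel/pid_max)"
echo "threads-max : $(cat /proc/sys/kernel/threads-max)"
```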

Run the same command on ksw3:

[root@bots-hrx-ksw3 ~]# printf "NUM\tPID\tCOMMAND\n" && ps -eLf | awk 
'{$1=null;$3=null;$4=null;$5=null;$6=null;$7=null;$8=null;$9=null;print}' | sort | uniq -c | sort -rn | head -10
NUM PID COMMAND
 31888 32516 /usr/lib/jvm/java-1.8-openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.
util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Duser.timezone=Asia/Shanghai -Dhippo.
log.home=/usr/local/tomcat/logs -XX:HeapDumpPath=/wls/heapdump/ -Xms4096M -Xmx4096M -XX:MetaspaceSize=256m -XX:
MaxMetaspaceSize=256m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Djdk.tls.ephemeralDHKeySize=2048 -
Djava.protocol.handler.pkgs=org.apache.catalina.webresources -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap 
-XX:MaxRAMFraction=2 -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 -Dorg.apache.catalina.security.SecurityListener.
UMASK=0027 -Xms4096M -Xmx4096M -Xmn384m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m -XX:+UseConcMarkSweepGC -
XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:
+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:-OmitStackTraceInFastThrow -Dfile.encoding=UTF-8 -XX:+PrintGC -XX:
+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/tomcat/logs/data-dsp-databus-758d79c7db-2nbxh/gc.log -Dspring.profiles.
active=standard -Duser.timezone=Asia/Shanghai -Dfile.encoding=UTF-8 -javaagent:/usr/local/tomcat/lib/jmx_prometheus_javaagent-0.13.0.
jar=1234:/usr/local/tomcat/conf/prometheus.yaml -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat
/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.
apache.catalina.startup.Bootstrap start
 153 10928 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.file=/usr/local/tomcat
/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dobs.bucket.public=hrx-bucket -Dobs.addr.
idc=http://obs-cn-beijing-internal.cloud.papub -Dobs.addr.outer=https://hrx2.obs-cn-beijing.pinganyun.com -Dobs.addr.inner=http://obscn-beijing-internal.cloud.papub -Dobs.accessKey=QTVEQTE3MTFGOUU3NERGNEI5QzlDMUNGNkVGODY4MzE -Dobs.
secretKey=QkMxQTFGMDM1NTBDNEEwOTk2RTkwMDVBNUM3QjkyQTM -Dspring.profiles.active=standard -Duser.timezone=Asia
/Shanghai -Dhippo.log.home=/usr/local/tomcat/logs -Dlog4j2.configurationFactory=com.pingan.hippo.config.log4j2.
HippoXMLConfigurationFactory -Dskywalking.agent.service_name=hbp-s3adapter -Dskywalking.collector.backend_service=hippo-oap.botshrx-app.svc.cluster.local:11800 -javaagent:/wls/agent/agent/skywalking-agent.jar -Xloggc:/usr/local/tomcat/logs/hbp-s3adapter-
5dfc68c957-4sgws/gc.log -Xmn384m -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:
+HeapDumpOnOutOfMemoryError -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:
SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=8 -XX:HeapDumpPath=/wls/heapdump/ -Xms4096M -Xmx4096M -XX:
MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Dfile.
encoding=UTF-8 -Duser.timezone=GMT+08 -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.
webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin
/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=
/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start
 113 21687 java -jar /wls/hippoboot/apps/tag-center-web.jar
 93 12677 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.file=/usr/local/tomcat
/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Dhippo.
log.home=/usr/local/tomcat/logs -Xms2048M -Xmx2048M -Xmn512m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m -
DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.enable.threadlocals=true -
DAsyncLoggerConfig.RingBufferSize=8192 -Dlog4j2.AsyncQueueFullPolicy=Discard -Dlog4j2.DiscardThreshold=ERROR -XX:
+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:-
OmitStackTraceInFastThrow -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/tomcat/logs/corehr-reportservice-7555c5c86d-g6zr6/gc.log -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -
Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr
/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat
/temp org.apache.catalina.startup.Bootstrap start
 92 10612 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.file=/usr/local/tomcat
/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dspring.profiles.active=standard -Duser.
timezone=Asia/Shanghai -Dhippo.log.home=/usr/local/tomcat/logs -XX:HeapDumpPath=/wls/heapdump/ -Xms2048M -Xmx2048M -XX:
MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Dfile.
encoding=UTF-8 -Duser.timezone=GMT+08 -DAsyncLoggerConfig.RingBufferSize=1024 -Dlog4j2.AsyncQueueFullPolicy=Discard -Dlog4j2.
configurationFactory=com.pingan.hippo.config.log4j2.HippoXMLConfigurationFactory -Dlog4j2.DiscardThreshold=ERROR -Dskywalking.
agent.service_name=hbp-admin-web -Dskywalking.collector.backend_service=hippo-oap.bots-hrx-app.svc.cluster.local:11800 -javaagent:
/wls/agent/agent/skywalking-agent.jar -Xloggc:/usr/local/tomcat/logs/hbp-admin-web-5b985459-zhm48/gc.log -XX:
+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGC -XX:
+PrintGCDateStamps -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=8 -XX:
MaxMetaspaceSize=386m -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.
apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local
/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp 
org.apache.catalina.startup.Bootstrap start
 40 15198 /data/jdk1.8.0_181/bin/java -Xmx4096m -XX:MaxNewSize=512m -jar /data/bin/newresume-0.0.1-SNAPSHOT.jar
 37 15202 /data/jdk1.8.0_181/bin/java -jar -Dspring.profiles.active=dev168 gw-router-lite-1.0.1.jar
 33 15200 /data/jdk1.8.0_181/bin/java -Xmx4096m -XX:MaxNewSize=512m -jar /data/bin/tika-server-1.24.1.jar
 31 7410 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
 27 8953 /home/weave/scope --mode=probe --probe-only --probe.processes=false --probe.kubernetes.role=host --probe.publish.
interval=4500ms --probe.spy.interval=3s --probe.docker.bridge=docker0 --probe.docker=true --weave=false weave-scope-app.weave:80
The process with pid 32516 is using 31,888 threads.

Finding the corresponding pod

The following was run in a test environment; assume pid 2448 is the process using the most threads.
[root@i-9kergf6v ~]# nsenter -n -t 2448 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 inet 127.0.0.1/8 scope host lo
 valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
 link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
 link/ether a2:71:4f:fb:45:27 brd ff:ff:ff:ff:ff:ff link-netnsid 0
 inet 10.233.86.25/32 brd 10.233.86.25 scope global eth0
 valid_lft forever preferred_lft forever
This shows the pod's IP is 10.233.86.25; look this IP up in the console's pod list to identify the corresponding pod.
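If console access is not handy, the pod can also be identified without nsenter: the kubelet encodes the pod UID in each container process's cgroup path. A hypothetical helper (pid 2448 as in the example above; it prints nothing for processes that do not belong to a pod):

```shell
# Map a node PID to the UID of the pod it runs in, via /proc/<pid>/cgroup.
# Typical cgroup-v1 paths look like .../kubepods/burstable/pod<uid>/<container-id>;
# some cgroup drivers write the UID with underscores instead of hyphens.
pod_uid_of_pid() {
    grep -o 'pod[0-9a-f_-]\{36\}' "/proc/$1/cgroup" 2>/dev/null | head -n 1 | sed 's/^pod//'
}

pod_uid_of_pid 2448
```

The UID (or the pod IP from the nsenter output) can then be matched against `kubectl get pods -A -o wide`.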
Fix
Set a kubelet per-pod PID limit so that a single pod cannot exhaust the node's PIDs.
Delete the offending pod so that it is redeployed.

livenessProbe/readinessProbe 健康检查最佳实践


KubeSphere itself provides a visual way to configure livenessProbe/readinessProbe. See: https://v3-1.docs.kubesphere.io/zh/docs/project-user-guide/application-workloads/container-image-settings/#%E5%81%A5%E5%BA%B7%E6%A3%80%E6%9F%A5%E5%99%A8

If you need the YAML form of the configuration, switch to edit mode in the upper-right corner to see the corresponding YAML.
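For reference, a minimal probe sketch in YAML. The container name, image, paths, port, and timings are placeholders to adapt, not values from this incident; the point is that a leaking container which can no longer fork will usually fail its health endpoint, so a livenessProbe restarts it before it exhausts the node's PID space:

```yaml
containers:
  - name: app                   # placeholder container
    image: example/app:latest   # placeholder image
    livenessProbe:              # restarts the container when it stops responding
      httpGet:
        path: /healthz          # placeholder endpoint
        port: 8080
      initialDelaySeconds: 60   # allow for slow JVM startup
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:             # removes the pod from Service endpoints, no restart
      httpGet:
        path: /ready            # placeholder endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
```

Keeping the readiness check separate from liveness avoids restarting a pod that is merely busy, while still taking it out of traffic.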
2021-11-02 operation record
Run the following on every worker node, then restart the business pods. Note that sysctl -w does not persist across reboots; to make the change permanent, also add kernel.pid_max = 65536 to /etc/sysctl.conf or a file under /etc/sysctl.d/.
sysctl -w kernel.pid_max=65536

# Edit /var/lib/kubelet/config.yaml and add a per-pod PID limit plus eviction thresholds:
vim /var/lib/kubelet/config.yaml
podPidsLimit: 10000
evictionHard:
  memory.available: 5%
  pid.available: 10%
# Then restart the kubelet and confirm it comes back healthy:
service kubelet status
service kubelet restart
service kubelet status
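After the restart, one way to confirm the limit is actually enforced is to read a pod's pids cgroup. The cgroup-v1 layout is assumed here, and the exact kubepods paths vary by QoS class and cgroup driver, so treat the glob as a starting point:

```shell
# Print pids.max for each pod-level cgroup found; after podPidsLimit takes
# effect these should show the configured limit instead of "max".
# Prints nothing if the paths differ on this node -- adjust the glob.
for f in /sys/fs/cgroup/pids/kubepods*/*/pod*/pids.max; do
    if [ -e "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done
```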
Commands for future inspections
# List the processes on a node that use the most threads
printf "NUM\tPID\tCOMMAND\n" && ps -eLf | awk '{$1=null;$3=null;$4=null;$5=null;$6=null;$7=null;$8=null;$9=null;print}' | sort | uniq -c | sort -rn | head -10

Reprinted from blog.csdn.net/qq_34556414/article/details/126420611