How to find the pod that corresponds to an OverlayFS snapshot directory

1. Problem background

Sometimes the disk usage of the docker or containerd directory (mounted at /home/deployer/containerd in this example) soars during a certain period of time, triggers an alarm, and then falls back on its own. Because the time window is fairly fixed, the first suspicion is a scheduled task set by some service or on the host. If the investigation finds no scheduled task in that window, we have to track down the specific directory on the disk that causes the problem ourselves.

2. Cause of the problem

A monitoring script showed that the rise and fall in disk usage came from changes in one numbered directory under /home/deployer/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/. This is where the OverlayFS snapshots are stored. Each numbered directory is one snapshot, and a snapshot that is actively written to belongs to a container of a pod on the host.

OverlayFS is a union file system (union mount). It lets you mount a file system from two directories: the "lower" directory (the read-only layer) and the "upper" directory (the writable layer).

Basically:

The lower directory of the file system is read-only.

The upper directory of the file system can be read and written.

When a process "reads" a file, the OverlayFS file system driver looks in the upper directory and reads the file from that directory, if it exists. Otherwise, it will look in the lower directory.

When a process "writes" a file, OverlayFS writes it to the upper directory, which is the writable layer.
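
The read rule described above can be illustrated with a small shell sketch. This is only a toy model built on two ordinary directories, not how the kernel driver works: it ignores whiteouts (deleted files) and merged directories, and the function name resolve_read is made up for this example.

```shell
#!/bin/bash
# Toy model of the OverlayFS read rule: the upper directory wins,
# otherwise fall back to the lower directory.
# resolve_read is a hypothetical helper, not part of any real tool.
resolve_read() {
    local upper="$1" lower="$2" name="$3"
    if [ -e "$upper/$name" ]; then
        echo "upper"      # file was written or modified in the writable layer
    elif [ -e "$lower/$name" ]; then
        echo "lower"      # untouched file, served from the read-only layer
    else
        echo "missing"
        return 1
    fi
}

# A real overlay mount (needs root) has this general shape:
#   mount -t overlay overlay \
#       -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
```

Because every write lands in the upper directory, a container that keeps writing files makes its own upper directory grow, which is why a single snapshot directory can account for the whole disk usage spike.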

For details, see "Docker Principle: OverlayFS Design and Implementation" (Tencent Cloud Developer Community).

3. Solution

3.1 Script to capture the growing directory

Adjust the script to your own environment:

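#!/bin/bash
directory="/home/deployer"  # Replace with the directory you want to monitor
snapshots="$directory/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"

while true; do
    current_time=$(date +"%Y-%m-%d %H:%M:%S")
    echo "Current time: $current_time"
    echo "$directory:"
    # GNU du rejects -s together with --max-depth=1, so -s is dropped here
    du -h --max-depth=1 "$directory"  # Disk usage of each subdirectory
    echo "$directory/containerd:"
    du -h --max-depth=1 "$directory/containerd" | sort -hr | head  # Ten largest entries
    echo "$snapshots:"
    du -h --max-depth=1 "$snapshots" | sort -hr | head  # Ten largest numbered snapshot directories

    sleep 5  # Wait 5 seconds
done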

nohup {script} &           # Run in the background; output is written to the file nohup.out in the current directory

The nohup.out log shows that one numbered directory under the snapshots directory has a large change in disk usage in the early morning. This confirms that the alarm is caused by changes in that directory.
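
Reading nohup.out by eye can be tedious, so the comparison can also be sketched as a script. The helper below is only an illustration under assumptions: largest_growth is a made-up name, and it expects two files that each hold the output of `du -k --max-depth=1` for the snapshots directory (size in KiB, a tab, then the path), taken before and during the spike.

```shell
#!/bin/bash
# largest_growth is a hypothetical helper: given two files with
# `du -k --max-depth=1 <dir>` output, print the paths whose size grew,
# largest growth (in KiB) first.
largest_growth() {
    local before="$1" after="$2"
    awk -F'\t' '
        FILENAME == ARGV[1] { old[$2] = $1; next }    # first file: remember old sizes
        {
            # a path that did not exist before counts with its full size
            delta = $1 - (($2 in old) ? old[$2] : 0)
            if (delta > 0) printf "%d\t%s\n", delta, $2
        }
    ' "$before" "$after" | sort -rn
}

# Possible usage (paths taken from this article's environment):
#   snap=/home/deployer/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
#   du -k --max-depth=1 "$snap" > before.txt    # before the spike
#   du -k --max-depth=1 "$snap" > after.txt     # during the spike
#   largest_growth before.txt after.txt | head
```

The last line of each du sample is the total for the snapshots directory itself, so it will also show up as a growing entry. The numbered directory listed right after it is the one to investigate.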

3.2 Confirm podID

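Take the numeric directory name that the script captured under /home/deployer/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/ and look it up in the mount table:

mount | grep containerd.snapshotter.v1.overlayfs | grep {directory name}

The container ID appears in the matching mount entry, typically as part of the mount point path (it was highlighted in a figure in the original post).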
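
The lookup can be sketched as a small helper that reads mount output and prints the container ID. It rests on an assumption about the mount entry layout that should be checked on your host: the mount point looks like /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/rootfs and the snapshot number appears in the upperdir option. The function name container_id_for_snapshot is made up for this example.

```shell
#!/bin/bash
# container_id_for_snapshot is a hypothetical helper. It reads `mount`
# output on stdin and prints the container ID whose writable layer
# (upperdir) is the given snapshot number.
# Assumed entry layout (verify on your host):
#   overlay on /run/containerd/.../k8s.io/<container-id>/rootfs type overlay
#       (...,upperdir=<root>/snapshots/<N>/fs,workdir=<root>/snapshots/<N>/work)
container_id_for_snapshot() {
    local snap="$1"
    grep "upperdir=[^,]*/snapshots/${snap}/fs" \
        | awk '{ print $3 }' \
        | sed -n 's#.*/k8s\.io/\([^/]*\)/rootfs$#\1#p'
}

# Possible usage:
#   mount | container_id_for_snapshot {directory name}
```

Matching on upperdir rather than on the bare number avoids false hits: the same number can also appear in the lowerdir list of other containers or as a substring of a longer number.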

3.3 Confirm the service container

ctr -n k8s.io c list | grep [podID]          # Lists the ID and image of every container on this host; [podID] is the ID found in 3.2
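
If the image name alone does not identify the service, the pod name can usually be read from the container's labels. The sketch below assumes that `ctr -n k8s.io containers info <id>` prints JSON containing the labels io.kubernetes.pod.name and io.kubernetes.pod.namespace, which is how Kubernetes-managed containers are normally labeled. The helper name pod_from_labels is made up, and the sed-based parsing is deliberately simple rather than a full JSON parser.

```shell
#!/bin/bash
# pod_from_labels is a hypothetical helper. It reads container info JSON
# on stdin and prints "<namespace>/<pod name>" taken from the labels.
# Assumption: the JSON contains "io.kubernetes.pod.name" and
# "io.kubernetes.pod.namespace" entries.
pod_from_labels() {
    local json ns pod
    json=$(cat)
    ns=$(printf '%s\n' "$json" \
        | sed -n 's/.*"io\.kubernetes\.pod\.namespace": *"\([^"]*\)".*/\1/p' | head -n1)
    pod=$(printf '%s\n' "$json" \
        | sed -n 's/.*"io\.kubernetes\.pod\.name": *"\([^"]*\)".*/\1/p' | head -n1)
    [ -n "$pod" ] || return 1      # no pod label found
    echo "${ns}/${pod}"
}

# Possible usage:
#   ctr -n k8s.io containers info {container ID} | pod_from_labels
```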

Once the corresponding service is identified, we can analyze the specific cause. For example, check whether the service's log files grow too large before they are rotated, and tune the log rotation parameters accordingly.

Origin blog.csdn.net/zfw_666666/article/details/133794432