The online service went down, and it turned out the logs were to blame!

This article describes a real incident from our online environment. It happened during a major promotion and had a fairly large impact on our online cluster. What follows is a brief review of the issue, simplified to make it easier to follow; the actual investigation and resolution may not match this account exactly, but the overall approach is the same.

How the problem unfolded

During a major promotion, one of our online applications suddenly triggered a flood of alarms indicating that disk usage was too high, at one point exceeding 80%.

We logged in to the machine as quickly as possible and checked its disk usage with the df command:

$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120   93% /home/admin
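
Incidentally, the kind of threshold alarm we received can be approximated by a very simple disk check; the one-liner below is only a sketch (the 80% threshold comes from our alert, while the mount point and message format are made up):

# hypothetical check behind a "disk usage too high" alarm
$ df -P / | awk 'NR==2 { gsub("%","",$5); if ($5+0 >= 80) print "ALARM: disk usage "$5"% on "$6 }'
ALARM: disk usage 93% on /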

The machine's disk consumption was indeed serious. Since traffic was heavy during the promotion, our first suspicion was that too many logs were being written and filling up the disk.

Some background: our online machines are configured with automatic log compression and cleanup, which is triggered once a single file reaches a certain size or the machine's disk usage reaches a certain threshold.
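
As an illustration only, a size-triggered rotation of this kind could be expressed with a logrotate-style configuration like the one below (our in-house cleanup tool is not logrotate, and the paths and thresholds here are made up):

# /etc/logrotate.d/service -- hypothetical example, not our actual config
/home/admin/*/service.log {
    size 500M          # rotate once a single file exceeds 500 MB
    rotate 5           # keep at most 5 rotated files
    compress           # gzip rotated files to save disk space
    missingok
    notifempty
}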

On the day of the promotion, however, the cleanup was not triggered, and for a while the machine's disk was nearly exhausted.

Investigating further, we found that some of the application's log files were taking up a lot of disk space and were still growing:

du -sm * | sort -nr
512 service.log.20201105193331
256 service.log
428 service.log.20201106151311
286 service.log.20201107195011
356 service.log.20201108155838

du -sm * | sort -nr: report the size of each file in the current directory (in MB) and sort the results from largest to smallest

After discussing with the operations team, we decided to handle it as an emergency.

The first measure was to clean up log files manually. The operations engineers logged on to the server and removed some of the less important log files:

rm service.log.20201105193331

However, after the cleanup command was executed, the machine's disk usage did not decrease; in fact, it kept climbing.

$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120  93% /home/admin

So we started to investigate why the disk space was not released after the logs were deleted. Using the following command, we found that a process was still holding the deleted file open:

lsof |grep deleted
SLS   11526  root   3r   REG   253,0 2665433605  104181296 /home/admin/****/service.log.20201105193331 (deleted)

lsof | grep deleted: list all open files and filter for those that have been deleted but are still held open by a process

It turned out that the process was an SLS agent, continuously reading log content from the machine.

SLS is Alibaba's log service, providing one-stop data collection, cleansing, analysis, visualization, and alerting. Put simply, it collects the logs on the server, persists them, and makes them available for query and analysis.

All of our online logs are collected through SLS, so the fact that the disk space was not being released was clearly related to SLS reading the log files.

At this point, the problem had basically been located. Let's pause the story here and go over the background knowledge behind it.

Background knowledge

In Linux, file deletion is controlled by link counts: a file is only actually deleted when it has no remaining links.

Generally speaking, each file has two counters, i_count and i_nlink. In other words, on Linux a file is only truly deleted when both i_nlink and i_count reach 0.

  • i_count is the number of current users (callers) of the file, i.e. processes that have it open;
  • i_nlink is the number of links on the storage medium (the number of hard links);
  • You can think of i_count as an in-memory reference count and i_nlink as an on-disk reference count.

When a file is opened by a process, its i_count increases; when a hard link to the file is created, its i_nlink increases.
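
The i_nlink side is easy to observe with stat, which prints a file's hard-link count (the file names below are throwaway examples):

$ touch demo.log
$ stat -c '%h' demo.log        # hard-link count (i_nlink) starts at 1
1
$ ln demo.log demo.log.hard    # creating a hard link...
$ stat -c '%h' demo.log        # ...raises the count to 2
2
$ rm demo.log.hard             # removing the extra link drops it back to 1
$ stat -c '%h' demo.log
1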

On Linux or Unix systems, deleting a file with rm or a file manager merely unlinks it from the directory structure of the file system; that is, it decrements the on-disk reference count i_nlink, but it does not touch i_count.

If a file is in use by a process and a user "deletes" it with rm, the file can no longer be found with commands such as ls, but that does not mean it has actually been removed from the disk.

Because a running process is still reading from or writing to the file, the file is not really "deleted", and its disk space remains occupied.
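
This behavior is easy to reproduce on any Linux machine; the sketch below uses a throwaway file and tail -f as the "process that still holds the file" (all names and sizes are illustrative):

$ dd if=/dev/zero of=./big.log bs=1M count=1024   # create a 1 GB dummy "log"
$ df -h .                                         # note the used space
$ tail -f ./big.log &                             # a background process keeps it open
$ rm ./big.log                                    # "delete" it: ls no longer shows it
$ df -h .                                         # used space has NOT gone down
$ lsof | grep deleted                             # the file shows up as (deleted)
$ kill %1                                         # stop the reader process
$ df -h .                                         # now the space is released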

This is exactly what happened in our online problem: because a process was still operating on the log file, rm did not actually delete it, and the disk space was not released.

Solving the problem

With the symptoms and the background knowledge above understood, the way to solve the problem becomes clear.

We needed to make the SLS process release its reference to the log file, so that the file could truly be deleted and the disk space truly freed:

kill -9 11526
$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 50331648 12582912  80% /
/dev/sda2       62914560 50331648 12582912  80% /home/admin

A special reminder: before running kill -9, be sure to consider the consequences. For the principles behind this, see my earlier article "After I ran kill -9 on the server, I was told not to come in the next day!"
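
As an aside, if killing the process is not acceptable, a gentler option (a general Linux technique, not what we used here) is to truncate the deleted file through the process's file descriptor in /proc, which frees the blocks while the process keeps running. Using the pid and fd number from the lsof output above (11526 and 3):

$ ls -l /proc/11526/fd | grep deleted   # confirm which fd points at the deleted file
$ : > /proc/11526/fd/3                  # truncate it in place; its disk blocks are freed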

Afterwards, in the post-mortem, we found that there were two main causes of this problem:

  • 1. The application printed too many logs, too frequently.
  • 2. SLS pulled the logs too slowly.

Deeper analysis showed that the application printed a large volume of process logs. The logging had originally been added to make online troubleshooting easier and to support data analysis, but the log volume surged during the promotion, causing disk usage to grow rapidly.

In addition, because this application shared an SLS project with several other large applications, the SLS pull speed was slowed down, so the agent could not finish reading the files and release them.

We also drew up some improvement items. For the second cause, we split this application's SLS configuration out and now manage it independently.

For the first cause, we introduced a log-degradation policy for major promotions: once the disk fills up, logging is downgraded.

For log degradation, I developed a general-purpose tool that pushes log levels dynamically and changes the online log output level through configuration. That configuration change is hooked into our contingency-plan platform, so scheduled or emergency plans can be executed during major promotions to avoid this problem.

The design ideas and code for this log-degradation tool will be shared in the next article.

Reflections

After every major promotion, we find that most incidents are caused by the accumulation of a few inconspicuous small problems.

Analyzing such problems often requires knowledge beyond pure development skills, such as operating systems, computer networks, databases, and even hardware.

That is why I have always believed that what distinguishes a good programmer is their problem-solving ability!

Source: blog.csdn.net/hollis_chuang/article/details/110792243