Troubleshooting an inode exhaustion incident on a production server

Copyright notice: This is an original article by the author; do not reproduce without permission. https://blog.csdn.net/liuxiao723846/article/details/82893644

This morning I arrived at the office to an alert that the / partition on a production server was full. I logged in and checked disk usage with df -h:

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda2              36G  36G   0G   100% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/vdb              296G  154M  281G   1% /data

Then I used du -h --max-depth=1 to find out which directory was actually eating the space:

# du -h --max-depth=1 /var/log
5.8M	/var/log/sa
212K	/var/log/tomcat6
4.0K	/var/log/httpd
4.0K	/var/log/cups
4.0K	/var/log/sssd
20K	/var/log/logstash
4.0K	/var/log/qemu-ga
28K	/var/log/prelink
8.0K	/var/log/samba
25M	/var/log/audit
8.0K	/var/log/ConsoleKit
4.0K	/var/log/ntpstats
23G	/var/log

Entering /var/log, I found that maillog had grown past 20 GB. At first I assumed this single file was the whole problem, so I emptied it with the command below (note: echo '' actually writes a newline; : > /var/log/maillog or truncate -s 0 would empty it cleanly). Checking again, disk usage had indeed dropped:

# echo '' > /var/log/maillog
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda2              36G  3.1G   31G  10% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/vdb              296G  985M  280G   1% /data

But the file soon started filling up again at a furious pace, so I looked at what was being written:

# tail -f /var/log/maillog
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[11290]: warning: mail_queue_enter: create file maildrop/98512.11290: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[28639]: warning: mail_queue_enter: create file maildrop/98551.28639: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[13193]: warning: mail_queue_enter: create file maildrop/98611.13193: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[6030]: warning: mail_queue_enter: create file maildrop/98512.6030: No space left on device

The "No space left on device" errors made it click: the inodes must be exhausted (the previous step only freed disk blocks, not inodes). Verify with df -i:

# df -ih
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/vda2               2.3M    2.3M      0M    100% /
tmpfs                   2.0M       1    2.0M    1% /dev/shm
/dev/vdb                 19M    1.5K     19M    1% /data
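The numbers df -i reports come from the statvfs() call, so the same check can be done programmatically. A minimal sketch (the function name and the choice of / as the path are mine, not from the original incident):

```python
import os

def inode_usage(path="/"):
    """Return (total, free, percent-used) inode counts for the
    filesystem containing `path`, via statvfs -- the same numbers
    that `df -i` prints."""
    st = os.statvfs(path)
    total = st.f_files    # total inodes on the filesystem
    free = st.f_favail    # inodes available to unprivileged users
    used_pct = 100.0 * (total - free) / total if total else 0.0
    return total, free, used_pct

total, free, pct = inode_usage("/")
print("inodes: %d total, %d free, %.0f%% used" % (total, free, pct))
```

When IUse% hits 100, file creation fails with ENOSPC even though df -h may show plenty of free blocks, which is exactly what postfix was hitting above.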

Sure enough! So what was eating all the inodes? Googling "create file maildrop/367284.14836: No space left on device" mostly pointed at postdrop exhausting resources. Following the suggestions found online, I checked:

ls /var/spool/postfix/maildrop/

That directory held only a dozen or so files, so postdrop clearly wasn't what filled the inodes. For the postdrop issue itself, see:

http://www.duyumi.net/442.html

https://hambut.com/2015/12/22/crontab-sendmail-postdrop-system-crash/

Next I had to work out which directory was holding all the inodes, walking down the tree level by level:

# for i in /var/spool/*; do echo $i; find $i | wc -l; done
/var/spool/abrt
580000000
/var/spool/abrt-upload
1
/var/spool/anacron
4
/var/spool/at
2
/var/spool/cron
2
/var/spool/cups
2
/var/spool/lpd
1
/var/spool/mail
4
/var/spool/plymouth
2
/var/spool/postfix
41

So it was /var/spool/abrt: that directory held so many files that the filesystem's inodes were exhausted.

Next, look at what those files contain:

# pwd
/var/spool/abrt
# ll
total 580000000
-rw------- 1 root root   37 Sep 29 10:07 last-via-server
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:44:04-12072
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:45:03-12071
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:47:31-15236
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:48:04-16068
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-09:49:05-16791
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-09:50:04-17222
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-10:04:22-23859
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-10:07:08-25148

# cd pyhook-2018-09-29-10:07:08-25148
# ll
total 48
-rw-r----- 1 abrt root   5 Sep 29 10:07 abrt_version
-rw-r----- 1 abrt root   6 Sep 29 10:07 analyzer
-rw-r----- 1 abrt root   6 Sep 29 10:07 architecture
-rw-r----- 1 abrt root 504 Sep 29 10:07 backtrace
-rw-r----- 1 abrt root  53 Sep 29 10:07 cmdline
-rw-r----- 1 abrt root  37 Sep 29 10:07 executable
-rw-r----- 1 abrt root  46 Sep 29 10:07 hostname
-rw-r----- 1 abrt root  34 Sep 29 10:07 kernel
-rw-r----- 1 abrt root  26 Sep 29 10:07 os_release
-rw-r----- 1 abrt root  58 Sep 29 10:07 reason
-rw-r----- 1 abrt root  10 Sep 29 10:07 time
-rw-r----- 1 abrt root   1 Sep 29 10:07 uid

# cat reason 
<string>:1:connect:error: [Errno 110] Connection timed out

# cat cmdline
/usr/bin/python /data/apps/scripts/count_nginx_respt.py

At this point everything was clear. A crontab entry kept running /usr/bin/python /data/apps/scripts/count_nginx_respt.py; every time the script failed, abrt's Python exception hook recorded a crash dump under /var/spool/abrt.

The fix, then, is to catch the error inside the script instead of letting the exception escape, so no crash report gets written.

Wrapping the failing statements in a try/except block is enough.
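The post doesn't show the script's actual code, so everything below is a hypothetical sketch of the pattern: catch the network error (the "Errno 110 Connection timed out" seen in the reason file) and log it, instead of letting the unhandled exception reach abrt's Python hook.

```python
import logging
import socket

logging.basicConfig(level=logging.WARNING)

def collect_and_report():
    # Hypothetical placeholder for the script's real work
    # (connect somewhere and push nginx response-time stats).
    # Here we simulate the failure observed in the abrt dump.
    raise socket.error(110, "Connection timed out")

def main():
    try:
        collect_and_report()
    except (socket.error, OSError) as e:
        # Handle the failure ourselves: log and move on, so the
        # unhandled exception never triggers an abrt crash report.
        logging.warning("collection failed, will retry next cron run: %s", e)

main()
```

With this in place, each failed cron run costs one log line instead of a dozen new files (and inodes) under /var/spool/abrt.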

Finally, a closer look at the /var/spool/abrt directory.

The reason this directory so often fills up is that all crash reports and backtraces, including those from kernel drivers, are written into subdirectories of /var/spool/abrt. To stop backtrace gathering permanently, stop these two services:

# service abrtd stop
# service abrt-oops stop
The accumulated dump directories can then be removed with abrt-cli:
# abrt-cli rm /var/spool/abrt/*

(If you would rather keep abrt running, its configuration on this era of CentOS, /etc/abrt/abrt.conf, has a MaxCrashReportsSize setting that caps how large the dump directory may grow; worth checking before disabling the service outright.)
