Notes on a troubleshooting experience that left me shaken

Because of the pneumonia outbreak in Wuhan, hospitals have suddenly become far more important. Although this hospital no longer has a maintenance contract with us, the business relationship remains, so the company arranged for us to keep supporting the hospital with a daily inspection, starting from the 1st.

On February 7, 2020, I checked the messages log on the database server. The colleague who did the original implementation had set up log rotation, and the server log contained a notice that it had been rotated.

At the time I did not understand the rotation mechanism. I assumed the messages-20201119 file was simply what a normal rotation leaves behind.

So during inspections I only looked at the messages file and the messages-20201119 file.

Only today, February 7, did I decide to find out how the rotation mechanism actually works. Checking the configuration file, I found that the rotation period is seven days, and that each rotation should create a new file whose name ends with the date of the rotation.
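To see which rotation policy is in force, the relevant directives can be pulled out of the config file. This is a minimal sketch, assuming the standard /etc/logrotate.conf location; show_rotation_policy is a name I made up for illustration and is not part of logrotate.

```shell
# Print the rotation-related directives from a logrotate config file.
# show_rotation_policy is a hypothetical helper, not part of logrotate.
show_rotation_policy() {
    conf="${1:-/etc/logrotate.conf}"
    # Keep only the directives that decide how often files are rotated,
    # how many old copies are kept, and whether a date suffix is used.
    grep -E '^[[:space:]]*(daily|weekly|monthly|yearly|rotate[[:space:]]+[0-9]+|dateext)' "$conf"
}
```

Running show_rotation_policy /etc/logrotate.conf on a server with a seven-day period would be expected to show a weekly line, and a dateext line if rotated files carry a date suffix.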

That is how I discovered that all logs from the 19th onward had been lost.

I debugged /etc/logrotate.conf with logrotate -d -v.

The debug run looked normal: it reported that it would create a messages-20200202 log file and delete the messages-20201119 file.

But when I removed -d to leave debug mode and ran logrotate for real, it still did not create a new log file.

I suspected that a disk problem was making the write or the delete fail, so I ran df -h.

It hung.
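A hang like this is typical when a mount point is backed by an unreachable server, because df touches every mounted filesystem. A rough sketch of probing one mount point at a time under a time limit, assuming GNU coreutils timeout is installed; probe_mount and the 5-second limit are my own illustrative choices. On some systems a process blocked on a dead NFS mount may not respond even to SIGKILL, so treat this as a best effort.

```shell
# Probe a path without letting an unresponsive filesystem hang the shell.
# probe_mount is a hypothetical helper; the 5-second limit is arbitrary.
probe_mount() {
    # timeout(1) sends SIGKILL to stat if it has not returned within 5 seconds.
    if timeout -s KILL 5 stat "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        # stat failed or was killed: the path is missing or not answering.
        echo "unresponsive or missing: $1"
        return 1
    fi
}
```

Calling probe_mount /back before running df -h would have pointed at the problem directory instead of freezing the terminal.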

Good grief, what was going on? Had a disk died?

That could not be right either. The business had been running normally from the 19th until now, and it would not have kept running if a disk had died.

I called the colleague who had done the implementation. He said that storage had been insufficient at the time, so an NFS volume was attached to hold the RMAN backups.

Checking /etc/fstab, the entry was indeed there, and the NFS-backed /back directory was inaccessible.

Yet showmount -e 10.20.10.17

could list the directories exported by the NFS server.

Only a manual mount -t nfs 10.20.10.17:/back /back succeeded.

However, mounting with the options originally set in /etc/fstab (rw, bg, wsize, rsize, timeo and others) failed.

Then it occurred to me to check the scheduled backup jobs with crontab -l. Sure enough, there was an RMAN backup.

ps aux | grep rman | wc -l showed 65 backup processes hung.

I killed the RMAN processes one by one.
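One caveat about the count: ps aux | grep rman | wc -l usually includes the grep process itself, so the real number of hung backups may be one lower. A small sketch of a stricter count that works on ps output fed through stdin; count_rman is a hypothetical helper name.

```shell
# Count rman processes from `ps` output supplied on stdin,
# skipping the grep/awk lines that a plain `ps aux | grep rman` would include.
# count_rman is a hypothetical helper name.
count_rman() {
    awk '/rman/ && !/grep|awk/ { n++ } END { print n + 0 }'
}
```

Usage would be ps aux | count_rman. For the cleanup itself, something like pkill -x rman could replace killing the processes one at a time, assuming the process name is exactly rman; I have not verified that on this system.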

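Suddenly I felt a chill. From January 22 until today, February 7, more than ten days, RMAN had been failing the whole time. Had the archived log files gone undeleted all that while? Once they filled the space, the database would refuse service.

I hurried into SQL and ran: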

select * from v$flash_recovery_area_usage

76 GB of the 100 GB had already been used. One more week and it would have overflowed.
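A daily inspection could turn these two numbers into an explicit warning instead of relying on someone noticing. A minimal sketch, assuming the used and total sizes (in GB) have already been read from the query result; the function names and the default threshold of 70 percent are my own illustrative choices, not Oracle settings.

```shell
# Percentage of the recovery area in use, given used and total sizes in GB.
fra_percent_used() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%d\n", used * 100 / total }'
}

# Warn when usage reaches a threshold (third argument, default 70 percent).
# check_fra and the default threshold are hypothetical, chosen for illustration.
check_fra() {
    pct=$(fra_percent_used "$1" "$2")
    if [ "$pct" -ge "${3:-70}" ]; then
        echo "WARN: recovery area ${pct}% full"
        return 1
    fi
    echo "OK: recovery area ${pct}% full"
}
```

With the figures from this incident, check_fra 76 100 prints a warning, which is the signal that was missing for more than ten days.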

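At a time like this, if the hospital's services had gone down, I would probably have been chewed out badly. It really is frightening in hindsight.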

Looking back, I made the following mistakes.

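1) On the 22nd the hospital restarted the rac2 server, which produced a large volume of messages in the log file. So on the first day of inspection I skipped all of the logs from the 22nd, including that day's last entry, an NFS failure alert.

And because of a mental blind spot, I assumed that logs from before the 1st, when I started the inspections, had nothing to do with me, so seven days of inspections never caught this.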

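2) I carried out inspections strictly by the checklist a departed colleague had sent me. It did not mention the RMAN backup on rac2, only the expdp backup provided by a separate standalone server, so I never checked the server's scheduled tasks and overlooked the hidden risk of the archived logs.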
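One way to close the second gap is to make the inspection list backup-related cron entries on every server, whatever the handover document says. A rough sketch that filters crontab -l output from stdin; list_backup_jobs and the keyword list are my own assumptions and would need adjusting per site.

```shell
# Show cron entries that look backup-related, reading `crontab -l` output on stdin.
# list_backup_jobs and its keyword list are hypothetical, for illustration only.
list_backup_jobs() {
    # Drop comment lines first, then keep lines mentioning common backup tools.
    grep -v '^[[:space:]]*#' | grep -Ei 'rman|expdp|backup'
}
```

Usage would be crontab -l | list_backup_jobs, run as each account that owns scheduled jobs.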

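I hope this article serves as a reminder to myself: operations work looks simple, but without care and a sense of responsibility the risk is high, especially when taking over a project someone else left behind.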

As for the mount failing once the options are added, it is too late today; I will sort it out tomorrow.


Origin www.cnblogs.com/ggykx/p/12275286.html