Configuring OOM monitoring and alerting on CentOS

Poorly designed programs or sudden bursts of concurrent traffic can easily trigger OOM (out of memory), meaning OOM at the operating-system level. What OOM is and how it happens are not covered here, because the author considers that basic knowledge for IT practitioners. This article records how OOM events are monitored in a production environment, so that they are noticed promptly and can be reviewed afterwards.
Before building this monitoring, the author searched around expecting to find one or two mature open-source tools for the job, but found none. So the author wrote a somewhat rough program to serve the purpose.
Implementation ideas:

    Apr 18 12:11:25 php001 kernel: Out of memory: Kill process 13546 (php-fpm) score 31 or sacrifice child

Every time the system triggers OOM, the kernel writes the state of the system at that moment, together with the pid and score of the killed process, to /var/log/messages, as in the line above. The scripts below are built around that file.
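To make the structure of that log line concrete, here is a minimal Python sketch that pulls the time, host, pid, process name and score out of it. The function name `parse_oom_line` and the regular expression are illustrative assumptions, not part of the original scripts, and the pattern only targets the wording shown above; other kernel versions phrase the message differently.

```python
import re

# Assumption: the line has the form
# "Mon DD HH:MM:SS host kernel: Out of memory: Kill process PID (NAME) score N ..."
OOM_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:.*?"
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def parse_oom_line(line):
    """Return a dict with time, host, pid, name and score, or None if no match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["pid"] = int(info["pid"])
    info["score"] = int(info["score"])
    return info


sample = ("Apr 18 12:11:25 php001 kernel: Out of memory: "
          "Kill process 13546 (php-fpm) score 31 or sacrifice child")
print(parse_oom_line(sample))
```

The shell script in this article gets the same values by field position with awk, which is shorter but depends on the message keeping exactly this word order.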

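1. The directory structure is as follows:
oom_monitor lives on shared NFS storage. Every production machine mounts this NFS share, so the directory is present on all of them.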

[[email protected]:/mnt/alinas]$ tree -L 1 oom_monitor/
oom_monitor/
├── bin                # executable scripts
└── log                # log files

[[email protected]:/mnt/alinas]$ tree -L 1 oom_monitor/bin/
oom_monitor/bin/
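├── oom_check.sh       # greps "Out of memory" from /var/log/messages and writes a report file to the log directory for the alert step
├── oom_dingding.py    # sends the OOM details to a DingTalk group
├── oom_mail.py        # sends the OOM details by email
└── oom_send.sh        # triggers the DingTalk and email alerts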

2. The scripts are as follows:

[[email protected]:/mnt/nfs/oom_monitor/bin]$ cat oom_check.sh
#!/bin/sh
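# Get the host name
host_name=$(hostname)

# Where the OOM report for this host is stored
oom_scrape_file=/mnt/nfs/oom_monitor/log/oom_scrape_$host_name

# Where the message of the previous OOM is kept; used for deduplication so the same event is not reported twice
old_oom_scrape_file=/mnt/nfs/oom_monitor/log/old_oom_scrape_$host_name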

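# Get the most recent OOM message
msg=$(sudo grep -i "out of memory" /var/log/messages | tail -n 1)

# Get the time of the event (third field of the syslog line, HH:MM:SS)
time=$(echo "$msg" | awk '{print $3}')

# Get the killed program (java/php/mysql/...); field 12 is the name in parentheses
killed=$(echo "$msg" | awk '{print $12}')

# Get the message of the previous OOM (empty on the first run, when the file does not exist yet)
old_msg=$(cat "$old_oom_scrape_file" 2>/dev/null)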

# If the two messages are identical, stop here so the same event is not reported twice
if [ "$msg" = "$old_msg" ];then
	exit 1
else
	# Otherwise overwrite the previous message with the current one
	[ -n "$msg" ] && echo "$msg" > "$old_oom_scrape_file"
fi

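# Persist this event to the report file
# The labels mean: time of event, log message, killed program, host.
# oom_send.sh greps for these exact labels, so they are kept as written.
if [ -n "$msg" ];then
	echo > "$oom_scrape_file"
	echo "发生时间: $time" >> "$oom_scrape_file"
	echo "日志信息: $msg" >> "$oom_scrape_file"
	echo "被杀程序: $killed" >> "$oom_scrape_file"
	echo "发生主机: $host_name" >> "$oom_scrape_file"
fi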
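The deduplication step in oom_check.sh can be sketched in Python as follows. This is only an illustration of the compare-then-overwrite idea under the assumption that one state file is kept per host; `record_if_new` and the demo file name are hypothetical and do not exist in the original scripts.

```python
import os
import tempfile


def record_if_new(msg, state_path):
    """Store msg and return True only if it differs from the last stored message."""
    if not msg:
        return False  # grep found nothing, so there is nothing to report
    previous = ""
    if os.path.exists(state_path):
        with open(state_path, encoding="utf8") as f:
            previous = f.read().rstrip("\n")
    if msg == previous:
        return False  # same event as the last run, do not alert again
    with open(state_path, "w", encoding="utf8") as f:
        f.write(msg + "\n")
    return True


# Usage: a temporary directory stands in for the NFS log directory
state_dir = tempfile.mkdtemp()
state_file = os.path.join(state_dir, "old_oom_scrape_demo")
line = ("Apr 18 12:11:25 php001 kernel: Out of memory: "
        "Kill process 13546 (php-fpm) score 31 or sacrifice child")
print(record_if_new(line, state_file))  # True
print(record_if_new(line, state_file))  # False
```

Because only the last line is compared, two different OOM events between two runs would produce a single alert for the later one; that limitation applies to the shell script as well.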


[[email protected]:/mnt/nfs/oom_monitor/bin]$ cat oom_send.sh
#!/bin/sh

# Locations of the OOM report files and of the alert scripts
# A * is used because there are many machines and each one writes its own file
oom_scrape_file=/mnt/nfs/oom_monitor/log/oom_scrape*
oom_warn_script=/mnt/nfs/oom_monitor/bin/oom_mail.py
oom_dingding_script=/mnt/nfs/oom_monitor/bin/oom_dingding.py
History_file=/mnt/nfs/oom_monitor/log/history_oom_scrape
basedir=/mnt/nfs/oom_monitor/log/
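cd "$basedir" || exit 1
for file in $oom_scrape_file
do
	# Skip the unexpanded pattern when no report file exists
	[ -e "$file" ] || continue

	# The labels are the ones written by oom_check.sh: host, time of event, killed program
	hostname=$(grep "发生主机" "$file" | awk '{print $2}')
	warn_time=$(grep "发生时间" "$file" | awk '{print $2}')
	killed=$(grep "被杀程序" "$file" | awk '{print $2}')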

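	# Send the email alert with the full report
	/usr/local/bin/python3 $oom_warn_script "$file"

	# Send the DingTalk alert
	/usr/local/bin/python3 $oom_dingding_script "$hostname" "$killed" "$warn_time"

	# After the alerts are sent, append the report to the history file, then move the report out of the log directory so it is not sent again
	cat "$file" >> $History_file  && mv "$file" /tmp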
done
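The grep and awk lookups in oom_send.sh read the labelled lines that oom_check.sh wrote. As a rough Python equivalent, the report can be parsed into a dictionary in one pass. `parse_report` is a hypothetical helper, and the sample text simply mirrors the report format produced above.

```python
def parse_report(text):
    """Parse 'label: value' lines of an OOM report into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so colons inside the log message survive
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


report = (
    "\n"
    "发生时间: 12:11:25\n"
    "日志信息: Apr 18 12:11:25 php001 kernel: Out of memory: "
    "Kill process 13546 (php-fpm) score 31 or sacrifice child\n"
    "被杀程序: (php-fpm)\n"
    "发生主机: php001\n"
)
fields = parse_report(report)
print(fields["发生主机"], fields["被杀程序"], fields["发生时间"])
```

Unlike `awk '{print $2}'`, this keeps values that contain spaces intact, which matters for the full log message.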


[[email protected]:/mnt/nfs/oom_monitor/bin]$ cat oom_mail.py
#!/usr/bin/env python3
'''
When the system reports an OOM, this script sends an email alert in the following format:

Your host "hostname" hit an OOM. Details:
    the message from /var/log/messages

'''
import os
import smtplib
from email.mime.text import MIMEText
from email.header import Header
import time
import sys
def send_email(file_name):
    try:

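        # Read the OOM report file and use it as the email body
        with open(file_name, 'r', encoding='utf8') as f:
            mail_body = f.read()

        # Sender address
        send_addr = 'replace with the sender mailbox user name'

        # Recipient addresses
        reciver_addr = ['email address of recipient a','email address of recipient b',]

        # SMTP server of the sending mailbox; Alibaba Cloud mail is used here
        mail_server = 'smtp.mxhichina.com'
        now = time.strftime("%Y-%m-%d %H:%M:%S")

        # Email subject
        subject = '[OOM alert triggered] ' + now

        # User name and password of the sending mailbox
        username = 'replace with the sender mailbox user name'
        password = 'replace with the sender mailbox password'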

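        # Email body and headers; the report is plain text, so 'plain' keeps its line breaks
        message = MIMEText(mail_body, 'plain', 'utf8')
        message['From'] = send_addr
        message['To'] = ','.join(reciver_addr)
        message['Subject'] = Header(subject, charset='utf8')

        # Send the email over SMTP
        smtp = smtplib.SMTP()

        # Mind the port: port 25 is usually blocked on servers, so port 80 is used here.
        # Port 465 also works, but it is an SSL port and needs smtplib.SMTP_SSL instead of smtplib.SMTP.
        smtp.connect(mail_server,80)
        smtp.login(username, password)
        smtp.sendmail(send_addr, message['To'].split(','), message.as_string())
        smtp.quit()
        print("Email sent successfully!")
    except Exception as e:
        # Print the reason so a failed alert can be diagnosed
        print("Failed to send email:", e)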

send_email(file_name=sys.argv[-1])
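For the port 465 variant mentioned in the script's comments, a minimal sketch could look like the following. It assumes the mail server accepts implicit SSL on port 465; `build_message` and `send_over_ssl` are hypothetical names, and the addresses are placeholders. Only the message construction is exercised here, since sending needs real credentials.

```python
import smtplib
from email.header import Header
from email.mime.text import MIMEText


def build_message(body, sender, receivers, subject):
    """Build a plain-text UTF-8 email so the report keeps its line breaks."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = ",".join(receivers)
    msg["Subject"] = Header(subject, charset="utf-8")
    return msg


def send_over_ssl(msg, receivers, server, username, password, port=465):
    """Send msg through an implicit-SSL SMTP port (assumed to be 465)."""
    with smtplib.SMTP_SSL(server, port, timeout=10) as smtp:
        smtp.login(username, password)
        smtp.sendmail(username, receivers, msg.as_string())


# Placeholder addresses, for illustration only
receivers = ["a@example.com", "b@example.com"]
msg = build_message("line one\nline two", "sender@example.com",
                    receivers, "[OOM alert triggered] demo")
print(msg["To"])
print(msg.get_content_type())
```

Separating message construction from sending also makes it possible to check the headers and body without touching the network.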



[[email protected]:/mnt/nfs/oom_monitor/bin]$ cat oom_dingding.py
#!/usr/bin/env python3
import json
import requests
import datetime
import sys
def sendmessage(hostname,killed,warn_time):
    now_time=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    # Replace with the webhook of your own DingTalk group
    url = 'https://oapi.dingtalk.com/robot/send?access_token=917b00b5faee51bcb67e862c13b0a0ff605f0f74f4f692c9a70fe32351'

    HEADERS = {
        "Content-Type": "application/json;charset=utf-8"
    }
    message='''
[OOM alert triggered]:
    Host: %s
    Killed process: %s
    Time: %s
    ''' %(hostname,killed,warn_time)
    String_textMsg = {
        "msgtype": "text",
        "text": {"content": message},
        "at": {
            "atMobiles": [
                "110120119"  # to @ someone, put their mobile number here
            ],
            "isAtAll": 0  # set to 1 to @ everyone
        }
    }
    String_textMsg = json.dumps(String_textMsg)
    res = requests.post(url, data=String_textMsg, headers=HEADERS)
    print(res.text)
sendmessage(sys.argv[-3],sys.argv[-2],sys.argv[-1])
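The request body that the script posts to the webhook can be built and inspected separately from the network call. The sketch below uses only the standard library instead of requests; `build_payload` and `post_payload` are hypothetical helpers, and the English message text is an assumption about how the alert might read.

```python
import json
import urllib.request


def build_payload(hostname, killed, warn_time, at_mobiles=(), at_all=False):
    """Build the JSON body of a DingTalk text message."""
    content = (
        "[OOM alert triggered]\n"
        f"    Host: {hostname}\n"
        f"    Killed process: {killed}\n"
        f"    Time: {warn_time}\n"
    )
    return {
        "msgtype": "text",
        "text": {"content": content},
        "at": {"atMobiles": list(at_mobiles), "isAtAll": at_all},
    }


def post_payload(url, payload, timeout=5):
    """POST the payload as JSON and return the response text (not called here)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data,
        headers={"Content-Type": "application/json;charset=utf-8"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


payload = build_payload("php001", "(php-fpm)", "12:11:25")
print(json.dumps(payload, indent=2))
```

Passing a timeout keeps a slow or unreachable webhook from blocking the loop in oom_send.sh.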

3. Results:

Origin www.cnblogs.com/chaizhenhua/p/12724710.html