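Monitoring GPU usage on a server with a Python script

Preface

On a shared server, the GPUs you need are sometimes occupied by other users. This script monitors the free memory on each GPU and sends an email notification when it finds the cards idle.

Code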

import os
import time
import smtplib
from email.mime.text import MIMEText
from email.header import Header

import pynvml

pynvml.nvmlInit()  # initialize NVML once before querying any device
 
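def send_msg(target_email,msg):
  sender = '[email protected]'
  receivers = [target_email]  # recipient address, e.g. your QQ mailbox or any other mailbox

  # Three arguments: the text content, 'plain' for the text format, 'utf-8' for the encoding
  message = MIMEText(msg, 'plain', 'utf-8')
  subject = 'NVIDIA GPU monitor'
  message['Subject'] = Header(subject, 'utf-8')

  try:
      # Requires a mail server listening on this machine; the with-block closes the connection
      with smtplib.SMTP('localhost') as smtpObj:
          smtpObj.sendmail(sender, receivers, message.as_string())
      print("Email sent successfully")
  except (smtplib.SMTPException, OSError):
      # OSError covers a refused connection when no local mail server is running
      print("Error: unable to send email")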


def watch_nvidia(nvidia_ids,min_memory):
  # Count the cards whose free memory exceeds min_memory (in GB).
  # Counting directly avoids indexing a list by GPU id, which breaks
  # when the ids are not exactly 0..n-1.
  free_num = 0
  for i in nvidia_ids:
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
    free_gb = meminfo.free / (1024**3)
    print("card {} free memory is {}GB".format(i,free_gb))
    if free_gb > min_memory:
      free_num += 1
  if free_num == 0:
    print("no free card!")
    return -1
  return free_num



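nvidia_ids = [0,1,2,3] # GPU ids to watch
min_memory = 8 # minimum free memory, in GB
while True:
  free_num = watch_nvidia(nvidia_ids,min_memory)
  # Proceed only when every watched card is free
  if free_num >= len(nvidia_ids):
    send_msg("[email protected]","{} GPUs are free; starting training automatically".format(free_num))
    os.system("sh veri.sh") # your command
    break
  time.sleep(10)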
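
The check above only reports how many cards are free. If the training command needs to know which cards are free, a variant can return the ids instead. The following is a minimal sketch, not part of the original script: the names free_gpu_ids, read_free_bytes and nvml_free_bytes are hypothetical, and the memory reader is passed in as a parameter so the selection logic can be tried on a machine without a GPU.

```python
GIB = 1024 ** 3  # bytes per GiB, the same unit conversion the script uses


def free_gpu_ids(gpu_ids, min_memory_gb, read_free_bytes):
    """Return the ids in gpu_ids whose free memory exceeds min_memory_gb.

    read_free_bytes is a hypothetical callable: it takes a GPU id and
    returns that card's free memory in bytes.
    """
    return [i for i in gpu_ids if read_free_bytes(i) / GIB > min_memory_gb]


def nvml_free_bytes(gpu_id):
    """Read free memory through pynvml (assumes pynvml.nvmlInit() was called)."""
    import pynvml  # imported lazily so the pure logic above works without NVML
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).free


if __name__ == "__main__":
    # Made-up readings for illustration only
    fake = {0: 10 * GIB, 1: 2 * GIB, 2: 10 * GIB, 3: 0}
    print(free_gpu_ids([0, 1, 2, 3], 8, fake.get))
```

On the server, passing nvml_free_bytes as the third argument would query the real cards; the dictionary here only stands in for them.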
    


    



Running

Save the script on the server as a .py file (watch_nvidia.py here) and run it in the background:

nohup python3 -u watch_nvidia.py > tmp.txt 2>&1 &

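The main purpose of this code is the email reminder. The os.system() call after send_msg() is where you put the command that starts your training job, so that training launches automatically as soon as the GPUs are free.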
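
The launch step can also be written with subprocess instead of os.system, which makes it easy to check whether the command succeeded and to restrict the job to specific cards through the CUDA_VISIBLE_DEVICES environment variable. This is a sketch under assumptions: launch_training is a hypothetical helper, and the command list is whatever starts your job (the script above uses sh veri.sh).

```python
import os
import subprocess
import sys


def launch_training(command, gpu_ids):
    """Run command with CUDA_VISIBLE_DEVICES set to gpu_ids; return its exit code."""
    env = dict(os.environ)
    # CUDA applications only see the devices listed in this variable
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    result = subprocess.run(command, env=env)
    return result.returncode


if __name__ == "__main__":
    # Stand-in command for illustration; replace with e.g. ["sh", "veri.sh"]
    code = launch_training(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        [0, 2],
    )
    print("exit code:", code)
```

A nonzero return value means the command failed, which could be reported in a second email rather than passing silently.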

Origin blog.csdn.net/qq_37668436/article/details/119605697